Notes from the WWW 2012 conference

Yves Raimond | 09:36 UK time, Thursday, 26 April 2012

Last week I attended the WWW 2012 conference in Lyon, France. This conference is probably the largest one in that space, with around 2500 participants and 15 parallel tracks. I presented two papers:

  • one focusing on the automated tagging algorithm we mentioned earlier on this blog;

  • one in the demo track, focusing on the various tools we built to process very large archives with this algorithm, and on applications we built using the resulting tags.

I also contributed to a panel with Peter Mika from Yahoo!, Ivan Herman from the W3C, and Tim Berners-Lee from MIT/W3C. The panel was entitled 'Microdata, RDFa, Web APIs, Linked Data: Competing or Complementary?' and looked at publishing statistics for structured data extracted from two datasets, one of them from Yahoo!, to try and understand which formats were used and for which use cases. One of the main messages from this panel concerned the uptake of structured web data: Yahoo! reports that 25% of all web pages contain one structured data format and 7% contain another.

WWW 2012, LDOW panel, day 1

From left to right: Peter Mika, Yves Raimond, Ivan Herman, Tim Berners-Lee. (c) Inria / picture T. Fournier

I thought I would write up my notes from the conference. Of course, I wasn't able to see everything, so the selection of papers below just reflects the presentations I attended. Given the general quality of the papers, I strongly suggest going through the full proceedings.

Linked Data on the Web workshop

I spent the first day of the conference in the Linked Data on the Web (LDOW) workshop. My personal highlights were the following papers:

  • A paper on aligning Named Entity Recognition services. As more and more online services for Named Entity Recognition become available, the framework it presents attempts to align them, providing a unified way of accessing their results as well as a way to compare them. Most of them seem to perform well in particular domains, so the best results could perhaps be obtained by combining several of them.
  • A position paper on provenance, describing how existing provenance work could be used to express provenance as Linked Data. One interesting aspect was the application of 'follow-your-nose' principles to provenance data. Some data could be marked as derived from another set of data, identified by a URI. Dereferencing that URI would in turn return further derivation information, ultimately leading to a full provenance trail for any derived data. This would be very useful for scientific datasets, but also for news articles, weather reports, etc.
  • A paper defining a read-write Linked Data architecture, apparently already in use in some IBM products.
  • A paper introducing the Mashpoint framework, which pivots selections of data (e.g. a list of countries) between web applications for data visualisation. Mashpoint looks like a very promising tool for data journalism.
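
The 'follow-your-nose' idea from the provenance paper can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the example.org URIs, the DERIVED_FROM table and the dereference helper are hypothetical, and a dictionary lookup stands in for the HTTP request a real client would make on each URI.

```python
# Hypothetical stand-in for the Web: each URI maps to the URIs it was derived from.
# A real client would fetch each URI over HTTP and parse the returned description.
DERIVED_FROM = {
    "http://example.org/report": ["http://example.org/cleaned-data"],
    "http://example.org/cleaned-data": ["http://example.org/raw-data"],
    "http://example.org/raw-data": [],
}


def dereference(uri):
    """Return the URIs that the given resource says it was derived from."""
    return DERIVED_FROM.get(uri, [])


def provenance_trail(uri):
    """Follow derivation links breadth-first and return (derived, source) pairs."""
    trail = []
    seen = {uri}
    queue = [uri]
    while queue:
        current = queue.pop(0)
        for source in dereference(current):
            # Each source is visited once, which also protects against cycles.
            if source not in seen:
                seen.add(source)
                trail.append((current, source))
                queue.append(source)
    return trail


trail = provenance_trail("http://example.org/report")
for derived, source in trail:
    print(derived, "was derived from", source)
```

Starting from a single URI, the loop only ever uses links found in the data itself, which is the point of the 'follow-your-nose' approach.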

AdMIRe and PhiloWeb workshops

On the second day I attended the AdMIRe workshop and the end of the PhiloWeb workshop. The former focused on advances in Music Information Retrieval, while the latter focused on the intersection of the Web and Philosophy.

  • A paper using content-based similarities between musical tracks to improve the quality of user tags on those tracks.

  • A paper comparing and combining a number of content-based features for the task of identifying different versions of the same musical work.

  • A paper that was particularly interesting in that it dealt with sound classification based on Mel-frequency cepstral coefficients (MFCCs), which we have used in BBC R&D for a couple of projects. Most sound similarity metrics using aggregates of MFCCs assume that their distribution is homogeneous. However, for a wide range of sounds the distribution of MFCCs fits a shifted power law, which means that a very small selection of frames can be enough to obtain similar performance. Perhaps using similarity measures which do not assume homogeneity could help take such biases towards particular combinations of coefficients into account?

  • A keynote describing a project with two particularly interesting aspects: it focuses solely on non-Western music, and it contributes directly to improving an existing resource, a bit like what we do for the BBC Music website.

  • A paper describing a very large-scale dataset for the evaluation of music recommendation algorithms, providing a wide range of data about a million songs.

I arrived quite late at the PhiloWeb workshop, but early enough to see a presentation about the formalism that provides most of the logical framework behind languages such as RDF. The workshop ended with a panel discussing various philosophical aspects of the Web. One of the biggest issues raised was the huge discrepancy between the 'normal' use of the Web (asynchronous JavaScript everywhere, many resources needed to construct any single web page) and the Semantic Web or 'purist' view of the Web.

Main conference - day 1

The main conference started on the Wednesday with a very inspiring keynote. The speaker tackled a number of very interesting topics, such as the design of new languages, the need for open mobile web applications and the issues around hierarchical systems. He finished his keynote by talking about what he called the 'three sides of privacy': personal data held by businesses, personal data leaks (and the so-called 'jigsaw effect') and privacy invasion. He concluded by asking the audience to spend 90% of their time building new things, but 10% of their time protecting the open Web infrastructure.

I attended the demo sessions all afternoon, where I was presenting our automated tagging framework. The keynote of this session described work on capturing a number of artworks from an international selection of museums. The presenters demonstrated the ability to look at specific parts of artworks in detail, their 'street view' for museums, and the creation of personal collections of artworks. They also mentioned that an API to access the data will be opened, and we'll certainly keep an eye out for that! Another team presented a personalised newscasts use case from their project in the same session, along with some archive-related work aimed at helping journalists find information in the news domain from their archive.

Main conference - day 2

Thursday started with a keynote from a member of the team behind Watson, which won the Jeopardy! quiz programme last year. Part of his keynote was spent describing the approach used for Watson, which is quite different from the traditional approach to automated question answering. Typically a question is translated into some formal language and the resulting query is executed on a large knowledge base. Watson never tries to understand the 'meaning' of the questions. Rather, it finds documents that could hold the answer and scores them on lots of dimensions. Then it learns the best combination of those scores based on previous Jeopardy! games. Semantic technologies in Watson are just used for some of these scores, not as a goal in themselves. They are nonetheless an important tool, as they bring a 10% performance boost.
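
To make the 'score on many dimensions, then learn the best combination' idea concrete, here is a minimal Python sketch. It is not Watson's actual method: I am assuming a plain logistic regression as the combiner, and the three evidence scores, the training examples and the candidate names are all made up for illustration.

```python
import math

# Hypothetical training data from past questions: each candidate answer has
# three made-up evidence scores in [0, 1] and a label (1 = correct, 0 = wrong).
TRAINING = [
    ([0.9, 0.8, 0.3], 1),
    ([0.8, 0.9, 0.6], 1),
    ([0.7, 0.6, 0.2], 1),
    ([0.2, 0.3, 0.7], 0),
    ([0.3, 0.1, 0.4], 0),
    ([0.1, 0.2, 0.8], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def confidence(features, weights, bias):
    """Combine the individual evidence scores into a single confidence value."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(examples, epochs=500, rate=0.5):
    """Learn one weight per evidence score with batch gradient descent."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            error = confidence(features, weights, bias) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(examples)
    return weights, bias


weights, bias = train(TRAINING)

# Hypothetical candidate answers for a new question, with their evidence scores.
candidates = {
    "candidate_a": [0.9, 0.7, 0.2],
    "candidate_b": [0.3, 0.2, 0.6],
    "candidate_c": [0.1, 0.3, 0.9],
}
ranked = sorted(
    candidates,
    key=lambda name: confidence(candidates[name], weights, bias),
    reverse=True,
)
print(ranked)
```

The individual scores never need to 'understand' the question. The learning step simply finds out which of them have been reliable indicators of a correct answer in the past.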

This keynote was followed by a panel, introduced by a keynote from a European Commission speaker. The panel was very good, with a lot of controversial questions being tackled, including the situation in France.

In the afternoon I attended the Entity Linking session. The LINDEN paper was presented first, describing a Named Entity Recognition technique that links entities to target identifiers. Candidate entities are generated, then disambiguated using a number of features, e.g. link probability (estimated using count information in the dictionary), semantic associativity (using the Wikipedia hyperlink structure), semantic similarity (derived from the YAGO taxonomy) and topical coherence of a document around the candidate entity. The approach was interesting, but the paper suggests that a big part of the algorithm relies on concepts extracted by Wikipedia-Miner, which provide some context for the disambiguation. It wasn't clear how LINDEN compares with that tool and whether it actually improved on the results first obtained by Wikipedia-Miner.
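
As a rough illustration of candidate disambiguation with such features, the following Python sketch estimates a link probability from count information and combines it with a second score. It is only a sketch under assumptions: the counts, the coherence values and the equal weights are all hypothetical, and only two of the features mentioned above are represented.

```python
# Hypothetical dictionary: how often a surface form links to each entity.
LINK_COUNTS = {
    "jaguar": {
        "Jaguar_Cars": 60,
        "Jaguar_(animal)": 30,
        "Jacksonville_Jaguars": 10,
    },
}


def link_probability(surface_form):
    """Estimate P(entity | surface form) from the link counts."""
    counts = LINK_COUNTS.get(surface_form.lower(), {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


def disambiguate(surface_form, coherence, w_link=0.5, w_coherence=0.5):
    """Rank candidates by a weighted sum of link probability and coherence.

    `coherence` maps each candidate to a hypothetical topical coherence score
    for the current document. The weights are hand-set assumptions here.
    """
    priors = link_probability(surface_form)
    scores = {
        entity: w_link * prior + w_coherence * coherence.get(entity, 0.0)
        for entity, prior in priors.items()
    }
    return max(scores, key=scores.get), scores


probabilities = link_probability("Jaguar")

# Made-up coherence scores for a document about wildlife.
wildlife_coherence = {
    "Jaguar_Cars": 0.2,
    "Jaguar_(animal)": 0.9,
    "Jacksonville_Jaguars": 0.1,
}
best, scores = disambiguate("Jaguar", wildlife_coherence)
print(best, scores)
```

On its own the link probability would pick the most frequently linked entity. The context score is what lets a less common reading win when the document supports it.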

The second paper was about generating cross-lingual links between Wikipedia pages. A significant number of Wikipedia pages are lacking cross-lingual links, as everything is currently done manually. The algorithm presented in this paper exploits the fact that articles linked to or from equivalent articles tend to be equivalent themselves.
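
The underlying intuition can be sketched in a few lines of Python. This is not the paper's algorithm, just a toy version of the idea: the article names, link sets and seed equivalences are hypothetical, and the score is a simple count of already-equivalent neighbours.

```python
# Hypothetical outgoing links of a few English and French articles.
EN_LINKS = {
    "en:Lyon": {"en:France", "en:Rhone"},
    "en:Paris": {"en:France", "en:Seine"},
}
FR_LINKS = {
    "fr:Lyon": {"fr:France", "fr:Rhone"},
    "fr:Paris": {"fr:France", "fr:Seine"},
}

# Cross-lingual links that already exist and can be used as seeds.
KNOWN_EQUIVALENT = {
    ("en:France", "fr:France"),
    ("en:Rhone", "fr:Rhone"),
    ("en:Seine", "fr:Seine"),
}


def shared_equivalent_neighbours(en_article, fr_article):
    """Count the known equivalent pairs linked from both articles."""
    return sum(
        1
        for en_neighbour, fr_neighbour in KNOWN_EQUIVALENT
        if en_neighbour in EN_LINKS[en_article]
        and fr_neighbour in FR_LINKS[fr_article]
    )


def best_match(en_article):
    """Pick the French article sharing the most equivalent neighbours."""
    return max(
        FR_LINKS,
        key=lambda fr_article: shared_equivalent_neighbours(en_article, fr_article),
    )


match = best_match("en:Lyon")
print(match)
```

Each newly accepted link could be added to the seed set, so that later candidates benefit from earlier decisions.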

The final paper of the session used probabilistic reasoning to combine automated and manual work (the latter done through a crowdsourcing service, which came up a lot during the conference for user evaluations) for an RDFa enrichment task.

The last session I attended that day was specifically about Semantic Web technologies. One paper argued that the semantics of certain constructs need to be changed (it also got the best paper award at the conference). Another tackled question answering in a very different way to what IBM Watson is doing, translating full-text queries into SPARQL queries. A third paper completed the session.

Main conference - day 3

I attended the EU track on the Friday morning, where current EU projects were showcased, including one tracking entities through time in Web archives and another making use of the social web to identify Web documents to archive.

Finally, I attended the Web Mining session in the afternoon. This session included three very interesting papers. The first one started from the basis that 'real stories are not linear' and described a corresponding algorithm for news stories. The second one tried to address the ambitious goal of predicting future events. Their system gathered a wide range of Linked Data and news articles, extracted causal links between the different events described within them, and tried to generalise such causal links. Then, given a particular input event, these generalised links can be used to predict future events, e.g. "China overtakes Germany as world's biggest exporter" is used by their system to predict "wheat price will fall". The last paper mined the Google news archive, holding several articles per day since 1895, and derived statistics about how long people stay famous in the news. Apparently, the median duration of a person being famous in the news has consistently been 7 days for the last century.
