大象传媒

Muddy Boots

Jonathan Austin | 12:00 UK time, Wednesday, 10 December 2008

We've been experimenting with a prototype called Muddy Boots. The question that we're trying to answer is: can a computer reliably identify the people and organisations in news stories? This is still a work in progress, but we have a prototype and an API that you're welcome to explore.

When a journalist refers to someone in a news story they usually give the person's full name and enough information so the reader can understand who they are talking about. If the full name is ambiguous they may have to add a title or give an explanation about who the person is. But sometimes, especially for household names, the reader is expected to infer the identity of the person from the context of the story and by applying a reasonable level of background knowledge.

Whilst a human reader takes for granted their abilities to pick up journalists' cues and understand context, a computer has to be programmed explicitly. It is difficult to design a system that can identify people from text and disambiguate them. It is even harder to build a system that meets editorial standards of accuracy. However, in theory, it should be possible. So we've been experimenting to develop an approach that could lead to a system that reliably identifies people (and organisations) in stories and marks up their textual names with semantic information. There are four key challenges:

Build working prototypes
Write tests for the prototypes that express editorial standards
Refine the prototypes to reach defined levels of reliability
Express the information usefully through semantic mark-up

Prototypes

The prototypes are ready to share with you. They have been built for us by a company called Rattle, based in Sheffield. Rattle was a successful participant in Innovation Labs.

There are two systems available. They are both based on DBpedia (the structured version of Wikipedia), which provides the controlled vocabulary of people and organisations. In these prototypes, therefore, each person in a news story is described by their Wikipedia entry. Wikipedia is potentially a good source of controlled vocabulary for news because it has wide scope and is open and dynamic. It is certainly useful for prototyping.

The first method is called "Muddy". It works by extracting proper names from the story text and then matching them to entries in DBpedia. If a term is ambiguous, the system uses various strategies based on Wikipedia's disambiguation pages and the structure of DBpedia to resolve the conflict. More information can be found on Rattle's website.
The second method is called "conText". It was initially proposed by Chris Sizemore and is described in detail in his blog post. This method uses search technology (Google and Lucene) to enhance the results further.
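
To give a feel for the first approach, here is a minimal sketch of "extract proper names, match them to a vocabulary, then disambiguate". It is not Rattle's implementation. The `VOCABULARY` table is a tiny hand-made stand-in for DBpedia, the `example:` identifiers are invented for illustration, and counting context words that overlap with the story is just one simple disambiguation strategy among many.

```python
import re

# Hypothetical stand-in for DBpedia: name -> candidate entries.
# Each candidate lists a few context words used to choose between them.
VOCABULARY = {
    "Gordon Brown": [
        {"uri": "example:Gordon_Brown_(politician)",
         "context": {"prime", "minister", "government", "labour"}},
        {"uri": "example:Gordon_Brown_(rugby_player)",
         "context": {"rugby", "scotland", "cap"}},
    ],
    "Bank of England": [
        {"uri": "example:Bank_of_England",
         "context": {"bank", "interest", "rates"}},
    ],
}


def extract_candidates(text):
    """Return runs of capitalised words, optionally joined by 'of'."""
    return re.findall(r"[A-Z][a-z]+(?:\s+(?:of\s+)?[A-Z][a-z]+)*", text)


def identify(text):
    """Map each known name found in the text to its best-matching entry."""
    story_words = set(re.findall(r"[a-z]+", text.lower()))
    found = {}
    for run in extract_candidates(text):
        tokens = run.split()
        # Try every contiguous span of the run, so that "Gordon Brown"
        # is still found inside "The Prime Minister Gordon Brown".
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                name = " ".join(tokens[i:j])
                if name in VOCABULARY:
                    # Pick the candidate sharing the most context words
                    # with the story; ties go to the first candidate.
                    best = max(
                        VOCABULARY[name],
                        key=lambda c: len(c["context"] & story_words),
                    )
                    found[name] = best["uri"]
    return found


print(identify("The Prime Minister Gordon Brown met the Bank of England governor."))
print(identify("Gordon Brown won his first Scotland cap in rugby."))
```

The same name resolves to a different entry in each story because the surrounding words differ, which is the kind of cue a human reader picks up without noticing.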

The good news for anyone who is not an expert in term or knowledge extraction is that Rattle implemented both methods behind a common abstract API. In effect, we can treat both methods as black boxes. We don't need to know how they work in order to use them and evaluate their ability to identify people.
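
The black-box idea can be sketched in a few lines. The post does not describe the shape of Rattle's API, so everything here is an assumption made for illustration: the `Entity` record, the `Identifier` interface with its single `identify` method, and the two toy systems that implement it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Entity:
    name: str  # the name as it appears in the story
    uri: str   # identifier of the vocabulary entry (hypothetical format)


class Identifier(Protocol):
    """Anything with an identify() method counts as a system under test."""

    def identify(self, text: str) -> list[Entity]: ...


class LookupIdentifier:
    """Toy system: reports every known name that occurs in the text."""

    def __init__(self, table: dict[str, str]) -> None:
        self.table = table

    def identify(self, text: str) -> list[Entity]:
        return [Entity(n, u) for n, u in self.table.items() if n in text]


class NullIdentifier:
    """Toy system that never finds anyone; a useful baseline."""

    def identify(self, text: str) -> list[Entity]:
        return []


def count_entities(system: Identifier, stories: list[str]) -> int:
    """Works for any system, without knowing how it finds entities."""
    return sum(len(system.identify(story)) for story in stories)


stories = ["Gordon Brown spoke today.", "No names here."]
lookup = LookupIdentifier({"Gordon Brown": "example:Gordon_Brown"})
print(count_entities(lookup, stories))
print(count_entities(NullIdentifier(), stories))
```

Because `count_entities` only depends on the interface, a third system could be dropped in and measured without changing any of the evaluation code.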

In addition, Rattle implemented some visualisations so that we can get a feel for how the systems work. Below are some sample stories that have had people and organisations identified. You can also submit additional stories by following the final link.

Testing

It doesn't take long to see that neither prototype is perfect. Sometimes they miss people and sometimes they get them wrong. But that is the point of the research. How good are they really and can they be improved? Our next step is to measure them against our editorial standards.

So we are currently working with another Innovation Labs entrant to develop some tests. We're going to compare the performance of both systems (and of any system that implements Rattle's API) with the performance of human beings. We will also be proposing measures that evaluate the systems from an editorial point of view. For example, is it editorially more acceptable for the system to miss the name of a cat owner whose cat gets stuck in a tree than to miss the name of the Prime Minister? And what should the system do when the name of that cat owner is Gordon Brown?
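
As a rough illustration of what such measures might look like, the sketch below compares a system's output with a human-annotated list using standard precision and recall, and then adds a weighted recall in which missing a prominent person costs more than missing a minor one. The weights and the `example:` identifiers are invented, and this is not the measure we will end up proposing, only one way to make the editorial question concrete.

```python
def precision_recall(predicted, expected):
    """Compare system output with human annotations (both as collections of ids)."""
    predicted, expected = set(predicted), set(expected)
    hits = predicted & expected
    # By convention, an empty prediction or an empty reference scores 1.0.
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(expected) if expected else 1.0
    return precision, recall


def weighted_recall(predicted, expected_weights):
    """Recall where each expected entity carries an editorial weight."""
    predicted = set(predicted)
    total = sum(expected_weights.values())
    if total == 0:
        return 1.0
    found = sum(w for entity, w in expected_weights.items() if entity in predicted)
    return found / total


# Hypothetical annotations: missing the Prime Minister is judged
# ten times worse than missing the cat owner.
human = {"example:Prime_Minister": 10, "example:Cat_Owner": 1}

system_a = ["example:Prime_Minister"]  # misses the cat owner
system_b = ["example:Cat_Owner"]       # misses the Prime Minister

for name, output in [("A", system_a), ("B", system_b)]:
    p, r = precision_recall(output, human)
    print(name, p, r, round(weighted_recall(output, human), 3))
```

Both systems have the same plain recall, since each finds one of the two people, but the weighted score separates them sharply. That gap is the editorial judgement expressed as a number.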

We will post more about this and our initial findings in the New Year. In the meantime, we'd like to hear your thoughts, so feel free to have a look at the API and the prototypes.
