Rapid search of media archives using subtitle keyword search and other techniques.
Project from 2011 - present
What we are doing
Finding programmes is easy, but finding content within those shows is hard.
大象传媒 Snippets allows users to quickly search the content of programmes using precise keyword searches. Results can be filtered by genre, date and other criteria. When you've found the programme you want, you can move around it using 'clickable transcripts' (click on a word to jump to that point in the video) and a range of visual navigation tools.
Why it matters
Opening up the archive is one of the biggest challenges we face.
Even when digitisation is complete and rights have been cleared, the problem still remains: how do you find what you're looking for amongst hundreds of thousands of hours of programmes?
This is our challenge. We believe that the 大象传媒 Archive will only achieve its full potential when the task of finding what you want has been reduced to a few seconds' effort.
Our Goals
Our first goal is to explore and develop automated methods for producing high-quality metadata. We're focusing our effort on seven key data types:
- Programme details (Title, description, transmission dates, genre, format etc.)
- Full Transcript (time aligned to the second, with full speaker diarisation)
- Intelligent Chapters (with key terms and concepts extracted and linked to useful resources like Wikipedia)
- Actor/Presenter/Contributor segmentation down to the level of the individual scene
- Geolocation of scenes
- Meaningful objects tagged or identifiable via search
- Background music identified
Our second goal is to develop easy-to-use tools to navigate this metadata. Because the output of automated metadata generation is often inaccurate (think speech-to-text), the tools must be designed with this inaccuracy in mind.
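One way to design for inaccurate transcripts is to tolerate near-misses when matching a search term. The sketch below is only an illustration of that idea, not a description of how Snippets does it: the function name `fuzzy_find` and the 0.8 similarity cutoff are assumptions, and the matching relies on Python's standard `difflib` module.

```python
import difflib


def fuzzy_find(query, transcript_words, cutoff=0.8):
    """Return transcript words that are close to the query (hypothetical helper).

    A high cutoff keeps obvious mis-transcriptions such as "parliment"
    while rejecting words that merely share a few letters.
    """
    # Deduplicate and lowercase so matching is case-insensitive.
    vocabulary = sorted({word.lower() for word in transcript_words})
    return difflib.get_close_matches(query.lower(), vocabulary, n=5, cutoff=cutoff)


if __name__ == "__main__":
    words = ["The", "parliment", "president", "spoke"]
    print(fuzzy_find("parliament", words))
```

A tool built this way can still surface a scene when speech-to-text has misspelled the word the user typed, at the cost of occasionally returning a near neighbour.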
Outcomes
Snippets is currently being trialled by a number of production teams in 大象传媒 Vision and 大象传媒 News. In addition it's being looked at by and Media Monitoring services authorised by the .
It's also being considered as one of the components of the 大象传媒's Research & Education Space initiative. In addition we're working with a number of universities on image recognition projects using the Snippets Framestore. We welcome any enquiries from academics.
How it works
Snippets is built on the Redux video archive, which contains over 300,000 大象传媒 TV and radio programmes available in a variety of formats. Because Redux captures raw Freeview, we get the subtitle data for each programme. This is extracted using software and indexed in a database.
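As a rough illustration of indexing subtitles for keyword search, the sketch below stores timed subtitle lines in an in-memory SQLite database and looks up a keyword to find the programme and the second at which it was spoken. The table layout, the function names `build_index` and `search`, and the choice of SQLite are all assumptions made for the example; the actual Snippets database is not described here.

```python
import sqlite3


def build_index(cues):
    """Index subtitle cues given as (programme_id, start_seconds, text) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cue (programme_id TEXT, start_seconds INTEGER, text TEXT)")
    conn.execute("CREATE TABLE word (term TEXT, cue_id INTEGER)")
    conn.execute("CREATE INDEX word_term ON word (term)")
    for programme_id, start_seconds, text in cues:
        cursor = conn.execute(
            "INSERT INTO cue VALUES (?, ?, ?)", (programme_id, start_seconds, text)
        )
        cue_id = cursor.lastrowid
        # Normalise each word: strip surrounding punctuation and lowercase it.
        terms = {w.strip(".,!?;:'\"").lower() for w in text.split()}
        conn.executemany(
            "INSERT INTO word VALUES (?, ?)", [(t, cue_id) for t in terms if t]
        )
    return conn


def search(conn, keyword):
    """Return (programme_id, start_seconds, text) for cues containing the keyword."""
    rows = conn.execute(
        "SELECT cue.programme_id, cue.start_seconds, cue.text "
        "FROM word JOIN cue ON cue.rowid = word.cue_id "
        "WHERE word.term = ? "
        "ORDER BY cue.programme_id, cue.start_seconds",
        (keyword.lower(),),
    )
    return rows.fetchall()


if __name__ == "__main__":
    index = build_index([
        ("prog-a", 12, "Welcome to the archive."),
        ("prog-b", 3, "Archive footage of the moon"),
    ])
    for hit in search(index, "archive"):
        print(hit)
```

Because each hit carries a start time, the same lookup could drive a clickable transcript: the player simply seeks to `start_seconds` for the chosen line.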
Snippets then matches the Redux programme to its equivalent broadcast in the 大象传媒's . This cross-matching gives us additional metadata such as genre, format and cast info.
Programmes are then processed to produce a series of 480 x 270 pixel screengrabs at a granularity of one per second. A variety of visual scanning tools use these screengrabs to help users navigate programmes. As we have over 500,000,000 images containing a large number of faces, objects and landmarks, these grabs provide a rich dataset for research.
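The tool used to produce the screengrabs is not named here. Purely as an assumption, the sketch below shows how one 480 x 270 grab per second could be requested from ffmpeg using its `fps` and `scale` video filters. The helper names, the file naming pattern and the input filename are hypothetical.

```python
from pathlib import Path


def frame_grab_command(source, out_dir, width=480, height=270, per_second=1):
    """Build an ffmpeg command line (assumed tool) that writes numbered JPEG grabs."""
    pattern = str(Path(out_dir) / "%06d.jpg")
    return [
        "ffmpeg",
        "-i", str(source),
        # fps picks the sampling rate; scale resizes each sampled frame.
        "-vf", f"fps={per_second},scale={width}:{height}",
        pattern,
    ]


def grab_for_second(out_dir, second):
    """Map a playback second to its grab, assuming numbering starts at 000001.jpg."""
    return Path(out_dir) / f"{second + 1:06d}.jpg"


if __name__ == "__main__":
    print(" ".join(frame_grab_command("programme.ts", "grabs")))
    print(grab_for_second("grabs", 75))
```

With one grab per second, a visual navigation tool can turn any timestamp into a filename with simple arithmetic, although the exact alignment between grab number and second depends on how the extraction tool rounds frame times.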
These three elements are the basic building blocks of the Snippets web app. We've also developed tools for snipping, sharing and transcoding, and built APIs for many of these functions.