BBC R&D work with the BBC World Service to create Radio Archive
With 50,000 radio programmes spanning 45 years, creating a useful archive was always going to be a challenge. But how do you archive programmes when you are not even sure what they are about? Tristan Ferne explains how the BBC Research & Development team overcame this exact challenge.
One of the major challenges of a big digitisation project is that you simply swap out an under-used physical archive for its digital equivalent. Without easy ways to navigate the archive there’s no way for your users to get to the bits they want. BBC R&D recently worked with the BBC World Service to generate metadata for their radio archive: first using algorithms to generate enough data to put the archive online, then using crowd-sourcing to improve it.
Starting in 2005 the BBC World Service digitised the contents of its “Recorded Programme” library. This contained over 50,000 radio programmes from the past 45 years, covering a wide range of subjects from events in Sierra Leone as they happened to interviews with Steven Spielberg. The digitisation was a great success, but although we had all the programmes in high-quality audio, the descriptive metadata was limited in both quantity and quality. Metadata is what describes digital items; without it, nothing can be found. So although we might have a programme title and broadcast date, we didn’t really know what each programme was about.
Our challenge was to generate data from what we did have – the audio from the radio programmes. We used an open-source speech recognition toolkit to listen to every programme and convert it to text. Speech recognition can be very good, but on these radio programmes, with varying recording qualities and many voices and accents from around the world, it really struggled for accuracy. We ended up with a lot of rather error-ridden transcripts.
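Transcript accuracy like this is usually measured with word error rate (WER): the number of word insertions, deletions and substitutions needed to turn the recogniser’s output into the true transcript, divided by the length of the true transcript. As a minimal illustration (the function and example sentences below are ours, not the project’s), WER can be computed with a standard edit-distance calculation over words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

# Two substituted words out of five:
print(wer("the world service radio archive",
          "the world surface radio archives"))  # → 0.4
```

A WER of 0.4 means roughly two words in five are wrong – noisy, but as the project found, still usable as raw material for keyword extraction.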
We then developed new algorithms that could extract keywords or tags from these noisy transcripts. We used Linked Data to provide unambiguous tags, and every tag in our database is linked to a Wikipedia page. For every programme, even if there was no metadata to start with, we generated around 20 new tags. This was a lot of media to process – about 26,000 hours of audio. Doing this processing on a single computer would have taken around 36,000 hours, but by using the cloud we processed it on many machines in parallel in two weeks for a cost of around $3000.
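The project’s real tag extraction used Linked Data to disambiguate terms against Wikipedia; the sketch below is a much simpler, purely frequency-based illustration of pulling candidate keywords out of a noisy transcript (the stopword list, function name and thresholds are all invented for this example):

```python
from collections import Counter

# A tiny stopword list for illustration; a real system would use a fuller one
STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "it", "on", "was", "for"}

def extract_tags(transcript: str, n: int = 5) -> list[str]:
    """Return the n most frequent non-stopword terms as candidate tags."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

transcript = ("Cricket cricket cricket in Sierra Leone, "
              "Sierra Leone election coverage")
print(extract_tags(transcript, 3))  # → ['cricket', 'sierra', 'leone']
```

A real pipeline would then map each candidate term to an unambiguous Linked Data identifier (so that, say, “cricket” the sport is distinguished from “cricket” the insect), which is where the link to Wikipedia pages comes in.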
Crowd-sourcing to solve a problem
The computer-generated tags weren’t always correct and we couldn’t go through them all by hand, but they were good enough to bootstrap an online archive. And then we could ask people to help correct and add to these topics – crowdsourcing the problem.
We launched the prototype in 2012. As well as listening to programmes in the archive, users can view the automatic tags, vote on whether they’re correct or incorrect, or add completely new tags. They can also edit programme titles and synopses, select appropriate images and name the voices heard.
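One simple way votes like these could be aggregated is a net-vote threshold: a tag is confirmed once enough more people have voted it correct than incorrect, and rejected in the opposite case. This is a hypothetical rule for illustration – the prototype’s actual aggregation logic isn’t described here, and the function name and threshold are ours:

```python
def tag_status(upvotes: int, downvotes: int, threshold: int = 2) -> str:
    """Classify a tag from crowd votes using a simple net-vote threshold."""
    net = upvotes - downvotes
    if net >= threshold:
        return "confirmed"
    if net <= -threshold:
        return "rejected"
    return "pending"  # not enough agreement yet

print(tag_status(5, 1))  # → confirmed
print(tag_status(0, 3))  # → rejected
print(tag_status(1, 0))  # → pending
```

The appeal of a threshold rule is that no single user can confirm or remove a machine-generated tag on their own, which matters when the starting tags are known to be error-prone.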
Since we launched we have had over 70,000 tag edits from our users, 1,300 text edits and 1,600 voices labelled. About 26% of the available programmes have been tagged and 41% have been listened to at least once. We were even sent four “lost” programmes that listeners had taped off air!
Humans and computers can separately try to solve some of the problems of putting large media archives online, but human cataloguing is very expensive and time-consuming, and computer-generated information can be error-prone. We believe that the combination of the two can be a more accurate, cheaper and more scalable solution. We are now working on COMMA, a project to build a re-usable cloud platform for processing large media archives. Please get in touch to find out more.