Synthetic voices study: How do you feel about an artificial announcer with an accent?

Barbara Zambrini

Senior Producer

Published: 13 February 2020

Back in January 1927, Professor Pear ran an experiment on the perception of voices on radio to understand how people responded to disembodied voices. Voices were presented on all BBC radio stations, and 5,000 people provided feedback using .

Ninety years later, how do audiences in 2020 feel about the use of synthetic voices in different types of media content ranging from national news to entertainment? Do regionality and gender affect the way BBC content can be perceived if delivered by synthetic voices? How would that make people feel?

BBC Research & Development have just launched an online study called Synthetic Voice and Personality, which tests several bespoke synthetic voices with British regional accents on a wide public audience.

The study explores the ways synthetic voices could be used in different media contexts in the future and is part of our ongoing research into new forms of content driven by synthetic voices. It is a collaboration between BBC R&D, the , BBC Science, BBC Radio 4 and with the expertise of the BBC's Voice + AI team.

We consider this a follow-up to Professor Pear's original experiment, and its results will be covered in a BBC Radio 4 programme later this year, which will revisit the two tests 90 years apart. To our knowledge, there are no other studies of this scale on the perception of regional accents in relation to synthetic voices. There are only a few published scientific studies into the perception of synthetic voices, and none on UK accents.

The study will run for another few weeks, during which time participants can listen to a range of audio samples from male and female synthetic voices we created solely for this study, with a range of regional accents from across the UK.

Research objectives

The study will answer some research questions around the voices people prefer in the examples we present to them. We are working closely with from the Department of Acoustic Engineering at Salford University to ensure the study is academically rigorous.

With this study, we want to explore the following:

Regional accents
Tone of voice
Context of use (the type of content attached to specific voices)
Perception of synthetic voices (what people think and how it makes them feel)

This experiment is part of broader research we in BBC R&D are conducting on new forms of interactive conversation and voice experiences. The data gathered from the study will be analysed, and the insights will form the basis of the Radio 4 programme that will be aired in summer.

As with all new R&D research, this is meant to start digging into an area to hopefully find a gold nugget - there is no expectation those insights will all lead to drastic changes. It is about sharing insights and making people aware of users' feedback. However, the BBC's Voice + AI team, who are currently building the BBC's voice assistant, are taking a keen interest in this study, and will be looking at how the results might inform how the BBC builds its voice services in the future.

Creating the voices - the process

R&D collaborated with BBC staff from all regional radio stations, local news teams, and the technology division to find volunteers with distinctive regional accents willing to record their voice for us. For the purpose and feasibility of the study, 12 regions were chosen, with a male and a female voice for each. As a result, we generated 24 synthetic voices.

Participants will have access to pre-recorded audio files of each of those 24 voices as part of the online study. It will not be possible to modify or synthesise speech in real time.

Design approach

We aimed to design a compelling experience that allows participants to interact with the synthetic voices. During the study, users listen to different voices with a variety of accents, including ones that are similar to their own. We ask a series of questions to determine the voice they would prefer in different contexts. For example, would they prefer a voice similar to their own to read the local news? This proved to be an interesting design challenge as the voices that we present to participants need to be randomised throughout the study.
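One way to approach the randomisation described above is sketched below in Python. This is an illustration only, not the code used in the study: the region labels, the `pick_voices` function and the choice of four voices per question are all assumptions made for the example.

```python
import random

# Hypothetical region labels; the study used 12 regions, but these names are placeholders.
REGIONS = [f"region_{i:02d}" for i in range(1, 13)]
GENDERS = ["female", "male"]

# 12 regions x 2 genders = 24 voices, as in the study.
VOICES = [(region, gender) for region in REGIONS for gender in GENDERS]


def pick_voices(own_region, count=4, rng=None):
    """Return `count` voices in random order, exactly one of them from own_region."""
    if rng is None:
        rng = random.Random()
    own = rng.choice([v for v in VOICES if v[0] == own_region])
    others = rng.sample([v for v in VOICES if v[0] != own_region], count - 1)
    selection = [own] + others
    rng.shuffle(selection)  # so the familiar accent is not always played first
    return selection


if __name__ == "__main__":
    # Seeding the generator makes a participant's sequence reproducible.
    for voice in pick_voices("region_03", rng=random.Random(2020)):
        print(voice)
```

Passing in a seeded generator is a design choice for the sketch: it keeps each participant's order random while still allowing the same sequence to be regenerated when analysing responses.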

We also designed the study to ensure people with visual impairments can take part. It should take 10 to 15 minutes to complete.

Technical approach

For the creation of the synthetic voices, we used an open-source text-to-speech machine learning model - a modified version of DC TTS, which is derived from the paper ''.

We have used BBC subtitles to create a phonetically balanced text corpus, a specially designed set of phrases covering the majority of phonemes and phoneme combinations in the English language. This acted as a script for the people who recorded their voices for us.

Each contributor took an average of 3 hours to read out the script of 22,915 words. Each recording was then used as data to train a machine learning model. This is a demanding computational task - it takes around 16 hours to generate a synthesised voice that can then be used to generate new utterances using that voice. Some post-processing, including EQ, is done on the voice recordings to make them sound less metallic or robotic and to remove other audio artefacts.
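For a sense of scale, the figures above can be combined in a quick back-of-the-envelope calculation. The per-voice numbers come from the text. The totals assume that all 24 voices took the average time and were processed one after another, which the article does not state.

```python
# Figures quoted in the article.
SCRIPT_WORDS = 22_915          # words in the recording script
RECORDING_HOURS = 3            # average reading time per contributor
COMPUTE_HOURS_PER_VOICE = 16   # approximate time to generate one synthetic voice
VOICE_COUNT = 24               # 12 regions x 2 genders

# Average reading pace implied by the script length and recording time.
words_per_minute = SCRIPT_WORDS / (RECORDING_HOURS * 60)

# Assumption: every voice took the average time and was processed sequentially.
total_recording_hours = VOICE_COUNT * RECORDING_HOURS
total_compute_hours = VOICE_COUNT * COMPUTE_HOURS_PER_VOICE

print(f"Reading pace: about {words_per_minute:.0f} words per minute")
print(f"Total recording time: {total_recording_hours} hours")
print(f"Total compute time: {total_compute_hours} hours")
```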

We wanted to explore ways of creating a diverse set of voices faster and more cheaply while keeping a reasonable level of quality. It is a future-facing proof of concept demonstrating the level of quality that can be achieved in a very short amount of time.

We took subtitles in the English language from the BBC archives and automatically transcribed them phonetically using the . From that, we were able to work out all the common combinations of phonemes - the different sounds made when pronouncing words - that would need to appear in a text intended to be recorded as training for a synthetic voice.

We then searched for each combination of phonemes in the subtitles from the BBC archive, identifying a sentence in which they appear and adding it to a script. In total, that gave us a script of a little over 1,000 sentences that cover the most common sounds in the English language proportionally. Each of our contributors needed to read this corpus for us to be able to make their synthetic voice counterpart say anything.
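The selection step can be sketched as a greedy coverage search. The Python below is a minimal illustration, not the code released by R&D. The tiny `PRONUNCIATIONS` dictionary, the phoneme symbols and the use of adjacent phoneme pairs as the "combinations" are all assumptions standing in for the real phonetic transcription of the subtitles. The sketch also ignores how often each combination occurs, so it does not reproduce the proportional coverage mentioned above.

```python
# Hypothetical pronunciation lookup; a real system would use a phonetic transcription tool.
PRONUNCIATIONS = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "on": ["AA", "N"],
    "mat": ["M", "AE", "T"],
    "a": ["AH"],
    "dog": ["D", "AO", "G"],
    "ran": ["R", "AE", "N"],
}


def phoneme_pairs(sentence):
    """Return the set of adjacent phoneme pairs found within the words of a sentence."""
    pairs = set()
    for word in sentence.lower().split():
        phones = PRONUNCIATIONS.get(word, [])
        pairs.update(zip(phones, phones[1:]))
    return pairs


def build_script(sentences):
    """Greedily pick sentences until every phoneme pair in the corpus is covered."""
    remaining = set().union(*(phoneme_pairs(s) for s in sentences))
    script = []
    while remaining:
        # Choose the sentence that covers the most still-missing pairs.
        best = max(sentences, key=lambda s: len(phoneme_pairs(s) & remaining))
        script.append(best)
        remaining -= phoneme_pairs(best)
    return script


if __name__ == "__main__":
    subtitles = ["the cat sat", "a dog ran", "the cat sat on the mat", "the dog sat"]
    for line in build_script(subtitles):
        print(line)
```

A greedy search tends to keep the script short, which matters here because every sentence added to the script is one more sentence each contributor has to read aloud.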

The system was originally trained on a "base" voice with a huge corpus (approximately 24 hours of voice recordings). Then, for each voice that we added, we were able to use a smaller audio sample (approximately 3 hours) of the specially designed corpus created by R&D. This means each of our synthetic voices needed less recording time and could be trained in less time, which is a form of transfer learning.
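The idea of starting from the base voice rather than from scratch can be illustrated with a deliberately tiny example. The Python below is not the DC TTS training code: it fits a straight line instead of a speech model, and every name and number in it is invented for the illustration. It only shows why a model that starts from already-trained parameters can fit a similar target with less data and fewer training steps.

```python
def train(xs, ys, w, b, steps, lr=0.05):
    """Fit y = w*x + b by gradient descent on mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def loss(xs, ys, w, b):
    """Mean squared error of the line (w, b) on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Stand-in for the "base" voice: a large dataset and a long training run.
base_xs = [i / 10 for i in range(-50, 51)]
base_ys = [2.0 * x + 1.0 for x in base_xs]
base_w, base_b = train(base_xs, base_ys, 0.0, 0.0, steps=500)

# Stand-in for a new voice: a small dataset with a similar but not identical target.
new_xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
new_ys = [2.2 * x + 0.9 for x in new_xs]

# Same short training budget, two different starting points.
fine_tuned = train(new_xs, new_ys, base_w, base_b, steps=10)
from_scratch = train(new_xs, new_ys, 0.0, 0.0, steps=10)

print("fine-tuned loss:  ", loss(new_xs, new_ys, *fine_tuned))
print("from-scratch loss:", loss(new_xs, new_ys, *from_scratch))
```

With the same ten training steps, the run that starts from the base parameters ends with a lower error than the run that starts from zero, because most of what it needs has already been learnt.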

October 2020 update

Since the launch of the study back in February, we have been looking back at the process of creating it and feel there is value in sharing the journey and our learnings. We want to let people have a peek behind the scenes to get a better understanding of all the considerations made in the process of creating those voices and the online study itself.

We have written a series of articles about the process of creating the voices, the challenges and the UX solutions we found. Our findings describe why we wanted to do this study and why we created it this way. We describe the tools available on the market, the pros and cons and the reasoning behind the choices we made.

The series takes you through the project from start to finish. It goes deep into the technical detail (an important aspect of the process) while remaining an engaging read for people who aren't technically savvy.

The articles also point to the repository where we released our code for the wider tech community to benefit from it.

These are learning resources for anybody interested in creating artificial voices - or simply curious to learn more about what a synthetic voice actually is. Regardless of how tech-savvy you are, there is something to be learnt here.

What next?

This study is not the end of our work in this field - it builds on our existing knowledge and expertise, and the technical and UX work is a good foundation for the future. The BBC will contribute to the wider literature in this field, as there is only a small amount of published work on HCI and regional accents in different countries. Additionally, as previously mentioned, the BBC Voice + AI team will be looking at the results of the study to see how they might inform our voice products and projects in the future.
