https://www.panoramaaudiovisual.com/en/2022/10/25/subtitulado-automatico-bilingue-la-idea-es-sencilla-la-solucion-no-tanto/

RTVE Bilingual Automatic Subtitling

Offering the viewer automatic live subtitling is not new, but can artificial intelligence technologies be trusted enough to subtitle programs in two languages? Carmen Pérez Cernuda, deputy director of the innovation and technological strategy area at RTVE, sheds light on this topic.

Thanks to RTVE's extensive territorial structure, with a center in each of the autonomous communities, it is possible to bring citizens closer to the news from their immediate surroundings. To this end, in addition to radio programs, TVE produces two daily news bulletins in each of its territorial centers, which are broadcast simultaneously on DTT through regional opt-outs, one in the morning and another in the early afternoon.

And it is precisely this simultaneity that makes it hard to subtitle these news bulletins by traditional methods on a reasonable budget. For this reason, historically only the territorial news of the Canary Islands, Catalonia and Madrid were subtitled since, being Production Centers, they have a manual or semi-automatic subtitling system that is used for the rest of their programs.

First steps in automation

The idea of automating subtitling arose early, but the challenge was not trivial. As a small aside for those less familiar with subtitling, among whom I confess I counted myself until recently, there are strict regulations (UNE 153010/2012, as well as the code of good practice of the Spanish Center for Subtitling and Audio Description, CESyA) that define, broadly and in great detail, a multitude of parameters, such as subtitling density (as a percentage of everything spoken in the program), maximum delay between speech and text, maximum number of characters per line and of lines, minimum and maximum time the text stays on screen, subtitle position on screen, etc., all of which must be met to ensure that what was spoken can be understood and followed.

Hence, it was not until 2018 that the state of the art of speech technologies and artificial intelligence applied to natural language processing reached a degree of maturity that encouraged us to venture into the automatic subtitling of live news programs with some guarantee of success. The characteristics of these programs add difficulties to automation: being news programs, the speakers are sometimes not professionals and the sound is not always captured under the best conditions (environment, microphone position, etc.). Furthermore, being live means there is only a short time, just a few seconds, to obtain the subtitle and present it on screen.

After some proofs of concept and adjustment periods, an automatic subtitling service was launched whose quality levels were at least equal to those of the other subtitling systems used up to that point. Today, it is a consolidated service with quality levels even above what was expected.

Automatic Bilingual Subtitling Workflow - RTVE

And why not in bilingual centers?

However, at that time the lack of language models for the other languages spoken in Spain and the added difficulty of language changes made it impossible to extend the service to bilingual news programs.

After some failed attempts, in 2020, through a public tender, we finally found a company capable of providing the service we required in Spanish and in the languages spoken in Navarra, the Basque Country, the Balearic Islands, the Valencian Community and Galicia, so that each subtitle would be written in the same language in which it was being spoken.

As with the news in Spanish, it was decided from the beginning that the service would be cloud-based: RTVE delivers the audio signal of the territorial news, in baseband over the AES3 digital audio interface, at the very Center where it is produced. The company in charge of the service performs the processing needed to generate the subtitles automatically, in real time and with spoken language recognition, relying solely on the live sound of the program, since it has no access to prior information sources such as news scripts, rundowns, etc.

The subtitles generated for all Centers are delivered at the Torrespaña CPP (Madrid), in DVB over IP format, for insertion into the DTT stream.

The Territorial Centers that use this new system are those of the Valencian Community, the Balearic Islands, Galicia, the Basque Country and Navarra, which were brought in gradually, one per month, starting in February 2021, once the results had been validated in each case through the corresponding quality control.

An automated alarm system reports any errors along the entire service chain and, in the event of a malfunction, also disconnects the subtitling equipment. The service is provided by Aicox as the integrator, using Etiqmedia technology for processing and generating the subtitles.

Automatic Bilingual Subtitling - RTVE generation process

How is bilingual subtitling generated?

For each Center, the solution has two processing systems running in parallel, one for each of the two languages spoken in each news program. In this processing, speech is transcribed to text and then passed through the dictionary, the capitalization and punctuation module, the number formatting module and others that apply rules to correct some errors, always bearing in mind that every one of these stages adds delay, which demands great care in live subtitling. Aiming for very high quality in any of them means adding seconds that greatly penalize the user experience and may breach the regulations.
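As a rough illustration of the chain of stages described above, the following Python sketch applies a dictionary, a capitalization and punctuation step, and a number formatting step to a raw transcript. Every function name, the sample dictionary and the tiny number table are assumptions made for illustration; the actual modules used in the service are not described in this article and are certainly more elaborate.

```python
import re

# Hypothetical number table; a real module would cover far more cases.
NUMBER_WORDS = {"two": "2", "five": "5", "ten": "10"}


def apply_dictionary(text, dictionary):
    """Replace whole words with their preferred spelling (e.g. proper names)."""
    return " ".join(dictionary.get(word, word) for word in text.split())


def capitalize_and_punctuate(text):
    """Upper-case the first letter and make sure the sentence ends with a mark."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    return text if text[-1] in ".?!" else text + "."


def format_numbers(text):
    """Write the number words found in the table as digits."""
    pattern = r"\b(" + "|".join(NUMBER_WORDS) + r")\b"
    return re.sub(pattern, lambda m: NUMBER_WORDS[m.group(1)], text)


def postprocess(raw, dictionary):
    """Run the stages in the order listed in the text."""
    text = apply_dictionary(raw, dictionary)
    text = capitalize_and_punctuate(text)
    return format_numbers(text)


example = postprocess(
    "rtve opened five centers in navarra",
    {"rtve": "RTVE", "navarra": "Navarra"},
)
```

Each stage here is a cheap, purely local rule, which reflects the constraint mentioned above: in a live setting, anything that needs more context also needs more waiting time.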

In addition, as a fundamental piece of the system, there is the spoken language detection module, which, based on acoustic characteristics and using neural network technologies, has only five seconds to decide whether the language being spoken is A or B, thus selecting the language in which the subtitles are presented at any given moment. Likewise, being live constrains the parameters that can be adjusted on the detector to improve its performance.
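The selection between the two parallel pipelines can be pictured with a small Python sketch. It assumes, purely for illustration, that the detector emits a score per audio segment (the probability that language A is being spoken) and that the decision is revised once per five-second window by averaging those scores; the `Segment` structure, the 0.5 threshold and the averaging rule are hypothetical and are not taken from the real system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float    # seconds from the start of the program
    text_a: str     # output of the language-A pipeline
    text_b: str     # output of the language-B pipeline
    score_a: float  # hypothetical detector score: probability of language A


def select_subtitles(segments, window=5.0, initial="A"):
    """Choose one pipeline's text per segment, re-deciding once per window."""
    current = initial
    window_start = None
    scores = []
    output = []
    for seg in segments:
        if window_start is None:
            window_start = seg.start
        # When a full window has elapsed, decide from the scores gathered in it.
        if seg.start - window_start >= window and scores:
            mean_score = sum(scores) / len(scores)
            current = "A" if mean_score >= 0.5 else "B"
            window_start = seg.start
            scores = []
        scores.append(seg.score_a)
        output.append((current, seg.text_a if current == "A" else seg.text_b))
    return output


# The speaker switches from A to B at second 5.
segments = [
    Segment(float(t), f"a{t}", f"b{t}", 0.9 if t < 5 else 0.1) for t in range(15)
]
selected = select_subtitles(segments)
```

In this sketch the speaker changes language at second 5, but the displayed language only changes at second 10, once a whole window of evidence has been collected. The subtitles shown during that window come from the wrong pipeline, which is the price of a windowed decision.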

Not all bilingual news programs are the same…

Although the structure of the territorial news is the same in all centers in terms of content (headlines, edited pieces, some live connections, weather, culture and sports), with regard to language they follow no common pattern, which makes the results of automatic subtitling uneven.

Thus, in some centers, such as Navarra and the Basque Country, the whole program is delivered in Spanish except for a news summary at the end, which is given in Basque; moreover, in Navarra this is done only in the afternoon news.

In others, practically all the news is presented in the community's own language, switching to Spanish only for statements by public figures, street interviews, etc. In the middle ground are the news programs in which, although the common thread is in a single language, each piece or statement can be in one or the other depending on its author.

It also often happens that, while speaking in one language, some words are said in the other. This occurs mainly, though not only, with entity names (organizations, localities...), a situation that naturally adds a certain degree of difficulty for the language recognizer.

In the case of Galicia, where Spanish is spoken with a strong Galician accent, the language detector, which works with phonemes, has great difficulty distinguishing when a language change occurs, especially in the transition from Galician to Spanish. In Navarra, meanwhile, where it is Basque that is spoken with a marked Castilian accent, the system has not been able to recognize the language change. To alleviate the situation in this specific Center, given its particular circumstances, we are working on triggering the language change through a burst detection module.

Monitoring and quality parameters

Another key piece of the project is the exhaustive quality control carried out, whose results serve not only to measure the quality of the solution but also, by revealing weak points, to help improve the tool and therefore the quality levels obtained.

For this, a specialized company, They fit, which has experts in all the languages covered, analyzes two five-minute fragments of the news of each Territorial Center every week, varying both the day of the week and the position within the program (beginning, middle or end). For each fragment it carries out a set of objective measurements, collected in a weekly report that also includes the most notable errors detected.

One of our concerns is being able to distinguish errors attributable to the language detector from those attributable to the language model. To know exactly the quality of the language model, some premises have been established, such as disregarding the first five seconds after every language change (recall that this is the window set for the detector to decide which language is being spoken, and it therefore contains errors). Likewise, words affected by a language change that the system missed or wrongly added are also left out.
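The first of these premises, leaving out the five seconds that follow each language change, can be sketched in a few lines of Python. The data layout (word timestamps in seconds and a list of change times) and the function name are assumptions for illustration only; how the evaluators actually mark up the fragments is not described here.

```python
def words_to_score(words, change_times, guard=5.0):
    """Keep the words that fall outside the guard window after each change.

    words: list of (time_in_seconds, word) pairs from the reference transcript.
    change_times: times at which the spoken language really changes.
    """
    return [
        (t, w)
        for t, w in words
        if not any(change <= t < change + guard for change in change_times)
    ]


reference = [(0.0, "bos"), (9.0, "días"), (10.0, "buenas"), (12.0, "tardes"),
             (14.9, "a"), (15.0, "todos"), (20.0, "hoy")]
scored = words_to_score(reference, change_times=[10.0])
```

With a language change at second 10, the words spoken between seconds 10 and 15 are dropped and only the remaining ones would count toward the language model's error rate.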

To measure transcription quality, separately for each language, the word error rate (WER) is calculated, which relates the words added, deleted or wrongly transcribed to the total number of words actually spoken. A precision figure is also calculated which, in addition to the previous errors, takes punctuation and capitalization errors into account.
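For readers who have not met the metric before, WER is usually computed as (substitutions + deletions + insertions) divided by the number of words in the reference, where the error counts come from the minimum word-level edit distance between reference and transcript. The Python sketch below follows that standard definition; it is not the evaluators' tool, and the sample sentences are invented.

```python
def wer(reference, hypothesis):
    """Word error rate: minimum word edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deleted word
                dist[i][j - 1] + 1,                 # added word
                dist[i - 1][j - 1] + substitution,  # match or wrong word
            )
    return dist[-1][-1] / len(ref)


one_deletion = wer("el tiempo en galicia hoy", "el tiempo galicia hoy")
two_errors = wer("a b c d", "a x c d e")
```

In the first call one of five reference words is missing, giving a WER of 0.2; in the second, one wrong word and one added word against four reference words give 0.5. Because added words count as errors, WER can exceed 100% on very poor transcripts.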

Regarding the operation of the language detector, undetected changes and changes the system reported that did not actually occur are counted against the real changes, analyzing separately the errors in the change from Spanish to the other language and in the opposite direction.
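One possible way to tally these detector errors is sketched below in Python. It assumes that real and detected changes are available as (time, direction) pairs and that a detection counts as correct if it follows a real change of the same direction within a tolerance; the five-second tolerance, the direction labels and the function name are illustrative assumptions, not the evaluators' actual procedure.

```python
def change_errors(real, detected, tolerance=5.0):
    """Count missed and false language changes per direction.

    real, detected: lists of (time_in_seconds, direction) pairs,
    with direction such as "es->xx" or "xx->es".
    """
    def blank():
        return {"real": 0, "missed": 0, "false": 0}

    stats = {}
    unmatched = list(detected)
    for time, direction in real:
        entry = stats.setdefault(direction, blank())
        entry["real"] += 1
        # Look for a detection of the same direction shortly after the change.
        match = next(
            (d for d in unmatched
             if d[1] == direction and 0 <= d[0] - time <= tolerance),
            None,
        )
        if match is None:
            entry["missed"] += 1
        else:
            unmatched.remove(match)
    # Whatever was detected without a matching real change is a false change.
    for _, direction in unmatched:
        stats.setdefault(direction, blank())["false"] += 1
    return stats


real_changes = [(10.0, "es->xx"), (60.0, "xx->es"), (120.0, "es->xx")]
detected_changes = [(13.0, "es->xx"), (90.0, "xx->es"), (124.0, "es->xx")]
report = change_errors(real_changes, detected_changes)
```

In this invented example both changes from Spanish to the other language are caught, while the change back to Spanish is reported thirty seconds late, so it counts once as a missed change and once as a false one.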

Also, on the same samples, the time it takes for subtitles to appear on screen after the audio is heard is tracked.

Some results

In general, we have observed that the best results are obtained when the beginning of the news is analyzed and the worst correspond to fragments from the end, while for the central part the results vary greatly depending on the content of the fragments. This is expected behavior: the beginning of the news corresponds to a professional reading a previously written text, that is, structured language, on a set and therefore with good audio capture, while the fragments in the central part tend to be statements in natural language, sometimes from the street, by non-professional speakers and in worse acoustic conditions. The final part of the news normally corresponds to weather, sports and culture, where local proper names appear very often; these are infrequent words on which these systems are less effective. There may be a difference of 2 to 5 points in WER between the beginning and the end of the news.

For Spanish, the same language model, trained with thousands of hours, is used for all communities, yet the results are not homogeneous. The best results are obtained in Navarra and the Basque Country, where in more than 90% of the measurements a WER below 8% is obtained even in the most complex parts of the news. The Valencian Community obtains a WER below 10%, while Galicia and the Balearic Islands behave very irregularly and sometimes, always speaking of the fragments analyzed, there are so few words in Spanish that a reliable WER cannot be calculated for this language.

As for the other languages, the results are as follows. Basque: the WER remains below 15% in 90% of the samples. Valencian Community: WER below 25% in the final part and 15% at the beginning. Galicia: WER below 20% in the final part and 15% in the rest. Balearic Islands: below 20% in the final and initial parts and very irregular in the middle of the news.

Regarding precision, most errors are in capitalization and punctuation, which account for between 40 and 50%, and in incorrectly transcribed words, between 25 and 35%. Missing words lag far behind, at around 10%, and added words are practically insignificant.

The language detector behaves unevenly across languages, with different results depending on whether the transition is from Spanish to the local language or in the opposite direction. Its performance also suffers when, after a language change, the fragment spoken in the new language lasts only a few seconds.

Regarding the on-screen presentation time, the maximum is set at 8 seconds; although at the beginning of the project we were quite close to this figure, it has been improving and we are currently between 5 and 6 seconds on average.

What can we expect

Neural network systems applied to this type of use case have brought a spectacular improvement in results over previous technologies; the downside, however, is that they need large amounts of data for training. One of the biggest problems in the languages covered, with the exception of Spanish, is that there is very little training data. Uneven results were therefore already expected, depending on the language, since some language models could be better trained than others, based on earlier work or commissions in those languages, publicly available corpora, etc.

On the other hand, since these are live programs, some improvement techniques, such as post-transcription rules, can only be applied when they are very simple because otherwise they penalize the delivery time of the subtitles.

Aside from the advancement of language models in the different languages, which will undoubtedly come as the growing use of such systems for various applications makes more and more hours available for training, we hope that new Transformer-based technologies for capitalization and punctuation, like others applied to language detection, will bring an appreciable increase in the quality of the subtitles obtained.

On the other hand, we trust that computational capabilities will continue to increase, allowing more complex techniques to be introduced to improve precision and on-screen presentation time.

All of these reasons but, above all, the fact that some associations of deaf people approached us to congratulate us because, for the first time, they can follow a news program in their native language, mean that all the effort invested in this project has been worth it, and give us all the motivation we need to keep committing to it.

Carmen Pérez Cernuda

Deputy director of the innovation and technological strategy area at RTVE

Oct 25, 2022. Section: Automation, Television, Grandstands
