
Modeling and improving text stability in live captions

Automatic speech recognition (ASR) technology has made conversations more accessible with live captions in remote conferencing software, mobile applications, and head-worn displays. However, to maintain real-time responsiveness, live caption systems often display interim predictions that are updated as new utterances are received. This can cause text instability (a “flicker” where previously displayed text is updated, as shown in the captions on the left in the video below), which can impair users’ reading experience through distraction, fatigue, and difficulty following the conversation.

In “Modeling and Improving Text Stability in Live Captions”, presented at ACM CHI 2023, we formalize this problem of text stability through a few key contributions. First, we quantify text instability with a vision-based flicker metric that uses luminance contrast and the discrete Fourier transform. Second, we introduce an algorithm to stabilize the rendering of live captions via tokenized alignment, semantic merging, and smooth animation. Finally, we conducted a user study (N=123) to understand viewers’ experience with live captioning. Our statistical analysis demonstrates a significant correlation between our proposed flicker metric and viewers’ experience, and shows that our proposed stabilization techniques significantly improve viewers’ experience (e.g., the captions on the right in the video above).

Raw ASR captions vs. stabilized captions

Metric

Inspired by previous work, we propose a flicker-based metric to quantify text stability and objectively evaluate the performance of live captioning systems. Specifically, our goal is to quantify the flicker in a grayscale live caption video. We achieve this by comparing the difference in luminance between the individual frames (frames in the figures below) that constitute the video. Large visual changes in luminance are obvious (e.g., the addition of the word “bright” in the figure on the bottom), but subtle changes (e.g., an update from “… this gold. Good..” to “… this. Gold is good”) may be difficult for readers to discern. However, converting the change in luminance into its constituent frequencies exposes both the obvious and the subtle changes.

Thus, for each pair of contiguous frames, we convert the difference in luminance into its constituent frequencies using the discrete Fourier transform. We then sum over the low and high frequencies to quantify the flicker in this pair. Finally, we average over all of the frame pairs to get a per-video flicker score.

For example, we can see below that two identical frames (top) yield a flicker of 0, while two non-identical frames (bottom) yield a non-zero flicker. It is worth noting that higher values of the metric indicate more flicker in the video and thus a worse user experience than lower values of the metric.

Illustration of the flicker metric between two identical frames.
Illustration of the flicker between two non-identical frames.
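
To make the computation concrete, here is a minimal Python sketch, not the paper’s reference implementation: it assumes the caption video is already available as a grayscale NumPy array, and it simply sums the full magnitude spectrum of each frame difference, whereas the paper separates low- and high-frequency contributions.

```python
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """Rough flicker estimate for a grayscale caption video.

    frames: array of shape (num_frames, height, width), luminance in [0, 1].
    Higher scores mean more flicker and, per the study, a worse experience.
    """
    pair_scores = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = curr.astype(np.float64) - prev.astype(np.float64)  # luminance change
        spectrum = np.fft.fft2(diff)                              # constituent frequencies
        pair_scores.append(np.abs(spectrum).sum())                # sum over frequencies
    return float(np.mean(pair_scores)) if pair_scores else 0.0   # identical frames give 0
```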

Stability algorithm

To improve the stability of live captions, we propose an algorithm that takes as input the already rendered sequence of tokens (e.g., “Previous” in the figure below) and the new sequence of ASR predictions, and outputs an updated, stabilized text (e.g., “Updated text (with stabilization)” below). It considers both the natural language understanding (NLU) aspect and the ergonomic aspect (display, layout, etc.) of the user experience in deciding when and how to produce a stable updated text. Specifically, our algorithm performs tokenized alignment, semantic merging, and smooth animation to achieve this goal. In what follows, a token is defined as a word or punctuation mark produced by ASR.

We show (a) the previously rendered text, (b) the baseline layout of the updated text without our merging algorithm, and (c) the updated text as generated by our stabilization algorithm.

Our algorithm addresses the challenge of producing stabilized updated text by first identifying three classes of changes (highlighted in red, green, and blue below):

  1. Red: Addition of tokens to the end of previously rendered captions (e.g., “How about”).
  2. Green: Addition / deletion of tokens in the middle of already rendered captions.
    • B1: Addition of tokens (e.g., “I” and “friends”). These may or may not affect the overall comprehension of the captions, but may lead to layout changes. Such layout changes are not desired in live captions, as they cause significant jitter and a poorer user experience. Here “I” does not add to the comprehension but “friends” does. Thus, it is important to balance updates against stability, especially for B1-type tokens.
    • B2: Removal of tokens, e.g., “in” is removed in the updated sentence.
  3. Blue: Re-captioning of tokens: this includes token edits that may or may not affect the overall comprehension of the captions.
    • C1: Proper nouns, e.g., “disney land” is updated to “Disneyland”.
    • C2: Grammatical shorthands, e.g., “it’s” is updated to “It was”.

Classes of changes between previously displayed and updated text.
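
As a rough illustration of this classification step, the sketch below uses Python’s difflib to label differences between the previously rendered tokens and a new ASR hypothesis. It is a simplification of the actual system, which aligns tokens with a Needleman-Wunsch variant (next section) and further distinguishes C1 from C2 edits; the function and the example inputs are hypothetical.

```python
import difflib

def classify_changes(old_tokens, new_tokens):
    """Label differences between rendered tokens and a new ASR hypothesis.

    Returns (change_class, old_span, new_span) tuples, where change_class is
    'A' (append at end), 'B1' (mid-text addition), 'B2' (mid-text removal),
    or 'C' (re-captioning of existing tokens).
    """
    changes = []
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert" and i1 == len(old_tokens):
            changes.append(("A", [], new_tokens[j1:j2]))                 # appended at the end
        elif tag == "insert":
            changes.append(("B1", [], new_tokens[j1:j2]))                # added mid-text
        elif tag == "delete":
            changes.append(("B2", old_tokens[i1:i2], []))                # removed mid-text
        else:  # "replace"
            changes.append(("C", old_tokens[i1:i2], new_tokens[j1:j2]))  # re-captioned
    return changes

# Hypothetical example:
# classify_changes("How about we go to disney land".split(),
#                  "How about we go to Disneyland with friends".split())
```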

Alignment, merging, and smoothing

To maximize text stability, our goal is to align the old sequence with the new sequence using updates that make minimal changes to the existing layout while ensuring accurate and meaningful captions. To achieve this, we leverage a variant of the Needleman-Wunsch algorithm with dynamic programming to merge the two sequences depending on the class of tokens defined above (a simplified sketch follows the list below):

  • Case A tokens: We directly add case A tokens, inserting line breaks as needed to fit the updated captions.
  • Case B tokens: Our preliminary studies showed that users preferred stability over accuracy for previously displayed captions. Thus, we only update case B tokens if the updates do not break the existing line layout.
  • Case C tokens: We compare the semantic similarity of case C tokens by transforming the original and updated sentences into sentence embeddings, measuring their dot product, and updating them only if they are semantically different (similarity < 0.85) and the update will not cause new line breaks.
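
A minimal sketch of these per-class decision rules is shown below. The embed() and breaks_layout() helpers are assumptions standing in for a sentence-embedding model and the caption renderer’s layout check; neither name comes from the paper.

```python
import numpy as np

def should_apply(change_class, old_text, new_text, embed, breaks_layout,
                 similarity_threshold=0.85):
    """Decide whether to render an update, favoring layout stability.

    change_class: 'A' (append), 'B' (mid-text add/remove), or 'C' (re-captioning).
    embed(text) -> unit-norm sentence embedding (assumed helper).
    breaks_layout() -> True if the edit would change existing line breaks (assumed helper).
    """
    if change_class == "A":
        return True                                # always append new trailing tokens
    if change_class == "B":
        return not breaks_layout()                 # prefer stability over accuracy
    if change_class == "C":
        similarity = float(np.dot(embed(old_text), embed(new_text)))
        return similarity < similarity_threshold and not breaks_layout()
    return False
```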

Finally, we leverage animations to reduce visual jitter. We implement smooth scrolling and fading of newly added tokens to further stabilize the overall layout of the live captions.

User evaluation

We conducted a user study with 123 participants to (1) examine the correlation of our proposed flicker metric with viewers’ experience of the live captions, and (2) assess the effectiveness of our stabilization techniques.

We manually selected 20 videos from YouTube to obtain broad coverage of topics including video conferences, documentaries, academic talks, tutorials, news, comedy, and more. For each video, we selected a 30-second clip with at least 90% speech.

We prepared four types of live caption renderings to compare:

  1. Raw ASR: raw speech-to-text results from a speech-to-text API.
  2. Raw ASR + thresholding: only display an interim speech-to-text result if its confidence score is higher than 0.85 (see the sketch after this list).
  3. Stabilized captions: captions produced by our algorithm described above, with alignment and merging.
  4. Stabilized and smooth captions: stabilized captions with smooth animation (scrolling + fading), to assess whether a softened display experience further improves the user experience.
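
For reference, the thresholding baseline (condition 2) amounts to a few lines of code. The sketch below assumes a streaming result object with transcript, confidence, and is_final fields, loosely mirroring common speech-to-text APIs rather than any specific one.

```python
def caption_to_show(result, previous_caption, confidence_threshold=0.85):
    """Baseline 2: only surface an interim hypothesis if it is confident enough.

    `result` is assumed to expose .transcript, .confidence, and .is_final;
    low-confidence interim results leave the previously shown caption unchanged.
    """
    if result.is_final or result.confidence > confidence_threshold:
        return result.transcript
    return previous_caption
```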

We collected user ratings by asking the participants to watch the recorded live captions and rate their comfort, distraction, ease of reading, ease of following the video, fatigue, and whether the captions impaired their experience.

Correlation between flicker metric and user experience

We calculated Spearman’s rank correlation coefficient between the flicker metric and each of the behavioral measurements (values range from -1 to 1, where negative values indicate a negative relationship between the two variables, positive values indicate a positive relationship, and 0 indicates no relationship). As shown below, our study demonstrates statistically significant (p < 0.001) correlations between our flicker metric and users’ ratings. The absolute values of the coefficients are around 0.3, indicating a moderate relationship.

Behavioral Measurement         Correlation to Flicker Metric*
Comfort                        -0.29
Distraction                     0.33
Easy to read                   -0.31
Easy to follow videos          -0.29
Fatigue                         0.36
Impaired experience             0.31

Spearman correlation tests of our proposed flicker metric. *p < 0.001.
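
This kind of analysis can be reproduced with SciPy; the arrays below are illustrative placeholders rather than the study’s per-participant data.

```python
from scipy.stats import spearmanr

# Hypothetical per-clip flicker scores and matching comfort ratings (1-7).
flicker_scores = [0.12, 0.35, 0.08, 0.51, 0.27, 0.44]
comfort_ratings = [6, 3, 7, 2, 5, 3]

rho, p_value = spearmanr(flicker_scores, comfort_ratings)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # negative rho: more flicker, less comfort
```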

Stabilization of live captions

Our proposed technique (stabilized smooth captions) received consistently better ratings, significant as measured by the Mann-Whitney U test (p < 0.01 in the figure below), on five of the six aforementioned survey statements. That is, users considered the stabilized captions with smoothing to be more comfortable and easier to read, while feeling less distraction, fatigue, and impairment to their experience than with the other types of rendering.

User ratings from 1 (Strongly Disagree) to 7 (Strongly Agree) on the survey statements. (**: p < 0.01; ***: p < 0.001; ****: p < 0.0001; ns: not significant)
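
The per-statement comparison between renderings can likewise be run with SciPy’s Mann-Whitney U test; again, the ratings below are made-up placeholders, not the study’s data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-7 "easy to read" ratings under two renderings.
raw_asr = [3, 2, 4, 3, 2, 5, 3, 4]
stabilized_smooth = [6, 5, 7, 6, 5, 6, 7, 6]

statistic, p_value = mannwhitneyu(stabilized_smooth, raw_asr, alternative="greater")
print(f"U = {statistic}, p = {p_value:.4f}")
```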

Conclusion and future directions

Text instability in live captioning significantly impairs users’ reading experience. This work proposes a vision-based metric to model caption stability that correlates significantly with users’ experience, and an algorithm to stabilize the rendering of live captions. Our proposed solution can potentially be integrated into existing ASR systems to enhance the usability of live captions for a variety of users, including those with translation needs and those with hearing accessibility needs.

Our work represents a substantial step towards measuring and improving text stability. It can be extended with language-based metrics that focus on the consistency of the words and phrases used in live captions over time. Such metrics could reflect user discomfort as it relates to language comprehension and understanding in real-world scenarios. We are also interested in conducting eye-tracking studies (e.g., the videos shown below) to track viewers’ gaze patterns, such as eye fixations and saccades, allowing us to better understand which types of errors are most distracting and how to improve text stability for them.

Illustration of tracking a viewer’s gaze while reading raw ASR captions.

Illustration of tracking a viewer’s gaze while reading stabilized and smoothed captions.

By improving text stability in live captions, we can create more effective communication tools and improve how people connect in everyday conversations conducted in familiar or, through translation, unfamiliar languages.

Acknowledgements

This work is a collaboration across multiple teams at Google. Key contributors include Xingyu “Bruce” Liu, Jun Zhang, Leonardo Ferrer, Susan Xu, Vikas Bahirwani, Boris Smus, Alex Olwal, and Ruofei Du. We wish to extend our thanks to the colleagues who provided assistance, including Nishtha Bhatia, Max Spear, and Darcy Philippon. We would also like to thank Lin Li, Evan Parker, and the CHI 2023 reviewers.
