Bidirectional language translation using AR wearables and multimodal deep learning
Advances in neural machine translation (MT) over the past decade have pushed automatic language translation to record accuracy on standard benchmarks.
For example, though still imperfect, tools such as Google Translate let users convert speech, images, or text in real time into any of more than 100 languages. (Don’t we live in such cool times?)
However, while I can personally vouch for the utility of such apps for travel and business alike, interacting with a foreign-language speaker through a phone or tablet can still be arduous and disrupts the natural flow of conversation.
Additionally, using these apps in loud or crowded environments often degrades transcription, because background noise is captured along with the speaker’s voice.
Thus, two problems face speakers without a common language who wish to converse using today’s translation tools:
- It can be distracting having to fiddle around with an app on a phone or tablet.
- Translation can be disrupted if unrelated background speech is also recorded.
To solve these, I propose augmenting existing translator apps with AR wearables (i.e., connected earbuds and a visual headset) paired with a multisensory, attention-based system. The mechanics of such a system are described in the following scenario, in which Alice converses with Bob, who speaks only a language foreign to Alice.
- Bob starts speaking to Alice. Alice’s AR glasses detect that Bob is the subject of Alice’s eye focus. With both front- and rear-facing cameras, current eye-tracking technology can estimate the visual region a user’s gaze is targeting (see the gaze-mapping sketch after this list).
- A connected app receives real-time audio-visual input from the glasses and feeds it into a fused neural network, which outputs an audio signal with “background” (visually out-of-focus) noise removed, effectively filtering out sound that does not originate from the speaker (Bob). A toy sketch of such a network follows this list.
- Bob’s cleaned foreign-language audio is translated into Alice’s native language as audio and/or text via cloud or on-device MT APIs.
- Alice’s glasses display subtitles of Bob’s speech in Alice’s native language. Alternatively, synthesized translated speech can be played continuously in Alice’s earbuds. (A sketch of this transcribe–translate–output step also follows the list.)
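To make the eye-focus step concrete, here is a minimal sketch of how an estimated gaze direction from the inward-facing eye cameras might be mapped onto the front (world-facing) camera frame to locate the region Alice is looking at. The pinhole-camera projection, gaze angles, and crop size are illustrative assumptions, not the API of any particular AR SDK.

```python
# Sketch: map a gaze estimate (yaw/pitch) onto the front camera's image and
# crop the region around it -- presumably the speaker's face. All parameters
# here (intrinsics, crop size) are placeholder values.
import numpy as np

def gaze_to_pixel(yaw_rad, pitch_rad, fx, fy, cx, cy):
    """Project a gaze direction (relative to the world camera's optical axis)
    onto pixel coordinates using a simple pinhole-camera model."""
    direction = np.array([
        np.sin(yaw_rad) * np.cos(pitch_rad),   # x: right
        -np.sin(pitch_rad),                    # y: down (image convention)
        np.cos(yaw_rad) * np.cos(pitch_rad),   # z: forward
    ])
    u = cx + fx * direction[0] / direction[2]
    v = cy + fy * direction[1] / direction[2]
    return int(round(u)), int(round(v))

def crop_gaze_region(frame, gaze_uv, size=224):
    """Crop a square patch around the gaze point; this patch is what the
    visual branch of the separation network below would consume."""
    h, w = frame.shape[:2]
    u, v = gaze_uv
    half = size // 2
    top = max(0, min(v - half, h - size))
    left = max(0, min(u - half, w - size))
    return frame[top:top + size, left:left + size]
```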
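The “fused neural net” in the second bullet could take many forms; below is a toy PyTorch sketch of one common pattern for audio-visual target-speaker separation: an audio branch encodes the noisy spectrogram, a visual branch encodes embeddings of the gaze-selected face crops, the two are fused over time, and the network predicts a time-frequency mask that keeps only the in-focus speaker’s voice. Layer sizes and the overall architecture are assumptions for illustration; production systems are considerably larger.

```python
# Toy audio-visual separation model: predicts a [0, 1] mask over the noisy
# spectrogram, conditioned on visual features of the speaker Alice is
# looking at. Shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Audio branch: per-frame encoding of the magnitude spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Visual branch: per-frame face/lip embeddings (e.g. from a
        # pretrained face encoder) projected into the same hidden space.
        self.visual_enc = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(),
        )
        # Temporal fusion of both modalities, then a mask per frequency bin.
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, face_embed):
        # noisy_spec: (batch, time, n_freq) magnitude spectrogram
        # face_embed: (batch, time, visual_dim) gaze-target face embeddings
        a = self.audio_enc(noisy_spec)
        v = self.visual_enc(face_embed)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * noisy_spec   # estimated clean spectrogram

# Example shapes only: ~1 second of audio at ~100 spectrogram frames.
model = AudioVisualSeparator()
clean_spec = model(torch.rand(1, 100, 257), torch.rand(1, 100, 512))
```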
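Finally, a high-level sketch of the remaining steps for one direction of the conversation: the separated audio is transcribed, translated, and rendered as subtitles on the glasses and/or synthesized speech in the earbuds. The three service functions are hypothetical placeholders for whichever cloud or on-device speech, MT, and TTS APIs the app actually uses; running the same routine with the languages and devices swapped gives the other half of the bidirectional exchange.

```python
# Orchestration sketch for one utterance: transcribe -> translate -> output.
# transcribe/translate/synthesize are stand-ins for real speech/MT/TTS APIs.
from dataclasses import dataclass

@dataclass
class TranslationResult:
    source_text: str       # transcript in the speaker's language
    translated_text: str   # transcript in the listener's language

def transcribe(clean_audio: bytes, source_lang: str) -> str:
    """Placeholder for a speech-to-text call (cloud or on-device)."""
    raise NotImplementedError

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Placeholder for a machine-translation call."""
    raise NotImplementedError

def synthesize(text: str, lang: str) -> bytes:
    """Placeholder for a text-to-speech call."""
    raise NotImplementedError

def handle_utterance(clean_audio: bytes, source_lang: str, target_lang: str,
                     show_subtitles=None, play_audio=None) -> TranslationResult:
    """Run one utterance through the transcribe -> translate -> output chain.

    show_subtitles and play_audio are callbacks provided by the AR glasses
    and the earbuds respectively; either or both can be enabled."""
    source_text = transcribe(clean_audio, source_lang)
    translated = translate(source_text, source_lang, target_lang)
    if show_subtitles is not None:
        show_subtitles(translated)                       # captions on the glasses
    if play_audio is not None:
        play_audio(synthesize(translated, target_lang))  # audio in the earbuds
    return TranslationResult(source_text, translated)
```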
Conducting this process bidirectionally could allow real-time conversation between speakers lacking a common language! Besides language translation, there may be applications for hearing aids as well, particularly in loud environments (see the “cocktail party effect”).