Speaking without vocal cords, thanks to a new AI-assisted wearable device

tbenst · on March 24, 2024

This is a super cool device. Note that the decoding is highly limited: they decode into one of five different sentences. This is easier than five words for example as there is more information to distinguish.

Unfortunately the media is blowing this way of out proportion as the larynx alone does not contain sufficient information to decode silent speech.

If you also sense the lips, tongue articulators, and jaw, then general English decoding becomes possible with high accuracy (eg see our recent work here: https://x.com/tbenst/status/1767952614157848859). It’s not in the preprint but I’ve done experiments with only the larynx recorded and performance is pretty abysmal on even a 10 word vocabulary—-hence why they did a five sentence task.

ImHereToVote · on March 24, 2024

I bet if you listened to the feedback you could teach yourself to talk using the larynx and surrounding muscles.

jvanderbot · on March 24, 2024

Why can't the muscles of the larnex and perhaps chest / diaphragm, be monitored and mapped to vocal chord noises, rather than full speech? Just put the noise in the throat and let the rest of the body make it work.

irviss · on March 24, 2024

> If you also sense the lips, tongue articulators, and jaw, then general English decoding becomes possible with high accuracy

A bit OT but I see this frequently and I'm curious. Why do you English speakers (or just a US phenomenon?) tend to use the word "English" instead of "language", "linguistic" or one of its related words to refer to a general concept?

x1798DE · on March 24, 2024

Not OP, but as a native English speaker and former scientist (though not in this area), I would interpret "x does y on English tasks" to mean "we tested this in English and don't know if the effect generalizes to other languages".

thaumasiotes · on March 24, 2024

In this case we do know if the effect generalizes to other languages. It cannot fail to; the larynx, lips, tongue, and jaw are almost all there is. For example, vowels are conventionally defined by jaw position ("height"), tongue position ("frontness"), and lip configuration ("rounded" or not).

You might miss some things like creaky voice or ejectives, you'll probably miss aspiration, but all that does is give you a worst-case scenario analogous to a native speaker trying to understand someone with a foreign accent. Extremely high accuracy will be possible.

AlecSchueler · on March 24, 2024

This is a reasonable hypothesis but if only English has been studied then it would be unscientific to extrapolate at this time.

thaumasiotes · on March 24, 2024

Sure, in the same sense that it would be "unscientific" to conclude that someone's amputated leg didn't regenerate by chance, because the sample size is only 1.

If you know how you're recognizing English, and you know that other languages do not differ from English in relevant ways, then you know you can recognize those other languages. Pretending you don't know something you do know is not scientific.

brookst · on March 24, 2024

This seems like damned-either-way. If they had only tested English and asserted that it was universally applicable to all languages, it’s likely you (or someone else) would rightfully object that it’s annoying when English speakers assume that’s all there is.

thaumasiotes · on March 24, 2024

That's not a similar claim. Anyone can be annoyed by anything; the idea that it's "unscientific" to state that a method of recognizing English by measuring the positions of the lips, tongue, and jaw alongside the activity of the larynx will apply to every other spoken language in the world is ludicrous on its face. It will, because those measurements capture nearly every dimension of phonetic variation that exists. No one could believe otherwise, except apparently for metabagel.

brookst · on March 25, 2024

Is absolute belief in one’s one ability to estimate how every human language could possibly work terribly scientific?

Me, I like scoping claims to what is measured.

thaumasiotes · on March 25, 2024

You say that like no one's ever bothered to measure what kinds of sounds can be used in human languages.

The opposite is the case; this is not a lightly studied field.

AlecSchueler · on March 25, 2024

It's scientific to say other languages are "predicted" to show similar results, but unscientific to say we know they do.

AlecSchueler · on March 25, 2024

You don't know, though. You have a good working hypothesis and you can make reasoned predictions, but it remains untested. The core principle of science is that we test our hypotheses.

metabagel · on March 24, 2024

Other languages have different sounds which aren’t present in English.

thaumasiotes · on March 24, 2024

So? They don't have sounds that are produced in a manner other than arranging the lips, tongue, and jaw.

(Actually, they do. So does English; I already mentioned aspiration. But those are minor elements.)

wizzwizz4 · on March 24, 2024

They're minor elements in English – and even then, you can construct sentences where the meaning changes based on aspiration.

thaumasiotes · on March 25, 2024

Well, no, they're minor elements everywhere. You don't need to be able to capture every phonemic distinction in a language to get a near-perfect transcription, as witnessed by the fact that people understand foreign accents without difficulty. The much larger problem in understanding foreign speech is the odd word choices and lack of grammaticality, but those problems don't arise when you're transcribing native speech.

For some comparisons, think about the fact that Semitic languages are traditionally written without bothering to indicate the vowels, or that while modern English has a phonemic distinction between voiced and unvoiced fricatives, this has a very uneven correspondence to the same distinction as it exists in the writing system. In the case of the interdental fricatives, the writing system does not even contemplate a distinction. And there's nothing particularly problematic about this; if you delete all the voicing information from a stretch of English speech, it stays about as intelligible as it was before. (A voicing difference in stops is not even audible to English speakers. It's audible in fricatives, but no one is going to be confused.)

wizzwizz4 · on March 25, 2024

> For some comparisons, think about the fact that Semitic languages are traditionally written without bothering to indicate the vowels, or that while modern English has a phonemic distinction between voiced and unvoiced fricatives, this has a very uneven correspondence to the same distinction as it exists in the writing system.

And there's a very uneven correspondence between vowels as they exist in speech, and as they exist in the English writing system. Thought dissent mannequin swipe them or bite roar a lie.

You're right that usually, in English, you can understand a sentence with aspiration information stripped out. But just because it's not (usually) significant in English, that doesn't mean that's universal across all languages! Wikipedia has a short lists of languages where aspiration makes a difference. https://en.wikipedia.org/wiki/Aspirated_consonant#Phonemic

> In many languages, such as Armenian, Korean, Lakota, Thai, Indo-Aryan languages, Dravidian languages, Icelandic, Faroese, Ancient Greek, and the varieties of Chinese, tenuis and aspirated consonants are phonemic. Unaspirated consonants like [p˭ s˭] and aspirated consonants like [pʰ ʰp sʰ] are separate phonemes, and words are distinguished by whether they have one or the other.

tbenst · on March 24, 2024

x1798DE captured my intent well. For example, tonal languages like Mandarin or Cantonese may be more difficult to decode if vocal cords aren’t vibrating, and languages with more phonemes that have both a voiced and unvoiced version might be more difficult. I still think decoding will be possible for general language, but that’s a hypothesis whereas I know it’s true for English.

thaumasiotes · on March 25, 2024

> and languages with more phonemes that have both a voiced and unvoiced version might be more difficult.

I had the understanding that English is unusually rich in phonemes that occur in both a voiced and unvoiced version. But as I've mentioned sidethread, this just isn't very significant as far as transcribing English goes.

English has an almost full series of stop and fricative phonemes that exhibit voicing contrasts:

- Bilabial, alveolar, and velar stops /p, b, t, d, k, g/, though the distinction between /t/ and /d/ disappears intervocalically in American English. [In practice, English speakers differentiate these phonemes more by the contrast of aspiration than by the contrast of voicing.]

- Interdental, labiodental, alveolar, palatal, but generally not velar, fricatives /θ, ð, f, v, s, z, ʃ, ʒ/, along with palatal affricates /tʃ, dʒ/.

- Nasals and approximants are always voiced.

Compare a language like Mandarin Chinese, where there are between zero and one pairs of phonemes that contrast by voicing (the sound represented by pinyin "r" may be a voiced fricative otherwise equivalent to "sh", or it may be an approximant; there is no contrasting voiceless approximant), or Spanish, where only the stops feature this contrast.

What are the languages that have more voicing contrasts than English does? It would almost be necessary for such a language to distinguish between voiced and unvoiced vowels. (Some quick research suggests that Icelandic at least has a comparable number of voicing contrasts, but it is not obviously more than English and appears to be actively shrinking.)

> tonal languages like Mandarin or Cantonese may be more difficult to decode if vocal cords aren’t vibrating

More difficult, yes, but in the sense that decoding may take more computation, not that the error rate will go up.

Again, we can already observe that e.g. Mandarin speakers do not have trouble understanding text that carries no information about tone, nor do they have trouble understanding songs, where lexical tone is overridden by the melody of the song.

(What happens here depends what you mean. If you want to decode speech into pinyin with tone marks omitted, the lack of ability to measure tones will fail to be a problem by definition. If you want to decode into Chinese characters, you'll need a robust model of the language, at which point lack of tones will also fail to be a problem - the language model will cover for it. If you want to decode into pinyin with tone marks, you won't be able to do that without using a language model.)

roenxi · on March 24, 2024

I'd speculate English speakers are used to being part of a society where non-English speakers are present and politically important. It is polite not to assume that English = language. Even on the British Isles English isn't a universal thing. Let alone somewhere like America where it isn't even native.

"Language" just doesn't mean "English". In Australia if someone is talking about "language" on its own I'd assume they're Aboriginal advocates.

AlecSchueler · on March 24, 2024

> Even on the British Isles English isn't a universal thing. Let alone somewhere like America where it isn't even native.

English isn't native to all of those isles either only Great Britain.

khazhoux · on March 24, 2024

This is your misperception.

In the instances where a person says "English" in this kind of context, it catches your attention and you infer that the person is an English-speaker, and possibly American.

But when a person uses the generic word "language", you don't notice it.

This leads you to believe that English speakers "tend to use the word English," when that's not the case necessarily.

I don't know what this perceptual fallacy is called, but there's probably a word. In English :-)

atopal · on March 24, 2024

There are about 6000 spoken languages around the world with an extreme variety in how they produce meaning. How could you make sweeping statements about all of them?

johnisgood · on March 24, 2024

I have not noticed this. I just assume that they are specifically talking about a language, in this case: English.

croemer · on March 24, 2024

"Speaking" is a hyperbole. It allows you to say exactly 5 phrases with 95% accuracy, after repeating each sentence 100 times. In other words, it's totally useless. The sentences are so different that they can be distinguished almost entirely by length. I'm very surprised anyone thinks this is useful.

Excerpt: "A brief demonstration was made with five sentences that we had selected for training the algorithm (S1: “Hi Rachel, how you are doing today?”, S2: “Hope your experiments are going well!”, S3: “Merry Christmas!”, S4: “I love you!”, S5: “I don’t trust you.”). Each participant repeated each sentence 100 times for data collection."

I never read a press release from a university, it's always exaggerated.

Original study: https://www.nature.com/articles/s41467-024-45915-7

light_hue_1 · on March 24, 2024

Exactly. They took a neat device and wrapped it in a BS story that's wildly unscientific.

Nature and Science don't mind if people outright lie about what their research means as long as it gets hits. The paper is pretty much just as bad.

This is how mistrust for science slowly builds up when people publish obvious falsehoods.

zharknado · on March 24, 2024

Very cool! This is an insanely impressive sensor, but the proposed application is still in dream phase.

> Going forward, the research team plans to continue enlarging the vocabulary of the device through machine learning and to test it in people with speech disorders.

They haven’t tried giving it to a person with a voice disorder. So it just might not work in that application at all. That will likely depend on the degree to which laryngeal muscles are implicated in a given person’s disorder.

That’s certainly a valid starting place for research purposes, but it’s very early days.

And I imagine you’ll need some very interesting cabling attached to a somewhat beefy device to actually run live inference from this data, plus to drive the speech synthesis.

khimaros · on March 24, 2024

i'm really excited to see progress being made in this space. subvocal speech recognition seems to be an underfunded area of research.

my sense is that it has the potential to make hands free interaction with our devices in public spaces less obnoxious and, consequently, more socially acceptable.

however, i notice that the article doesn't mention anything about dictionary size, which is a very important consideration for a tool of this kind.

thorum · on March 24, 2024

> The research team demonstrated the system’s accuracy by having the participants pronounce five sentences — both aloud and voicelessly — including “Hi, Rachel, how are you doing today?” and “I love you!” (…) Going forward, the research team plans to continue enlarging the vocabulary of the device through machine learning and to test it in people with speech disorders.

It's a proof of concept at this stage but very cool.

dontreact · on March 24, 2024

While subvocal is cool and would allow for speech in more places, something that’s earlier on the tech tree and that I would like to see is just robust lipreading.

I already am comfortable talking to my phone quietly using my AirPods while looking at my screen, but it seems like in loud public places the accuracy becomes unusable. I imagine it could be easily recovered by the additional signal of lipreading.

graphe · on March 24, 2024

https://en.m.wikipedia.org/wiki/Basic_english this communicates English efficiently with 850 words. I don't think it's basic English is any good but I can see them making simplified English the lingua franca to boost 'literacy rates' in the future.

anonylizard · on March 24, 2024

This seems only useful to people who once had a voice, then lost their voice. Because only this way, would they have a unified mapping of voice cord movements to actual voices. Deaf and mutes can't really use this.

It also basically mandates a patch to your throat, because no way of detecting vibrations otherwise.

I wonder if there are visual based ways, like sign->text, expression->text, that would benefit from the larger developments in LLMs. Like an LLM that has access to your conversation history, so when you give your smartphone camera a hand sign and a smile, it can guess and output an entire intended speech.

feverzsj · on March 24, 2024

How it compares to electrolarynx, which give you robotic sound.

joshspankit · on March 24, 2024

Is it just me, or does anyone else think it would be amazing to use upcoming voice assistants with something like this letting you “talk” silently?

ImHereToVote · on March 24, 2024

I'll take the whole lot.