The title is misleading: they're not bringing the speech engine to everyone, since you still have to pay for Google Cloud API requests. They're merely open-sourcing the Android app that calls out to the API. See the GitHub link in the article: https://github.com/google/live-transcribe-speech-engine/blob...
That's disappointing; the one exciting feature on the Pixel 4 is its offline transcription. From the headline I figured that's what this was about.
Google published research papers on offline recognition, the feature is shipping on Android phones (i.e. one can inspect a working device), there is an active OSS community for TensorFlow and lots of public work on speech recognition. Many building blocks are public for motivated researchers.
Ugh. I was so happy to read this title; I thought I could finally stop work on reverse-engineering the app's offline tflite engine. I guess this is motivation to pick that project back up.
I'm still waiting for meeting transcripts that understand who is speaking. Given how far we've come with speech recognition, I'm legitimately surprised that this fairly common use case is still omitted.
I'm not even saying it needs to name the people in the meeting. Just understand, contextually, whether a given utterance is from "person 1" or "person 2," then associate it with that label as it records.
Maybe this can help? Though Google's existing APIs might already be able to do this.
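For what it's worth, Google Cloud Speech-to-Text does offer speaker diarization: with it enabled, each recognized word comes back with a numeric speaker tag. Turning those tags into a "person 1" / "person 2" transcript is then just a small grouping step. Here's a minimal sketch of that grouping, assuming the recognizer hands you (word, speaker_tag) pairs in spoken order; the function name and data shape are mine, not any official API:

```python
def label_speakers(tagged_words):
    """Group consecutive words by speaker tag into 'person N: ...' lines.

    tagged_words: iterable of (word, speaker_tag) pairs in spoken order,
    roughly what you'd extract from a diarization-enabled recognition result.
    """
    lines = []
    current_tag = None
    current_words = []
    for word, tag in tagged_words:
        if tag != current_tag:
            # Speaker changed: flush the previous speaker's run of words
            if current_words:
                lines.append(f"person {current_tag}: {' '.join(current_words)}")
            current_tag = tag
            current_words = []
        current_words.append(word)
    if current_words:
        lines.append(f"person {current_tag}: {' '.join(current_words)}")
    return lines


# Hypothetical diarization output for a two-person exchange
words = [("hello", 1), ("there", 1), ("hi", 2), ("how", 2),
         ("are", 2), ("you", 2), ("fine", 1)]
for line in label_speakers(words):
    print(line)
# person 1: hello there
# person 2: hi how are you
# person 1: fine
```

The grouping is deliberately dumb (consecutive runs only); the hard part, assigning the tags in the first place, is what the diarization model does for you.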
While I appreciate the audio codec discussion and bandwidth-to-accuracy tradeoffs, how much of the speech recognition could be done on-device rather than shipping it off to the cloud? It's my understanding that it's a matter of installing pattern files for analyzing the audio without needing to fail over to the cloud; how many GB are we talking to be able to cover normal daily speech, assuming a minimum of jargon? For the hearing impaired, not having to hit the cloud at all seems like the best option (and you don't need to compress the audio at all or worry about cloud-trip bandwidth).
On Android: Language and Input > Google voice typing > Offline speech recognition, then ensure Wi-Fi and data are off and try shouting at textboxes (you might need to press a microphone button on your selected keyboard, unsure).
Can anyone attest to how accurate this transcription is for technical subjects? I've attempted to integrate transcription into my work life (pharma), but correcting errors related to tech jargon or acronyms/abbreviations always outweighed any benefit.
I can't speak to Google's, but Dragon Professional supports legal and medical jargon out of the box. Pricey, but for powerful offline speech recognition, that's understandable.
Dragon legal’s primary advantage is that it handles citation formatting. I don’t think I’ve ever come across a legal term it didn’t know (maybe dépeçage).
This is cool but a bit worrisome... remember the days when it was too expensive to log all audio transmissions on any platform/communication device, so you thought you had some level of privacy? Projects like PRISM might be able to do more than simply log metadata.
I've been looking for some way to transcribe my own talks: sometimes I find a turn of phrase or example during the talk that strikes me as useful while I'm giving it, but then forget it. Perhaps this can be coaxed into providing that service for me.
The ML transcription services work for giving you the gist of what's been said, if the recording is of decent quality; I should probably consider doing the same thing. If it's a recording that I want to be "perfect" (e.g. a posted podcast transcription), I still use human transcription, since cleaning up the machine transcription isn't worth my time. But if the transcription is mostly to jog your own memory, it's probably fine and much cheaper.
Just downloaded it on my antiquated LG5. No cloud key required, but the results are hilarious at best; nearly a total waste of time. I guess we will have to rely on good ole notepad (paper) for now.