Speech to Text

An online tool that converts spoken words to text

Portions Copyright © 2013, P. Lutus



The App
NOTE: At the time of writing (August 2013), this feature is only supported in Google Chrome.

This speech to text processor relies on Google's Speech to Text API. The desktop version of this API is at the prototype stage, and Google hasn't committed to supporting it on desktops. Google is, of course, fully committed to supporting this capability on the Android platform, but it's not self-evident that they will find a reason to support it on desktop machines in the long term.

Meanwhile, as long as Google continues to support the API, we can experiment with it and see how well it works. (There's also a version of this page that uses voice input for Web searches: Speak & Search.) Also, as people use this facility, and for reasons given in the discussion section below, Google's speech processing engine gradually learns how to recognize languages more efficiently. Here are some instructions:

  • Prepare your system's microphone for use. Set a suitable volume level and try to minimize background noise.
  • The API supports many languages in addition to English. Use the provided selectors to choose a language and dialect.
  • When you're ready, press the microphone button at the lower right. In most cases, before beginning to listen, the browser will ask permission to monitor your microphone.
  • At the time of writing, the speech engine accepts special words such as "comma" and "period" and converts them into punctuation.
  • Speak slowly and distinctly. When you're done, just stop speaking — the API will figure out that you're done.
  • When you have stopped speaking, the API will convert your speech into words and display the result in the upper window.
  • If you want to copy the resulting text onto your clipboard, press the "Copy result" button. For security reasons, browser clipboard copying is a two-step process — this page will select the text, but you must press Ctrl+C (Command+C on a Mac) to complete the copy operation.
  • One more thing. Google's vast computer facilities are the source of the processing that converts speech into text. This means Google will be able to hear everything you say, if they choose to listen.
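
For readers curious about what might sit behind the microphone button, here is a minimal sketch of the steps above: pick a language, start listening, and collect the finished text. It assumes the API in question is the one Chrome exposes as `webkitSpeechRecognition` (the Web Speech API), which the text above doesn't name explicitly. The helper names `findRecognizer`, `finalTranscript` and `listen`, and the small local type declarations, are made up for this illustration and are not taken from this page's actual source.

```typescript
// Local typings for the few recognizer members this sketch touches. They
// mirror the Web Speech API as exposed by Chrome (an assumption), declared
// here so the sketch compiles without browser type definitions.
interface Alternative { transcript: string }
type RecognitionResult = ArrayLike<Alternative> & { isFinal: boolean };
interface RecognitionEvent { resultIndex: number; results: ArrayLike<RecognitionResult> }
interface Recognizer {
  lang: string;
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: RecognitionEvent) => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  onend: (() => void) | null;
  start(): void;
  stop(): void;
}
type RecognizerCtor = new () => Recognizer;

// Look for the recognizer constructor; yields undefined where the feature
// is not supported (or when not running in a browser at all).
function findRecognizer(): RecognizerCtor | undefined {
  const g = globalThis as any;
  return g.SpeechRecognition ?? g.webkitSpeechRecognition;
}

// Join the top alternative of every finished result into one string.
function finalTranscript(event: RecognitionEvent): string {
  let text = "";
  for (let i = 0; i < event.results.length; i++) {
    const result = event.results[i];
    const best = result?.[0];
    if (result?.isFinal && best) text += best.transcript;
  }
  return text;
}

// Configure a recognizer for one utterance in the chosen language and start it.
function listen(
  Ctor: RecognizerCtor,
  lang: string,                                // language tag such as "en-US"
  onText: (text: string) => void,              // receives the converted speech
  onError: (reason: string) => void = () => {},
): Recognizer {
  const recognizer = new Ctor();
  recognizer.lang = lang;
  recognizer.continuous = false;     // stop by itself once the speaker pauses
  recognizer.interimResults = false; // report finished text only
  recognizer.onresult = (event) => onText(finalTranscript(event));
  recognizer.onerror = (event) => onError(event.error);
  recognizer.start();                // the browser may ask for microphone permission here
  return recognizer;
}
```

In a page running in Chrome, this could be wired to a button with something like `const Ctor = findRecognizer(); if (Ctor) listen(Ctor, "en-US", showText);`, where `showText` is a hypothetical function that writes into the result window. Passing the constructor in as a parameter keeps the sketch usable in browsers that lack the feature, since the caller can check for `undefined` first.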


Discussion
I hope my readers understand that the ability to convert speech into text will become a default computer expectation in the future. But at the time of writing, it's in a rather primitive state: it's only reliable with common phrases and words. This is because speech recognition, like present-day artificial intelligence in general, relies on a knowledge of context and setting.

We must resist the temptation to dismiss the speech to text conversion task as a small subset of full computer understanding of speech in the artificial intelligence sense, because it's more than that. For a computer to successfully convert continuous, everyday speech into text, it needs a context that resolves the many ambiguities in speech that humans deal with easily.

What do I mean by "context"? Most human understanding of speech takes place within a conceptual framework, a context, that limits possible interpretations and reduces the complexity of the decoding task. Human speakers also vary in their manner of speaking, cadence, accent and clarity, so a successful conversion must be able to deal with those variations as well. Some natural language conversion programs require a training period to produce reasonable accuracy. It's assumed that future conversion programs will be able to do away with this requirement, just as humans do. Because this conversion program doesn't have a way to train itself, its accuracy is somewhat limited compared to those that do.
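
As a toy illustration of how context narrows the choices, suppose a recognizer has several similar-sounding candidates for one word and knows how often each candidate follows the previous word. The counts, the word list and the function name `chooseWord` below are all invented for this sketch; it says nothing about how Google's engine actually works.

```typescript
// Hypothetical counts of how often the second word follows the first
// in some body of text. These numbers are invented for illustration.
const pairCounts: Record<string, number> = {
  "uncharted rock": 3,
  "uncharted territory": 40,
  "hot wok": 9,
  "hot rock": 2,
};

// Pick the candidate that most often follows `previous`. Candidates are
// assumed to be ordered by how well they match the sound, so when the
// context is unknown the first (best-sounding) guess is kept.
function chooseWord(
  previous: string,
  candidates: string[],
  counts: Record<string, number>,
): string {
  let best = candidates[0] ?? "";
  let bestCount = counts[`${previous} ${best}`] ?? 0;
  for (const word of candidates) {
    const count = counts[`${previous} ${word}`] ?? 0;
    if (count > bestCount) {
      best = word;
      bestCount = count;
    }
  }
  return best;
}

const withContext = chooseWord("uncharted", ["wok", "rock", "rack"], pairCounts); // "rock"
const otherContext = chooseWord("hot", ["rock", "wok"], pairCounts);              // "wok"
const noContext = chooseWord("purple", ["wok", "rock"], pairCounts);              // "wok" (first guess kept)
```

The last call shows the weakness: with no usable context, the sketch can only fall back on the sound-alike guess, which is roughly why uncommon words and phrases are converted less reliably than common ones.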

As computers continue to become more powerful while costing less, I think it's self-evident that natural language interpretation will become commonplace, and not from the centralized, high-power computing facilities that are presently required to accomplish it, but in devices you can hold in your hand. That will have the advantage of being secure and private, unlike this page's example, where those operating the central servers can choose to listen in on what you say.

Not to unduly disparage present technology, but sentences containing uncommon words are very unlikely to be successfully converted by this page's converter. I've been trying to get the software to convert the phrase "I hit an uncharted rock, and my boat is in for repairs" for some time now, with disappointing and sometimes hilarious results. I'm sure there are any number of similar phrases that will fail conversion because of the presence of uncommon words, and these examples show the importance of context.

The ability to convert text to speech is now commonplace, but the reverse is still rare, because speech to text probably requires ten times the intelligence and computing power of text to speech. Again, this will certainly change in the future, and I think eventually text-entry keyboards will only be seen in museums, alongside pencils and abaci.
