The first attempts were made at the laboratories of Bell, the telephone company. In 1936, a Bell Labs scientist, H W Dudley, invented the world's first electronic speech synthesizer: it required an operator with a keyboard and foot pedals to supply "prosody" - the pitch, timing, and intensity of speech. The machine grew out of Dudley's earlier "voice coder", or vocoder; the keyboard-operated synthesizer itself became known as the "Voder" (short for Voice Operating Demonstrator), and it proved a hit at the New York and San Francisco World's Fairs of 1939.
The problem was the human interaction required. Ideally, one would just give the machine (nowadays, computer) a stream of text which it would render into speech.
Generating sounds is not a problem for computers. Synthesizers have changed the face of popular music. By powering a speaker with a stream of electronic pulses of varying amplitude, they can mimic all sorts of instruments. Generating a human voice is the same task - but language adds complexities of pronunciation and, for the computer, comprehension of what it is reading.
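To make the idea of "a stream of pulses of varying amplitude" concrete, here is a minimal sketch of how a computer might work out the amplitude values for a single pure tone. The function name `tone`, the 8,000-samples-per-second rate and the 440Hz pitch are assumptions chosen for illustration; a real synthesizer layers many such waveforms and shapes them over time.

```python
import math


def tone(frequency_hz, duration_s, sample_rate=8000, peak=1.0):
    """Return a list of amplitude samples for a pure sine tone.

    Each sample is the speaker position at one instant, between
    -peak and +peak. The names and defaults are illustrative only.
    """
    count = round(duration_s * sample_rate)
    return [
        peak * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(count)
    ]


# Ten milliseconds of a 440Hz tone at 8,000 samples per second.
samples = tone(440, 0.01)
print(len(samples))  # 80
```

Sent to a sound card at the same rate, a list like this would be heard as a steady note; a voice needs the same kind of stream, only with a far more complicated shape.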
Computers typically generate speech using combinations of "phonemes", the individual sounds within words. The word "phoneme" consists of two syllables but five phonemes: "f", "oh", "n", "ee" and "m". English has 40-odd phonemes in all, the exact count depending on the accent. Phonemes are easy to digitise, but it turns out that making recognisable speech from them is harder. The "transition" where one phoneme (say, "f") glides into the next (say, "oh") is difficult to do with a computer, and it is actually simpler to digitise the phonemes together with their transitions, and to cut the recordings halfway through each phoneme. This produces about 400 transition-phoneme pieces which, like Lego bricks, can be spliced together for seamless speech. Add the phonemes that start words, and you can produce any word from that library.
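The splicing scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `splice_units`, the "_" marker for the silence at a word boundary and the rough phoneme spellings are all invented here, and a real system would look up a recorded sound for each piece (such pieces are often called diphones) instead of just naming it.

```python
def splice_units(phonemes):
    """List the half-phoneme-to-half-phoneme pieces needed for one word.

    Each piece runs from the middle of one sound to the middle of the
    next, so it contains one complete transition. "_" stands for the
    silence before and after the word (an assumption of this sketch).
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme spelling of the word "cat".
print(splice_units(["k", "a", "t"]))
# ['_-k', 'k-a', 'a-t', 't-_']
```

A word of three phonemes needs four pieces, because the pieces that open and close the word are counted as well. Playing the matching recordings back to back would, in principle, give the whole word with its transitions intact.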
An accent is produced by variations in the phonemes and transitions, both in their pitch and speed: the American "tomayto" and the English "tomahto" are one example.
All that is the easy part, though. Turning text into speech also requires analysis of the sentence being spoken, or meaning can be lost: "I'm so pleased to see you" could be read many ways, depending on whether the stress falls on "so", "pleased" or "you". Incorporating inflection, pauses and emphasis into computer-generated speech remains the big problem, which scientists are still struggling to overcome.