Network: It's, um, English, just like we really speak it: Using an immense data base, lexicographers have taken raw language and produced a revolutionary new dictionary. Robert Nurden reports

The unexpurgated gossip of dinner ladies from Hackney, the negotiations of a company director from Newcastle, and the cries of croquet enthusiasts from Bromley have all helped to create a revolutionary dictionary.

Their spontaneous chit-chat is part of the Spoken Corpus project, devised by Longman, the publishing company. The project involved creating a data base of 10 million words taken directly from everyday situations - the largest ever compiled in English.

About 150 volunteers agreed to wear a tape recorder for up to two weeks so that all their conversations could be recorded. The result: tapes which, if joined up, would be 34 times the height of Mount Everest.

The tapes were transcribed to produce on disk the world's largest data base of spoken English. Lexicographers then turned this English in the raw - much of it extremely rich - into material for dictionaries.

The Spoken Corpus's first offspring is the Longman Language Activator, described by a leading grammarian, Professor Sir Randolph Quirk, as 'the book the world's been waiting for'. The Activator is a dictionary for advanced students of English that not only gives a word's definition - as monolingual dictionaries do - but also points the reader to related words or phrases as they are actually used. It offers a far wider range of such phrases than a standard thesaurus. For instance, the word 'lucky' leads to 'fall on your feet', 'not know you're born' and even 'keep your fingers crossed'.

Technology has not merely enabled lexicographers to work more efficiently and accurately; it has also helped dictionary compilers to trace the ebb and flow of new expressions in the language - which phrases are taking root and which are disappearing. Previously, they had to guess.

Search techniques also enable lexicographers to test frequency of usage. The data base reveals the words that are favourites in speech but infrequent on the page. The word 'really', for instance, is used five times more often in speech than in writing. The search also shows that women speak in different ways from men, and reveals important details about regional and class speech patterns.
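A frequency comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not Longman's method: the function names, the crude tokeniser and the two toy samples are all assumptions made for the example, and a real corpus would run to millions of words.

```python
import re
from collections import Counter


def per_million(text: str) -> dict[str, float]:
    """Return each word's frequency per million tokens in `text`."""
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def speech_to_writing_ratio(word: str, spoken: str, written: str) -> float:
    """How many times more frequent `word` is in speech than in writing."""
    spoken_rate = per_million(spoken).get(word, 0.0)
    written_rate = per_million(written).get(word, 0.0)
    # A word absent from the written sample has no finite ratio.
    return spoken_rate / written_rate if written_rate else float("inf")


# Toy samples, invented for illustration only.
spoken_sample = "well it was really really good um I really liked it you know"
written_sample = (
    "the film was good and the reviewers really liked "
    "the careful pacing it shows"
)

ratio = speech_to_writing_ratio("really", spoken_sample, written_sample)
print(f"'really' is {ratio:.1f} times more frequent in the spoken sample")
```

Normalising to a rate per million words, rather than comparing raw counts, is what allows a 10-million-word spoken collection to be set against a far larger written one.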

Electronic corpora - the word corpus refers to the fact that this is a collection of words - enable specialists to home in on categories of language that interest them: social science, legal terminology, physics, geography, poetic usage and so on.

Linguists have long known about the importance of phatic communion - the noises and pauses that we use to express doubt, joy, fear or aggression, to play for time or to be just plain pig-headed. Um and ah, and even suckings of teeth or intakes of breath, are highly subtle vocal devices that can now be analysed in depth.

The Spoken Corpus is part of the British National Corpus, a collaborative venture between universities and educational bodies that has produced more than 100 million words, 90 million of them written. The Corpus has already changed the way textbooks for foreign students are put together, and has helped Longman to produce the first multimedia CD-ROM dictionary. In future, monolingual dictionaries, which are devoted to helping people whose first language is English, will also contain real examples of spoken English, rather than invented ones. The days of dry academics poring over file cards in dusty research rooms have long gone: the dinner ladies from Hackney have seen to that.

Oxford University Computing Services, which is handling sales of the Spoken Corpus disks, has not yet put a price on them. But the Corpus should become available in the next few weeks to linguists, lexicographers and compilers of English teaching materials.
