Network: It's, um, English, just like we really speak it: Using an immense data base, lexicographers have taken raw language and produced a revolutionary new dictionary. Robert Nurden reports
Monday 24 October 1994
Their spontaneous chit-chat is part of the Spoken Corpus project, devised by Longman, the publishing company. The project involved creating a data base of 10 million words taken directly from everyday situations - the largest ever compiled in English.
About 150 volunteers agreed to wear a tape recorder for up to two weeks so that all their conversations could be recorded. The result: tapes which, if joined up, would be 34 times the height of Mount Everest.
The tapes were transcribed to produce on disk the world's largest data base of spoken English. Lexicographers then turned this English in the raw - much of it extremely rich - into material for dictionaries.
The Spoken Corpus's first offspring is the Longman Language Activator, described by a leading grammarian, Professor Sir Randolph Quirk, as 'the book the world's been waiting for'. The Activator is a dictionary for advanced students of English that not only gives a word's definition - as monolingual dictionaries do - but also points the reader to related words or phrases as they are actually used. It offers a far wider range of such phrases than a standard thesaurus. For instance, the word 'lucky' leads to 'fall on your feet', 'not know you're born' and even 'keep your fingers crossed'.
Technology has not merely enabled lexicographers to work more efficiently and accurately; it also helps dictionary compilers to trace the ebb and flow of new expressions in the language - which phrases are taking root and which are disappearing. Previously, they had to guess.
Search techniques also enable lexicographers to test frequency of usage. The data base reveals the words that are favourites in speech but infrequent on the page. The word 'really', for instance, is used five times more often in speech than in writing. The search also shows that women speak in different ways from men, and reveals important details about regional and class speech patterns.
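The kind of frequency test described here can be sketched in a few lines of Python. This is only an illustration, not the lexicographers' actual software: the function names, the simple tokenising rule and the two tiny sample texts are all assumptions made for the example. It counts how often a word occurs per million words in a spoken sample and a written sample, then compares the two rates.

```python
import re
from collections import Counter


def per_million(text: str, word: str) -> float:
    """Occurrences of `word` per million tokens of `text` (hypothetical helper)."""
    # Assumed tokenising rule: runs of letters and apostrophes, lower-cased.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * 1_000_000 / len(tokens)


def speech_to_writing_ratio(spoken: str, written: str, word: str) -> float:
    """How many times more frequent `word` is in speech than in writing."""
    written_rate = per_million(written, word)
    if written_rate == 0:
        raise ValueError(f"{word!r} does not occur in the written sample")
    return per_million(spoken, word) / written_rate


# Invented toy samples, standing in for the real spoken and written corpora.
spoken_sample = "well it was really really good you know"
written_sample = "the results were really quite good overall in the end"

ratio = speech_to_writing_ratio(spoken_sample, written_sample, "really")
print(f"'really' is {ratio:.1f} times more frequent in the spoken sample")
```

Rates are normalised per million words so that samples of very different sizes, such as 10 million spoken words set against 90 million written ones, can be compared fairly.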
Electronic corpora - the word corpus refers to the fact that this is a collection of words - enable specialists to home in on categories of language that interest them: social science, legal terminology, physics, geography, poetic usage and so on.
Linguists have long known about the importance of phatic communion - the noises and pauses that we use to express doubt, joy, fear or aggression, to play for time or to be just plain pig-headed. Um and ah, and even suckings of teeth or intakes of breath, are highly subtle vocal devices that can now be analysed in depth.
The Spoken Corpus is part of the British National Corpus, a collaborative venture between universities and educational bodies that has produced more than 100 million words, 90 million of them written. The Corpus has already changed the way textbooks for foreign students are put together, and has helped Longman to produce the first multimedia CD-Rom dictionary. In future, monolingual dictionaries, which are devoted to helping people whose first language is English, will also contain real examples of spoken English, rather than invented ones. The days of dry academics poring over file cards in dusty research rooms have long gone: the dinner ladies from Hackney have seen to that.
Oxford University Computing Services, which is handling sales of the Spoken Corpus disks, has not yet put a price on them. But the Corpus should become available in the next few weeks to linguists, lexicographers and compilers of English teaching materials.