To bee or not to bee

Search engines may be remarkable resources, but they're not intelligent. Will a new 'semantic' web be clever enough, asks Danny Bradbury, to tell a flying insect from a work of music?
The Independent Online

Web searches have always been a bit hit and miss. Even when your searches are clearly defined, you'll turn up irrelevant web pages that happen to have the same keywords. Looking for details of bumble bees' flight? Google's first result points to the composer Rimsky-Korsakov.

That ambiguity is why web-based research is a manual activity: search engines need a human operator to filter the useful documents from the useless ones. Writing an artificially intelligent program to do this for you would be a daunting task, because machines can look for combinations of words but they can't understand the concepts behind documents.

But if Eric Miller gets his way, we'll soon be letting machines do the thinking for us. Miller is in charge of the Semantic Web working group at the World Wide Web Consortium, the organisation that governs the standards behind the web. He foresees a future when the web will develop from a hotchpotch of unstructured information into a refined, indexed set of information resources that machines will be able to understand. The Semantic Web, which you could think of as the web mark two, will underpin all of this, he hopes.

"My goal isn't to replace the web with the Semantic Web - it's to create one whole web," he says. "I want the web in my car, on my cellphone and on my PDA, with each of these things becoming browsers and servers, recording information other devices can integrate into a cohesive whole."

The underlying beauty of the Semantic Web is that it knows what it contains. Web searches have always been hit and miss because the search index is essentially unintelligent - it holds information that is referenced by keywords, without any context of what the information actually means.

Tim Berners-Lee, the inventor of the protocols underlying the World Wide Web, is another key player who wants to change that. In 1998 he and a couple of colleagues proposed a "second generation web" which they called the Semantic Web. It would be built on "ontologies" - descriptions of information in a particular subject area. Think of them as glossaries that describe how specific terms relate to each other. So, an ontology dealing with aircraft might include basic rules explaining that a wing creates lift, that a biplane has two sets of wings, that a monoplane has one set of wings and that a jumbo jet is a monoplane. It might also have more detailed information, such as that a delta wing is a type of wing designed to help aeroplanes break the sound barrier.
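To make the aircraft example concrete, here is a minimal sketch in Python of how such a glossary might be held as simple subject-relation-object statements. The relation names (`is_a`, `creates`, `wing_sets`) and both helper functions are invented for illustration; they are not part of RDF or of any real ontology.

```python
# Hypothetical mini-ontology built from the aircraft rules in the text.
# Each entry is a (subject, relation, object) statement.
TRIPLES = [
    ("wing", "creates", "lift"),
    ("biplane", "wing_sets", "2"),
    ("monoplane", "wing_sets", "1"),
    ("jumbo jet", "is_a", "monoplane"),
    ("delta wing", "is_a", "wing"),
]


def ancestors(term, triples=TRIPLES):
    """Return every category the term belongs to, following is_a links."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "is_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def lookup(term, relation, triples=TRIPLES):
    """Find a property on the term itself, or inherit it from a category."""
    for candidate in [term] + ancestors(term, triples):
        for subject, rel, obj in triples:
            if subject == candidate and rel == relation:
                return obj
    return None


if __name__ == "__main__":
    # Nobody stated how many sets of wings a jumbo jet has; the answer
    # is inherited from the statement that a jumbo jet is a monoplane.
    print(lookup("jumbo jet", "wing_sets"))
    print(lookup("delta wing", "creates"))
```

The point of the sketch is the inheritance step: the program can answer a question that no single statement answers, which is the kind of reasoning over relationships that keyword matching cannot do.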

These rules are encoded using the Resource Description Framework (RDF). Usually written in the now popular XML (a markup language related to the web's HTML page-encoding language), RDF is designed to describe concepts and their relationships so that computers can process them. Suddenly, machines are able to find tagged information and work out how different terms relate.

Miller uses the rock star Sting as an example. You want information on his upcoming albums, concert series, where he is going to be, and where you might be too. You also want to know the movies he starred in and any books he's written. You're interested in knowing his interest in certain kinds of art, and certain types of sculpture he likes. "So, you're collecting data about this guy from lots of different sources. But the library won't index him under Sting - the information comes from lots of different places," Miller says. Semantic Web technology can go and find this information for you because although it won't be obviously connected, RDF will connect the underlying concepts together in ways that a machine can understand. Put like this, the Semantic Web would be a college student's dream. Essays would almost write themselves, as the web became more like an encyclopedia and less like a melting pot of unstructured information.

The creators of the Semantic Web also hope to see intelligent software agents that roam the internet looking for information on a person's behalf. Want to know which restaurants in a particular city are offering all-you-can-eat meals on special offer? How about finding out which cinemas are showing romantic comedies this evening, but only the ones starring Tom Hanks or Meg Ryan? Let the Semantic Web agent sort it out for you while you get on with something else.
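Once listings carry structured descriptions, the cinema question becomes a simple filter. The sketch below, in Python, assumes the agent has already gathered tonight's listings; the `Showing` record, the `find_cinemas` function and all the cinema and film names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Showing:
    """One film showing this evening (hypothetical record layout)."""
    cinema: str
    film: str
    genre: str
    stars: frozenset


# Invented listings standing in for data an agent might collect.
LISTINGS = [
    Showing("Cinema A", "Film 1", "romantic comedy", frozenset({"Tom Hanks"})),
    Showing("Cinema B", "Film 2", "romantic comedy", frozenset({"Another Actor"})),
    Showing("Cinema C", "Film 3", "thriller", frozenset({"Meg Ryan"})),
    Showing("Cinema D", "Film 4", "romantic comedy", frozenset({"Meg Ryan", "Tom Hanks"})),
]


def find_cinemas(listings, genre, any_of_stars):
    """Return cinemas showing the genre with at least one of the named stars."""
    wanted = set(any_of_stars)
    return sorted(
        {s.cinema for s in listings if s.genre == genre and wanted & s.stars}
    )


if __name__ == "__main__":
    print(find_cinemas(LISTINGS, "romantic comedy", ["Tom Hanks", "Meg Ryan"]))
```

The filter works only because genre and cast are labelled fields rather than words scattered through a page, which is the difference the Semantic Web's proponents are counting on.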

This all sounds wonderful, but five years on it has failed to materialise. According to John Davies, manager of next-generation web research for BT Exact, BT's research arm, that's because the web is simply too huge. "Whether it will make the step to the external web, the jury is out. It's unlikely that anyone will turn those five billion pages into RDF any time soon," he points out.

One way to make it happen would be to build artificially intelligent software robots that go out and automatically recode the existing information on the web in RDF form. But it would be very difficult to make them accurate and smart enough. Another idea, says Davies, is to start from scratch. "As people generate information it's easy to generate metadata. In Microsoft Word, for example, there are already facilities that let you specify the author of a document. If you could extend that a bit so that the information could be stored in RDF, then you'd start to encourage people."

But one of the biggest problems facing the architects of the new, smarter web is that the ontologies - the indexes of concepts - are still being formed. There are a lot of subject areas to cover. Indexing all the concepts in the world is one thing; building information into the index that describes how they all relate to each other is daunting in the extreme, especially because people are currently building ontologies that cover single, specialist areas of knowledge such as aircraft design or music copyright. This is why the Semantic Web tools that Davies is building, including a Semantic Web search engine called QuizRDF, are being used by companies for vertical market applications, such as one at Airbus that manages the information used during aircraft design.

Miller is convinced that the task of cross-referencing ontologies needs to be attempted, however. "There's a blur between what is domain-specific and what is cross-domain information," he says. "People are beginning to realise that the information you think is domain-specific can be useful outside the domain."

This will become increasingly true as the ontologies that people build become more generic and flow into each other. One possible solution to the cross-domain problem lies with the Dublin Core Metadata Initiative, of which Miller used to be the associate director. This organisation is finding new ways to map ontologies together. Hopefully in time this will create a huge fabric of concepts that can be referenced no matter what specific area of knowledge you begin your search in - and it could eventually lead to the creation of a Semantic Web that the average consumer can use. But don't expect it for a while - there is a lot of work to be done yet.
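Mapping ontologies together can be pictured as translating each source's own field names into a shared vocabulary. The Python sketch below is an illustration under stated assumptions, not a description of how the Dublin Core initiative works: the shared terms "creator" and "title" echo Dublin Core element names, but the mapping table, the source names and the records are all invented.

```python
# Hypothetical mapping from (source, local field name) to a shared term.
FIELD_MAP = {
    ("library", "author"): "creator",
    ("library", "book"): "title",
    ("music", "artist"): "creator",
    ("music", "album"): "title",
}


def to_shared(source, record):
    """Rename a record's fields to the shared vocabulary where a mapping exists."""
    return {
        FIELD_MAP.get((source, field), field): value
        for field, value in record.items()
    }


def works_by(creator, tagged_records):
    """List titles by one creator across every source, after translation."""
    shared = [to_shared(source, record) for source, record in tagged_records]
    return sorted(r["title"] for r in shared if r.get("creator") == creator)


# Invented records from two separate catalogues.
RECORDS = [
    ("library", {"author": "Sting", "book": "Book A"}),
    ("music", {"artist": "Sting", "album": "Album B"}),
    ("music", {"artist": "Someone Else", "album": "Album C"}),
]

if __name__ == "__main__":
    print(works_by("Sting", RECORDS))
```

Neither catalogue has to change its own terms; only the mapping table knows that a library's "author" and a music database's "artist" mean the same thing.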

The first implementations of Semantic Web technology are taking place behind the scenes, within companies that work in specific industries. Because these industries have their own tightly defined terms and concepts, ontologies can be developed more easily. Davies explains that knowledge management - making the most of the expertise within a company - is one of the first areas where Semantic Web technology will flourish. For now, you'll have to seek out Sting's greatest written works, or the secrets of the bee's flight, for yourself - but in theory, a machine might find them out for you, in time.