Want to search the Web? Forget it - nobody does. When using a search engine what you are in fact searching is a database of indexed pages that are available on the Web. Research published in the April 1998 issue of Science showed that the percentage of the publicly available Web indexed by the various search engines ranged from an abysmal 3 per cent for Lycos to 28 per cent for AltaVista, with the top slot going to HotBot at 34 per cent coverage. Also, as search engines only index HTML Web pages, they don't even touch what's referred to as "the invisible Web" - information that resides in databases and is accessed via the Web rather than on it.
The fact is that search engines simply can't keep up with the growth of the Web. However, while the percentage of the Web they index is static or falling, the total number of Web pages indexed continues to grow. More indexed pages doesn't mean more relevant results, and there are only so many pages of irrelevant results users are prepared to scroll through. Consequently, search engine technology is focusing not on increasing the size of search engine databases, but on improving search capabilities and the relevance of results.
Search engines work by matching the location and frequency of users' search terms against the indexed Web pages in their databases and presenting a list of results ranked by relevance. They don't consider the context of the search term and, by looking for exact, literal matches, fail to consider semantics. Nor do they take into account other signs that a Web page is important: that it is popular, that it has lots of hyperlinks pointing to it, or that it refers to other related Web pages.
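To make the location-and-frequency idea concrete, here is a minimal sketch in Python. It is not the code of any real search engine: the function names, the sample pages and the choice to count a title match three times as heavily as a body match are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_page(query, title, body, title_weight=3.0):
    """Count literal matches of each query term.

    Location: a match in the title counts title_weight times
    (an assumed value). Frequency: every occurrence adds to the score.
    """
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        score += title_weight * title_words.count(term)
        score += body_words.count(term)
    return score

def rank_pages(query, pages):
    """Return the URLs of matching pages, best score first."""
    scored = [(score_page(query, p["title"], p["body"]), p["url"]) for p in pages]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for score, url in scored if score > 0]

# Hypothetical index entries, invented for this example.
PAGES = [
    {"url": "a.example/repair", "title": "Washing machine repair",
     "body": "How to repair a washing machine that will not drain."},
    {"url": "b.example/fix", "title": "Fix it yourself",
     "body": "Fix a leaking tap or fix a washing machine."},
    {"url": "c.example/cars", "title": "Used cars",
     "body": "Cars for sale."},
]

print(rank_pages("fix washing machine", PAGES))
```

Because the matching is literal, a search for "repair" scores the second page at zero even though it is plainly about fixing things. That is the semantic gap described above.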
Several key technologies have emerged to exploit these factors. Two new search engines, Ask Jeeves (www.askjeeves.com) and the Electric Monk (www.electricmonk.com), use natural language searching, which allows you to ask for information exactly as you would speak it: "How do I do business in Russia?"
Ask Jeeves compares your question to its database of seven million questions, while the Electric Monk uses artificial intelligence to conduct a syntactical and semantic analysis of your query, converting it to a complex search strategy that is submitted to AltaVista. In other words, "How do I fix my washing machine?" will also look for words such as "repair", "mend" and "manual". The results from the Electric Monk are reassuringly accurate and the technology behind it is now directly available from AltaVista. No more silly brackets, plus signs or quotes when searching.
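The washing machine example can be sketched in a few lines of Python. This is only an illustration of query expansion, not the Electric Monk's method: the related-terms table, the stop-word list and the generic AND/OR output are assumptions, and the output is not meant to reproduce AltaVista's actual query syntax.

```python
import re

# Hypothetical lookup table; a real system would need a far larger one.
RELATED_TERMS = {
    "fix": ["repair", "mend", "manual"],
}

# Words that carry the shape of the question rather than its subject (assumed list).
STOP_WORDS = {"how", "do", "i", "my", "a", "the", "to", "in", "can"}

def expand_query(question):
    """Turn a plain-English question into groups of interchangeable terms."""
    words = re.findall(r"[a-z]+", question.lower())
    groups = []
    for word in words:
        if word in STOP_WORDS:
            continue
        groups.append([word] + RELATED_TERMS.get(word, []))
    return groups

def to_boolean(groups):
    """Join the groups into a generic boolean search string."""
    parts = []
    for group in groups:
        if len(group) > 1:
            parts.append("(" + " OR ".join(group) + ")")
        else:
            parts.append(group[0])
    return " AND ".join(parts)

print(to_boolean(expand_query("How do I fix my washing machine?")))
# prints: (fix OR repair OR mend OR manual) AND washing AND machine
```

The user types a sentence; the brackets and operators are generated behind the scenes.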
Two other search technologies, Google and Clever, are based on analysing the link structure of the Web. Developed by students at Stanford University, Google (www.google.com) crawls the Web, analysing how websites link to each other and ranking the results by importance - how many Web pages link to each particular website. If you, as a website author, have included hyperlinks to other Web pages or websites that you deem important, then you have exercised an editorial judgement. The text that you may have written around this hyperlink is your editorial commentary. By analysing hyperlinks and their surrounding text, Google seeks to capitalise on the editorial judgement and commentary of thousands of Web authors worldwide.
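As a rough illustration of ranking by links, the Python sketch below counts how many distinct sites link to each target and collects the text written around each link. It is a deliberately simple inbound-link count, not Google's actual ranking method, and the site names and link text are invented for the example.

```python
from collections import defaultdict

# Hypothetical crawl output: (linking site, linked site, text around the link).
LINKS = [
    ("fanpage.example", "whitehouse.example", "the official White House site"),
    ("news.example", "whitehouse.example", "press briefings from the White House"),
    ("school.example", "whitehouse.example", "US government information"),
    ("fanpage.example", "news.example", "daily political news"),
    ("news.example", "news.example", "home"),  # a site linking to itself
]

def link_profile(links):
    """For each linked site, gather who links to it and what they say about it."""
    sources = defaultdict(set)
    commentary = defaultdict(list)
    for source, target, text in links:
        if source == target:
            continue  # a self-link is no one's editorial judgement
        sources[target].add(source)
        commentary[target].append(text)
    return sources, commentary

def rank_by_links(links):
    """Order sites by the number of distinct sites linking to them."""
    sources, _ = link_profile(links)
    return sorted(sources, key=lambda site: (-len(sources[site]), site))

print(rank_by_links(LINKS))
# prints: ['whitehouse.example', 'news.example']
```

Each inbound link is treated as one author's vote, and the collected link text gives a description of the target in other people's words.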
Meanwhile at IBM, a team of researchers examining search engine effectiveness have developed a system which was initially referred to as Hits (Hyperlink-induced topic search). Then the marketing staff became involved with the project and it was branded as Clever.
Like Google, Clever analyses hyperlinks and their surrounding text. Unlike Google, however, Clever first submits your query to a search engine and then conducts its analysis on the results which have been produced. This analysis divides Web pages into two categories: authorities - pages about a particular topic that have lots of links to them (ie, they are authoritative sources of information) and hubs - pages which are a guide to, or list, authoritative sources.
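The hubs-and-authorities split can be sketched in Python along the lines of the published Hits idea: a good hub points to good authorities, and a good authority is pointed to by good hubs, so the two scores are recalculated in turn until they settle. This is not IBM's Clever code. The link graph is invented, the fixed number of rounds is an assumption, and the analysis of surrounding text is left out.

```python
import math

# Hypothetical links among the pages returned for one query:
# page -> pages it links to.
RESULT_LINKS = {
    "guide1": ["siteA", "siteB", "siteC"],
    "guide2": ["siteA", "siteB"],
    "siteC": ["siteA"],
}

def normalise(scores):
    """Scale scores so that their squares sum to 1."""
    length = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / length for page, v in scores.items()}

def hubs_and_authorities(graph, rounds=20):
    """Return (hub, authority) scores for every page in the graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    for _ in range(rounds):
        # Authority score: total hub score of the pages linking in.
        authority = {page: 0.0 for page in pages}
        for source, targets in graph.items():
            for target in targets:
                authority[target] += hub[source]
        authority = normalise(authority)
        # Hub score: total authority score of the pages linked to.
        hub = normalise({
            page: sum(authority[t] for t in graph.get(page, ()))
            for page in pages
        })
    return hub, authority

hub, authority = hubs_and_authorities(RESULT_LINKS)
print(max(authority, key=authority.get))  # prints: siteA
print(max(hub, key=hub.get))              # prints: guide1
```

In this toy graph siteA comes out as the top authority because every other linking page points to it, while guide1 is the best hub because it lists all three sites.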
IBM has been experimenting with Clever to develop Yahoo!-style Web directories. While Clever is not yet available for general release, IBM is currently seeking to license the technology both to portal sites and to organisations with large intranets that want to create their own internal directories.
While Google is a search engine, Clever and Direct Hit are supplements to search engines. Already licensed for use with HotBot, Lycos, Apple's Sherlock search utility and Netscape Communicator 4.5, Direct Hit (www.directhit.com) is based on the concept of popularity. It monitors which websites users are visiting from search engines and ranks how popular they are. For example, provided their search term is a popular one, say "Bill Clinton", HotBot users can perform a second-level analysis of their search results by clicking on the option "Top 10 most visited websites for Bill Clinton". The White House website will be top of the list. Direct Hit will also give an extra boost to "hidden gems" - any website buried further down in the list of search results that a user visits - the next time it appears in someone else's search results.
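A popularity ranking of this kind might be sketched in Python as follows. It is an illustration only, not Direct Hit's actual formula: the click log is invented, and the rule that a click on a result below tenth place earns one bonus point is an assumption standing in for the "hidden gems" boost.

```python
from collections import defaultdict

# Hypothetical log of clicks: (search term, site visited, position in results).
CLICK_LOG = [
    ("Bill Clinton", "whitehouse.example", 1),
    ("bill clinton", "whitehouse.example", 2),
    ("Bill Clinton", "whitehouse.example", 1),
    ("Bill Clinton", "news.example", 3),
    ("Bill Clinton", "biography.example", 25),  # a "hidden gem"
]

def popularity(click_log, gem_position=10, gem_bonus=1.0):
    """Add up clicks per search term, with a bonus for deeply buried results."""
    scores = defaultdict(lambda: defaultdict(float))
    for term, site, position in click_log:
        weight = 1.0
        if position > gem_position:
            weight += gem_bonus  # assumed boost for a hidden gem
        scores[term.lower()][site] += weight
    return scores

def most_visited(scores, term, limit=10):
    """Return up to `limit` sites for a search term, most popular first."""
    sites = scores.get(term.lower(), {})
    ranked = sorted(sites.items(), key=lambda item: (-item[1], item[0]))
    return [site for site, _ in ranked[:limit]]

scores = popularity(CLICK_LOG)
print(most_visited(scores, "Bill Clinton"))
# prints: ['whitehouse.example', 'biography.example', 'news.example']
```

With one visit from 25th place, the biography site overtakes a site that was clicked once from third place, which is the effect the boost is meant to have.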
Direct Hit provides an excellent second-level filter to identify which of your search results are really relevant, as will Clever when it becomes available. However, with their emphasis on "popularity" and "importance", they have one disturbing element in common: both reinforce the gravitational effect exerted by large portal sites. On the Internet, content and commerce are inextricably linked and whoever controls the distribution of content is guaranteed substantial revenue from electronic commerce.