Online Research: Understanding the World of the World Wide Web: Search Engines
It is said that even the most prolific search engines index only a very small percentage of the pages that make up what we know as the Internet, the World Wide Web or, as it has been termed, the "surface web". Google is said to index some 4 billion pages and appears to be the search engine of choice for most searchers. Unfortunately, there are at least another 6 billion pages that Google does not currently index (Bright Planet Corporation, 2004), which is probably a good reason not to rely on Google for all your searching needs.
One of the main problems faced by most search engines, and of course by the people needing quality research information, is that search engines cannot index what they cannot see. To be found by a search engine, a web site, web page or database either has to be submitted by the author for inclusion in a search engine listing (usually for a fee), or has to be harvested by robots or spiders. These crawl through the pages, following links, much like a human might search the Internet. However, in order to be found, a web page has to be written in static HTML code, or have enough static HTML code to be visible to the spiders, and it must have enough links and keyword-rich content to merit inclusion in the search engine directory. These pages are then ranked according to the search engine's algorithm.
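To make the link-following idea concrete, here is a minimal sketch of how a spider might work. It is not how any particular search engine does it. The tiny in-memory "web" (PAGES), the page paths and the crawl function are all made up for illustration, and a real spider would fetch pages over the network rather than from a dictionary.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site used only for this example: path -> static HTML.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    "/products": '<a href="/products/widget">Widget</a>',
    "/products/widget": "<p>A widget.</p>",
    "/orphan": "<p>No other page links here.</p>",
}


def crawl(start, pages):
    """Breadth-first walk that visits only pages reachable by links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            # Follow only known pages that have not been visited yet.
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


indexed = crawl("/", PAGES)
print(indexed)
```

Notice that "/orphan" exists but is never visited, because no other page links to it. That is the "cannot index what they cannot see" problem in miniature.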
As you can imagine, this raises a few questions. Why aren't some sites indexed? Why do some queries give lots of useless results?
As we mentioned previously, sites have to be written in HTML code in order to be found. They then need enough keyword-rich content for a search engine to decide whether or not the site is worth indexing. Unfortunately, a lot of sites are written using JavaScript and contain dynamically generated content.
Dynamically generated content, such as directory listings, exists only in databases, and the "results" are constructed only when you ask a direct question of the site's own search engine. If you have ever used an online telephone directory (whitepages.com.au) you will know what I mean. Databases such as these are known as "the invisible web" or "Deep Web" and are rarely, if ever, found within the surface web domain of the search engines.
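The following rough sketch shows why such content stays invisible to a link-following spider. Everything in it is invented for the example: the LISTINGS data, the HOME_PAGE markup and the search_results function stand in for a directory site and are not taken from any real service.

```python
from html.parser import HTMLParser

# Hypothetical directory database; these entries exist nowhere as static pages.
LISTINGS = {
    "smith": "J. Smith, 12 Example St",
    "jones": "A. Jones, 3 Sample Rd",
}

# The only static page on the site: a search form with no links at all.
HOME_PAGE = '<form action="/search"><input name="surname"></form>'


def search_results(surname):
    """Builds a results page on demand, as the site's own search would."""
    entry = LISTINGS.get(surname.lower())
    body = entry if entry else "No match"
    return f"<p>{body}</p>"


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in(html):
    """Returns the links a spider could follow from the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# A spider finds nothing to follow from the home page.
crawlable = links_in(HOME_PAGE)
print(crawlable)

# A person typing a surname into the form gets a freshly built page.
print(search_results("Smith"))
```

The spider comes away with an empty list, while the listing appears only once somebody asks for it by name. Until that question is asked, there is simply no page for a search engine to index.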
As with all things, this can change given time – but the basic algorithms that make up the traditional search engines will have to change dramatically in order to harvest this type of content. Alternatively, web developers will have to give enough content in static HTML so they can be found within the surface web domain of the search engines.

