Most searchers are familiar with the visible Web: the content that can be indexed and searched by search engines. Unlike directories, which feature listings of Web sites that have been visited and reviewed by humans, search engines build their indexes by sending out a computer program known as a search engine spider. A spider starts its journey of the Web at a single Web page and follows the links that it finds. This leads it to Web sites, where it follows links within each site, finding and indexing information. It also leads it to follow links from one Web site to another, where it again follows links within that site, finding and indexing information. The process continues indefinitely as the spiders find new pages to include in their indexes.
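The link-following process described above can be sketched in a few lines of Python. This is only an illustration of the logic, not a real crawler: the URLs are made up, and the "Web" is a small in-memory dictionary standing in for pages fetched over HTTP.

```python
from collections import deque

# A toy, in-memory "Web": each page maps to the links it contains.
# A real spider would fetch these pages over HTTP; the URLs here
# are invented purely for illustration.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": [],
    "http://b.example/": ["http://b.example/news", "http://a.example/"],
    "http://b.example/news": [],
}

def spider(start_url, web):
    """Index every page reachable from start_url by following links."""
    index = []                 # pages "indexed" so far, in crawl order
    seen = {start_url}         # remember visited pages to avoid loops
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        index.append(url)      # a real engine would store the page text here
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

crawled = spider("http://a.example/", TOY_WEB)
```

Note that the spider never reaches a page that no other page links to; that blind spot is exactly what the Invisible Web exploits.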
The Invisible Web, on the other hand, consists of files, images, and Web sites that, for a variety of reasons, cannot be indexed by popular search engines like Google. Some of the most sought-after information on the Internet is contained within databases and indexes that are considered part of the Invisible Web.
In late 2000, search company Bright Planet released a study suggesting that the portion of the Web that could not be indexed by search engines was 500 times larger than the portion that could. Even today, with search engine technology moving forward at lightning speed, Bright Planet claims that the number of documents that remain unindexed stands at some 550 billion, compared to the 1 billion or so available as part of the visible Web. Additionally, the Invisible Web is estimated to be the fastest growing part of the Internet, which means that each day more and more resources are available online, but not via standard search engines. Bright Planet also claims that a full 95% of the content contained within the Invisible Web is publicly accessible information, not subject to subscription or membership fees.
Now that you have an idea of what the Invisible Web is, you are likely wondering why there are resources and information-packed Web sites that the major search engines have not added to their indexes. There are a variety of reasons, but most fall into two categories: Web pages that search engines are unable to access, and Web pages that search engines are able to access but choose not to.
Web pages that search engines cannot access
The reality of search engine technology is that it is still limited in its ability to find and spider content. Search engines find pages by following links. Imagine the way that you visit a Web site and click on navigational links, or even links within the text. A search engine follows this same process to find its way through the Web, indexing information as it goes. Now, imagine that you visit a Web site that requires you to search an online database, either by typing a query string or by selecting from a variety of form options to start a search. Since search engine spiders are incapable of "thinking," they cannot submit a query string to a database-driven site and therefore have no way of accessing the wealth of information such sites sometimes contain.
Additionally, the page that is created by your search query is nearly always dynamic in nature. This means that rather than someone designing that Web page and placing the content in it by hand, a computer database pulls information from various sources and plugs it into a Web page that is literally built microseconds before you see it. Many times these pages exist ONLY when they are created as the result of a search query. Often, the information on them changes each time they are created, making it impossible for a search engine to find the pages, let alone know what they will be about.
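A minimal sketch of that kind of dynamic page generation follows, using Python's built-in SQLite support. The table, template, and product data are all hypothetical; the point is that the HTML page exists only for the instant the query runs.

```python
import sqlite3
from string import Template

# A hypothetical product database, held in memory for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('Widget', 9.99)")

# A made-up page template; $query and $items are filled per request.
TEMPLATE = Template("<h1>Results for '$query'</h1><ul>$items</ul>")

def render_results(query):
    """Build the results page on the fly; it exists only for this request."""
    rows = conn.execute(
        "SELECT name, price FROM products WHERE name LIKE ?", (f"%{query}%",)
    ).fetchall()
    items = "".join(f"<li>{name}: ${price}</li>" for name, price in rows)
    return TEMPLATE.substitute(query=query, items=items)

page = render_results("Widget")
```

Because no static copy of `page` is ever stored at a URL a spider could follow, the content stays invisible to a link-following crawler.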
Another technological reason that some Web pages cannot be included in search results is that they require a username and password before someone can log in and view the information on the other side. Since a search engine is incapable of filling out a login request, it has no way to reach the information hidden behind that doorway. Most Web sites requiring usernames and passwords to log in contain information that cannot be found elsewhere on the Web. These sites might contain law journals, public records, or research information that is available only via a paid subscription.
Web pages that search engines will not access
Search engines are designed to index Web pages that can later be searched and accessed by their users. Some Web pages can be found by search engines but would prove nearly impossible for users to "search" at a later date. Consider, for example, a Web page made up entirely of image files. Search engines read content as HTML text; they cannot visually interpret the content of an image. Thus, on a site consisting solely of images, a search engine has no way to know what the site is about and will be unlikely to return that page as the result of a search query. For this reason, search engines sometimes choose not to index a Web page that is unlikely to be found by someone searching their index.
At this point, you may be wondering why Google has a searchable image database if Google cannot read image files. HTML code provides a special tag known as an ALT tag that can be added to the code around an image to describe the contents of that image. Although this text is rarely enough to influence a page's general ranking in the search results, Google does view it as enough information to describe a picture in an image search.
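The sketch below shows, roughly, how an image indexer might pull ALT text out of a page using Python's standard HTML parser. The page snippet and its description are made up for illustration.

```python
from html.parser import HTMLParser

# A hypothetical fragment of page HTML; filename and ALT text are invented.
PAGE = '<p>Our offices</p><img src="hq.jpg" alt="Photo of the company headquarters">'

class AltTextCollector(HTMLParser):
    """Collect the ALT text of every <img> tag, the way an image indexer might."""
    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.alt_texts.append(value)

collector = AltTextCollector()
collector.feed(PAGE)
```

The collected ALT text is the only clue the indexer has about what the image shows, which is why images without ALT text tend not to surface in image searches.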
One area of the Invisible Web that is shrinking is that of online documents. At one time, search engines could not read or index the contents of PDF files, MS Word documents, and a variety of other file types. As technology has moved forward, search engines have become more and more adept at indexing the contents of these files. Most search engines now have no problem including PDF and MS Word files in their search results. Other file types, like Flash and Shockwave, are still difficult for the search engines to read and index. While search engines like Google are making great progress in reading the contents of Flash files, they are still a long way from fully indexing that portion of the Web.
There is another type of dynamic content that is created by a database but contains the same content each time and resides at the same URL each time. This type of dynamic content can be (and often is) included in a search engine's index, but search engines often choose not to include such pages. Why? Poorly written dynamic script (or malicious programming) can lead search engine spiders into what is known as a "spider-trap." These traps can send a spider into an endless circle of Web pages that it cannot escape, thus taking valuable time away from indexing other Web pages. For this reason, many search engines either exclude dynamic pages or limit the number of dynamic pages that they spider.
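A crawler's defense against such traps can be as simple as refusing URLs that look too dynamic. The sketch below, with arbitrary thresholds chosen for illustration, skips URLs whose query strings carry too many parameters and stops following links past a fixed depth; the example URLs are invented.

```python
from urllib.parse import urlparse, parse_qs

MAX_PARAMS = 2   # arbitrary thresholds, chosen only for this sketch
MAX_DEPTH = 5

def should_spider(url, depth):
    """A simple spider-trap guard: skip URLs whose query strings look too
    complex, and stop following links past a fixed crawl depth."""
    if depth > MAX_DEPTH:
        return False
    params = parse_qs(urlparse(url).query)
    return len(params) <= MAX_PARAMS

# A stable dynamic page with one parameter: fine to spider.
ok = should_spider("http://shop.example/item?id=42", depth=1)
# A calendar-style trap spawning endless parameter combinations: skipped.
trap = should_spider("http://shop.example/cal?y=2099&m=12&d=31&view=day", depth=1)
```

Real engines use far more sophisticated heuristics, but the trade-off is the same one the paragraph above describes: some legitimate dynamic pages get excluded so that spiders do not waste their time in traps.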
The Invisible Web makes up a large enough portion of the Internet to make it worthwhile to put extra time and effort into searching it.