FAQ: The Nature of Search Engines

andreas.com FAQ: The Nature of Search Engines

Various notes on the nature of search engines (SE), their abilities, and their limits.

The Limits of a Search Engine as a Tool

  • A tool performs an action on its subject. Every tool has appropriate subjects; nothing else is a subject for that tool. Search engines (SE) are tools. They have the same nature (positive and negative, or abilities and limits) as any other tool. For example, a hammer can hammer a nail into wood. But a hammer can’t hammer a wine glass into a cloud.
  • There are some things that search engines can find. There are many things they can’t find. There is not, and never will be, a search engine that can find anything, just as there will never be a tool that can do everything.

The Tight-Knit Community (TKC) Problem

  • Google uses link analysis to evaluate the significance of a webpage.
  • Link analysis works well when the topic is clearly defined, there are significant articles about it, and it has an interconnected community.
  • But if a tight-knit community links heavily among its own pages, those links mislead SEs into ranking a page as significant, when in fact the page is irrelevant because the community is wrong.
  • For example, a page may be highly ranked for “evolution” when in fact the page is part of a biblical creationist community.
  • As another example, a webpage can be ranked highly because a bunch of kids create links to it as a spoof. A group of kids linked the phrase “miserable failure” to Bush’s webpage, and a search for “miserable failure” then returned Bush’s website as the first result.
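The tight-knit community effect can be reproduced on a toy link graph. The sketch below is only an illustration under stated assumptions: it uses HITS (hubs and authorities), a published link-analysis algorithm, as a stand-in, and it is not Google’s actual ranking. The page names (`fan_1`, `member_1`, `reference`, `target`) and the link graph are made up.

```python
def hits_authority(links, iterations=50):
    """Return HITS authority scores for a dict of page -> list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5
        hub_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return auth


# Hypothetical toy web: six unrelated pages each link once to "reference".
links = {f"fan_{i}": ["reference"] for i in range(1, 7)}
# Five community members link to each other and to "target".
members = [f"member_{i}" for i in range(1, 6)]
for m in members:
    links[m] = [other for other in members if other != m] + ["target"]

authority = hits_authority(links)
inbound = {p: sum(p in targets for targets in links.values())
           for p in ("reference", "target")}
print("inbound links:", inbound)
print("authority of target:   ", authority["target"])
print("authority of reference:", authority["reference"])
```

In this toy graph the reference page has more inbound links than the target, yet its authority score shrinks toward zero, because the densely interlinked community reinforces itself on every iteration.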

The Topic Drift Problem (TDP)

  • If the topic is vague, there aren’t good webpages about it, and there aren’t interconnected communities that discuss the issue, SEs produce weak, wrong, or random results.
  • To avoid TDP, a webpage should have a clear theme, the theme should be familiar to SEs, and the webpage should be embedded within its community (that is, significant members of the community should link to it).

Discover Already-Discovered Information

  • Search engines are good at finding what has already been discovered, identified, described, and summarized.
  • If others know about a topic, they understand the topic, and they convert that knowledge into written information, then search engines can find that knowledge.
  • However, search engines cannot discover new information. If something hasn’t yet crystallized into an idea and there aren’t articles, books, summaries, or discussions about it, then there is nothing there for search engines to find.
  • This means: Search engines are good for researching school homework. Search engines are poor for researching a graduate thesis. Search engines are useless for researching a doctoral thesis (which requires discovery of new information).

Search Engines Lose Information

  • SEs find information, but they also lose information. The results of a search depend on the SE’s algorithm. When someone searches for a phrase, the algorithm looks up matching pages in the SE’s index of the web, ranks them, and returns the results.
  • But when the algorithm changes, the results change. A page that ranked well under the previous algorithm may rank poorly, or not appear at all, under the new one.
  • SE algorithms are updated frequently and without notice. If you searched in April and found a result, you may not find it again when you search in September. There is no way to use a previous algorithm. Most users are not aware of this.
  • Commercial SEs only want to display the top 10-20 results. They are not interested in the reliability, repeatability, or comprehensiveness of results. These factors are important for academic researchers.
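A small sketch shows how an algorithm change alone can make a result disappear. Everything here is hypothetical: the index, the URLs, and both scoring functions are invented for illustration, and no real SE is known to use these formulas.

```python
# Hypothetical index: the pages themselves do not change between the two searches.
INDEX = [
    {"url": "old-classic.example", "inbound_links": 90, "days_since_update": 900},
    {"url": "big-portal.example", "inbound_links": 80, "days_since_update": 10},
    {"url": "solid-guide.example", "inbound_links": 70, "days_since_update": 600},
    {"url": "news-site.example", "inbound_links": 30, "days_since_update": 1},
    {"url": "fresh-blog.example", "inbound_links": 20, "days_since_update": 3},
    {"url": "tiny-page.example", "inbound_links": 5, "days_since_update": 400},
]


def score_april(page):
    """Assumed older algorithm: rank by inbound links only."""
    return page["inbound_links"]


def score_september(page):
    """Assumed newer algorithm: inbound links discounted by staleness."""
    return page["inbound_links"] / (1 + page["days_since_update"] / 30)


def search(index, score, top_n=3):
    """Return the URLs of the top_n pages under the given scoring function."""
    ranked = sorted(index, key=score, reverse=True)
    return [page["url"] for page in ranked[:top_n]]


april = search(INDEX, score_april)
september = search(INDEX, score_september)
lost = [url for url in april if url not in september]

print("April:    ", april)
print("September:", september)
print("Lost:     ", lost)
```

The pages in `lost` still exist and are still indexed; they simply fall off the visible list, and the searcher has no way to rerun the April scoring.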

Nouns and Adjectives

  • SEs can work well if they get good input. The best search uses a noun with adjectives that qualify the noun.
  • For example, the noun is “cat”. The adjectives qualify that noun by dividing the set into smaller sets, such as “red cat” (not black cats, white cats, or calico cats). A yet smaller set would be “big red cat” (vs. small and medium red cats).
  • This happens to work well with English, which uses nouns and adjectives (“big red cats”). But if a language doesn’t use adjectives and nouns this way, SEs will return poor results.
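The narrowing effect of adjectives can be sketched as set filtering. This is a minimal illustration, assuming a made-up list of documents and a naive matcher that requires every query word to appear; real SEs are far more elaborate.

```python
# Hypothetical mini-collection of documents.
DOCUMENTS = [
    "big red cat sleeping on a sofa",
    "small red cat playing with yarn",
    "medium red cat in the garden",
    "big black cat on a fence",
    "white cat with blue eyes",
    "calico cat and her kittens",
    "red bicycle for sale",
]


def search(query, documents=DOCUMENTS):
    """Return the documents that contain every word of the query."""
    terms = query.lower().split()
    return [doc for doc in documents
            if all(term in doc.lower().split() for term in terms)]


# Each added adjective carves a smaller set out of the previous one.
for query in ("cat", "red cat", "big red cat"):
    print(f"{query!r}: {len(search(query))} matching document(s)")
```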

The Best Problem, or, the Problem of the Best

  • You can use a set of rules (e.g., the number of inbound links, the age of the page, the frequency of updates on the page, and the authority of the inbound links) to rank pages. This kind of ranking comes from a field called bibliometrics. Library researchers use bibliometrics to evaluate the significance of research papers indexed in Chemical Abstracts (an index of chemistry research literature). This works well with a relatively small set, say, several hundred research papers on a chemical reaction. Although Google likes to present this as their algorithm, it was actually developed by library researchers in the 1920s and 30s. Google’s innovation was to copy it.
  • Bibliometrics worked in the late 90s because the web was only a few million pages. The bibliometric algorithm was capable of ranking the pages for each domain topic (cars, cats, etc.) because there were only a few thousand pages on each topic. (Pop Quiz: how many topics are there? How many kinds of things do people talk about? It’s not “millions.” Remarkably, it’s a rather low number, and there’s a clever way to find the answer. I’ll leave that to the next class.)
  • By 2010, pretty much every business was on the web. CMS tools (such as WordPress) allow anyone to create websites. Automated tools allow large companies to create tens of millions of pages. In November 2012, Eric Schmidt of Google estimated the web at 5 million terabytes (TB) of data (roughly 4,900 petabytes or 4.8 exabytes in binary units). This created a new problem.
  • Let’s say there are one hundred web pages on a topic. You, a panel of experts, or a ranking algorithm can easily pick out the ten best pages. What happens if there are a thousand pages? There will now be several dozen best pages. Let’s increase the topic to several million pages. Several hundred pages will all be the best, but there’s only room for ten on the first page of the SE.
  • At a certain point, the number of best results becomes too large and the list becomes random. When 1,000 pages are all very good, nobody can review all 1,000 to find which is the best of the best. You end up picking several good results and you ignore the rest. It’s likely there are better results, but they are buried.
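The crowding described above can be simulated. The sketch below rests on loud assumptions: page quality is drawn uniformly at random between 0 and 1, any page scoring 0.99 or higher counts as “one of the best,” and the first results page has ten slots. None of these numbers come from a real SE.

```python
import random

FIRST_PAGE_SLOTS = 10      # assumed size of the first results page
EXCELLENT_CUTOFF = 0.99    # assumed score at which a page counts as "one of the best"


def count_excellent(n_pages, rng):
    """Return (excellent, buried) for a topic with n_pages randomly scored pages."""
    scores = [rng.random() for _ in range(n_pages)]
    excellent = sum(1 for s in scores if s >= EXCELLENT_CUTOFF)
    # Excellent pages that cannot fit on the first results page are buried.
    buried = max(0, excellent - FIRST_PAGE_SLOTS)
    return excellent, buried


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: count_excellent(n, rng) for n in (100, 1_000, 100_000)}

for n, (excellent, buried) in results.items():
    print(f"{n:>7} pages: {excellent} excellent, {buried} buried beyond the first page")
```

With a small topic every excellent page fits on the first page. As the topic grows, the number of excellent pages grows with it while the number of slots stays at ten, so most of the best pages are never seen.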

Information Landscapes

  • SEs can find the easy stuff: you can search for “organic cat food” and so on. But it’s very hard to find meaningful results for vague concepts such as “performance enhancement.” (Try explaining that term to someone who speaks German or Chinese.)
  • How would you find information in areas where there is little information? How would you search for “something that few people know about”?
  • Information can be seen as a landscape: lots of related information appears as mountain ranges, with associated hills, valleys, cliffs, and so on. And there are deserts: vast areas where there is only vague information.
  • See examples of information landscapes at cybergeography.org/atlas/info_landscapes.html
  • Google and other SEs use “themes” as a concept to cluster similar types of information. The general concept “feline” is the cluster for house cats, tigers, and lions. Within the feline cluster, there is the house cats cluster. That holds tabby cats, Persian cats, calico cats, and so on.
  • But there aren’t clusters for vague items, so they can’t be found.
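The cluster idea can be sketched as a nested structure. This is only an illustration of the feline example above: the `THEMES` tree and the `find_cluster` helper are hypothetical, and nothing here reflects how any real SE stores its themes.

```python
# Hypothetical theme tree built from the feline example.
THEMES = {
    "feline": {
        "house cats": {
            "tabby cats": {},
            "Persian cats": {},
            "calico cats": {},
        },
        "tigers": {},
        "lions": {},
    },
}


def find_cluster(term, tree, path=()):
    """Return the chain of clusters leading to term, or None if no cluster holds it."""
    for name, children in tree.items():
        here = path + (name,)
        if name == term:
            return here
        found = find_cluster(term, children, here)
        if found is not None:
            return found
    return None


print(find_cluster("calico cats", THEMES))
print(find_cluster("performance enhancement", THEMES))
```

A well-defined item resolves to a path through the clusters, while a vague concept that belongs to no cluster returns nothing at all.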