I was talking with several friends a few nights ago about Light. It’s (yet another) new search engine (sorta). You ask questions and the answers are curated by humans. My friends were talking about the challenge of indexing the web. However, indexing isn’t the problem.
It’s pretty easy to index the web. There are 175m domain names, but only about one million sites matter, and of those only about 500,000 have more than a few thousand monthly visitors. A few hundred sites account for practically all of the traffic. You could erase 174m domain names and nobody would notice.
Secondly, you wouldn’t need to index every page in that pool of 500,000 domain names. Again, 80-90% of the traffic goes to a handful of landing pages at each domain name. So let’s guess the total number of landing pages is around 2,500,000 (5 pages x 500k domains). It’s somewhere in that area.
So the problem isn’t the number of pages.
The problem isn’t indexing. That’s actually pretty easy. Any computer science graduate student could write an algorithm to index those few million pages. There’s only a few criteria: keywords, incoming links from reputable sites, authority, age of domain, etc. That could be sketched out in a few days and done in a few weeks.
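To make that concrete, here is a minimal sketch of what such a toy index could look like. It is not how any real search engine works: the `Page` fields, the `build_index`, `score`, and `search` names, and the weights (0.5 per reputable link, 0.1 per year of domain age) are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    reputable_inlinks: int    # incoming links from reputable sites (assumed known)
    domain_age_years: float   # age of the domain (assumed known)


def build_index(pages):
    """Inverted index: map each lowercase word to the URLs whose text contains it."""
    index = defaultdict(set)
    for page in pages:
        for word in page.text.lower().split():
            index[word].add(page.url)
    return index


def score(page, query_words):
    """Toy ranking: keyword hits plus bonuses for links and domain age.

    The weights are hypothetical, picked only to show the idea.
    """
    words = page.text.lower().split()
    keyword_hits = sum(words.count(w) for w in query_words)
    return keyword_hits + 0.5 * page.reputable_inlinks + 0.1 * page.domain_age_years


def search(query, pages, index):
    """Return URLs matching any query word, best score first."""
    query_words = query.lower().split()
    by_url = {p.url: p for p in pages}
    candidates = set()
    for w in query_words:
        candidates |= index.get(w, set())
    return sorted(candidates, key=lambda u: score(by_url[u], query_words), reverse=True)
```

Notice that a page stuffed with keywords but with no reputable links and a brand-new domain loses to an older, well-linked page under these weights. Notice also how easy every one of those signals would be to fake at scale.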
Yes, there’s proof for this. This type of indexing is handled by the internal search engines within large corps. A large corp may have several million pages which are easily indexed by internal search tools such as Microsoft FAST, Attivio, etc. (and, yes, the Google internal search tool).
So the problem for global search engines isn’t indexing or algorithms. Their problem is hackers, spammers, and scammers. Because it’s so lucrative to have the top position which gets all the traffic (the 80/20 Rule, which is actually the 99/1 Rule on the web), spammers and scammers do whatever possible to get to position 1. So search engines must focus not on legitimate webmasters, but on illegitimate webmasters.
In internal search, there is a way to block spammers. It’s not called an algorithm. It’s called the HR dept. Namely, if Employee Jones adds 100,000 pages of porn to the corporate site and uses misleading keywords (say, the CEO’s name) to get position #1 in the corporate search engine, well, his career will last only as long as it takes HR to send Security to taser him.
But on the Wild Wild Web, there’s no control. There’s no way to taser a hacker. Hackers attack Google with tens of thousands of pages. If spammers can get top positions, even for a few hours, it can produce revenue. So search engines have to jump lively every day to block spam and scam.
The challenge isn’t the indexing. The challenge is to protect legitimate pages from being pushed out by illegitimate pages.
So Google, Bing, Yandex, and Baidu have all shifted to using human reviewers. The reviewers look to see if it’s a legitimate page. They pick the best pages to show at the top. The search engines are all looking to develop an AI system. There are differences among the search engines: some use the humans to train the AI; others use the humans to score the AI’s results. Google has 8,000 humans; the others have 2,000 to 5,000 humans.
What does this mean for SEO? The indexing of pages (i.e., adding pages to the search engine’s database and sorting them) is done by technical means, namely, keywords, page speed, authority links, and so on. However, for any topic, there may be 10,000 relevant pages but only a few dozen “best pages”. So the search engine’s humans and the AI look for those best pages (and also taser the scammers). This means the webpage must both get the technical issues right and show authoritative quality.