Thursday, August 2, 2007

How the Crawler Works


As promised, I will tell you why your ads aren’t relevant. But first, let me explain how the crawler knows the content of a page. A crawler is a program that reads a page and works out what it is about. It simply reads every word, then looks at how the words relate to each other to decide what the content is.

Basic example:


Here are 3 pages that all contain the word “Java” but have different content.

Page #1 has the words: “Java”, “C++”, “Programming”, and “Computer”.

Page #2 has the words: “Java”, “Beans”, “Coffee”, and “Cup”.

Page #3 has the words: “Java”, “Island”, and “Indonesia”.

The crawler will decide that the 1st page is about a programming language. So there may be ads like “Java Tutorials”, “C++ Programming”, etc.

The 2nd page will be considered to be about coffee. So there may be ads like “Quality Beans”, “Cheap Coffee”, etc.

The 3rd page will be considered to be about my homeland^^ (Yes, I used to live in Java). So there may be ads like “A Tour in Java Towns”, “Hotel in Yogyakarta”, “Javanese Culture”, etc.
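
To make the idea concrete, here is a tiny Python sketch of guessing a topic from the words that appear next to “Java”. This is only my illustration, not the real AdSense algorithm (which is not public). The topic lists and the guess_topic function are made up for this example.

```python
# Hypothetical topic vocabularies, invented for this example.
# A real crawler would use far larger lists and smarter scoring.
TOPICS = {
    "programming": {"c++", "programming", "computer"},
    "coffee": {"beans", "coffee", "cup"},
    "travel": {"island", "indonesia"},
}


def guess_topic(words):
    """Return the topic whose vocabulary shares the most words with the page."""
    seen = {w.lower() for w in words}
    scores = {topic: len(seen & vocab) for topic, vocab in TOPICS.items()}
    best = max(scores, key=scores.get)
    # If no topic word appears at all, the page stays ambiguous.
    return best if scores[best] > 0 else None


pages = {
    "Page #1": ["Java", "C++", "Programming", "Computer"],
    "Page #2": ["Java", "Beans", "Coffee", "Cup"],
    "Page #3": ["Java", "Island", "Indonesia"],
}

for name, words in pages.items():
    print(name, "->", guess_topic(words))
```

Notice that the word “Java” alone decides nothing here. Only the surrounding words do, which is why a page with very little text can end up with unrelated ads.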


The crawler report is updated weekly, so please be patient.

There are 2 crawlers: the Google crawler and the AdSense crawler (there is also a Sitemaps crawler, but let’s put that aside for now). They work separately, but they will not request the same page twice (so that the publisher will not run out of bandwidth). However, the site with “www” and the site without “www” are requested separately. The AdSense crawler will only access URLs that request Google AdSense ads. Both crawlers will only access the original pages.
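
One way to picture this is a page cache shared by both crawlers and keyed by the exact URL. The Python sketch below is only my guess at the behavior described above, not how Google actually implements it. The SharedPageCache class and fake_fetch function are hypothetical names.

```python
class SharedPageCache:
    """A cache that two crawlers could share so a page is fetched only once."""

    def __init__(self):
        self._pages = {}
        self.fetch_count = 0

    def get(self, url, fetch):
        # The exact URL string is the key, so the "www" and non-"www"
        # addresses are treated as two different pages.
        if url not in self._pages:
            self._pages[url] = fetch(url)
            self.fetch_count += 1
        return self._pages[url]


def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption for this example).
    return "content of " + url


cache = SharedPageCache()
cache.get("http://example.com/", fake_fetch)      # first crawler: real fetch
cache.get("http://example.com/", fake_fetch)      # second crawler: served from cache
cache.get("http://www.example.com/", fake_fetch)  # "www" variant: fetched again
print(cache.fetch_count)
```

The same page costs the publisher one request, but the “www” and non-“www” addresses each cost their own request.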


The crawler won't access pages or directories prohibited by a robots.txt file (robots.txt keeps the crawler away from pages that don't need to be crawled).
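
If you want to see how a robots.txt rule is checked, here is a small Python sketch using the standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the “ExampleBot” user agent are all assumptions made up for this example. A real crawler sends its own user agent name.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks one directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="ExampleBot"):
    """Return True if the robots.txt rules above let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/page.html"))
```

The first URL is allowed and the second is blocked, because it sits inside the disallowed /private/ directory.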
