If you run a crawler against your own site, it will generally crawl all your pages and then give you a report. It is tempting to think that Googlebot works the same way, but it doesn't. Googlebot doesn't crawl your entire site, wait for a while, and then come back and crawl your entire site again.
Crawling the web is a big job
Google has over 100 trillion pages of content in its search index. Even if Googlebot crawled a million URLs every second, it would take more than three years to crawl that much. By the time Googlebot was done with the crawl, a lot of the web would have changed and the search results would be out of date.
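The arithmetic behind that estimate is easy to check. The snippet below is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes a constant crawl rate with no re-crawling.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def years_to_crawl(total_pages: int, urls_per_second: int) -> float:
    """Return how many years one full pass takes at a constant crawl rate."""
    return total_pages / urls_per_second / SECONDS_PER_YEAR


index_size = 100 * 10**12  # 100 trillion pages, the figure quoted above
crawl_rate = 10**6         # one million URLs per second

print(f"{years_to_crawl(index_size, crawl_rate):.2f} years")  # prints "3.17 years"
```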
Googlebot has far more capacity than that. Most pages get crawled every few months. Some pages get crawled daily or even hourly. In addition to crawling all the pages in the Google search index, Googlebot encounters lots of URLs that it doesn't index:
- Multiple URLs for the same content
- Redirecting URLs
- Pages too low quality to index
- Supporting resources for pages such as CSS and JavaScript
- Machine-generated content and spam
Some sites publish an infinite number of URLs. It is impossible for Googlebot to crawl the entire world wide web. Even crawling all the unique and indexable content is an enormous undertaking.
Googlebot has to work differently
Googlebot is constantly crawling. It never stops. It has to be crawling 24 hours a day, 7 days a week to keep up with all the changes happening on the world wide web.
Googlebot has to pick and choose where to focus its crawling. It has to make trade-offs between crawling new URLs and re-crawling URLs to see if the content has changed.
Googlebot crawls pages rather than sites. It views the web as a vast interlinked series of documents. Rather than crawling sites, it crawls all URLs across all sites in the same queue.
Googlebot doesn't crawl everything fairly and evenly. It uses reputation (Google's PageRank) and content change frequency to determine whether to crawl a URL and how often to re-crawl it. Within the same site you may have some important pages that get crawled frequently and some less important pages that get crawled very infrequently.
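To make the idea concrete, here is a toy scheduler that keeps URLs from every site in one shared queue and re-crawls each on its own interval. It is a sketch under stated assumptions, not Google's actual algorithm: the `reputation` score, the `changes_per_day` estimate, the interval formula, and the clamping limits are all invented for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlTask:
    next_crawl_day: float                          # only field used for ordering
    url: str = field(compare=False)
    reputation: float = field(compare=False)       # hypothetical score in (0, 1]
    changes_per_day: float = field(compare=False)  # hypothetical change estimate


def recrawl_interval_days(reputation: float, changes_per_day: float) -> float:
    """Made-up formula: reputable, fast-changing pages get shorter intervals."""
    raw = 1.0 / max(reputation * changes_per_day, 1e-9)
    # Clamp between roughly hourly and roughly every six months.
    return max(1 / 24, min(180.0, raw))


def simulate(pages, days):
    """Crawl whichever URL is due next, from one shared queue, for `days` days."""
    queue = [CrawlTask(0.0, url, rep, chg) for url, rep, chg in pages]
    heapq.heapify(queue)
    counts = {url: 0 for url, _, _ in pages}
    while queue and queue[0].next_crawl_day <= days:
        task = heapq.heappop(queue)
        counts[task.url] += 1
        task.next_crawl_day += recrawl_interval_days(
            task.reputation, task.changes_per_day
        )
        heapq.heappush(queue, task)
    return counts


# URLs from two different (hypothetical) sites share the same queue.
pages = [
    ("https://example.com/", 0.9, 5.0),
    ("https://example.com/archive/2009", 0.1, 0.01),
    ("https://example.org/news", 0.5, 2.0),
]
for url, crawls in simulate(pages, days=30).items():
    print(crawls, url)
```

Over the 30 simulated days the home page is fetched far more often than the news page, while the old archive page is fetched only once, even though it lives on the same site as the home page.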
What will your site experience?
When you launch a brand new website, Googlebot may come and crawl hundreds or even thousands of pages in a row. After that initial crawl, Googlebot won't ever crawl your whole site at once. Instead, Googlebot will return and crawl your existing pages periodically, each on its own schedule. Any time you create a new page, Googlebot will crawl it fairly quickly after re-crawling any of your other pages that link to it.
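The way a new page gets picked up can be sketched as link discovery during a re-crawl. This is an illustrative model only: `recrawl_and_discover`, the `fetch_links` callback, and the example URLs are hypothetical names, and a real crawler would fetch and parse HTML rather than read from a dictionary.

```python
from collections import deque


def recrawl_and_discover(page_url, fetch_links, known_urls, fresh_queue):
    """Re-crawl one known page and queue any linked URLs not seen before."""
    discovered = []
    for link in fetch_links(page_url):
        if link not in known_urls:
            known_urls.add(link)
            fresh_queue.append(link)  # new URLs wait here to be crawled
            discovered.append(link)
    return discovered


# Hypothetical site: the blog index now links to a newly published post.
site_links = {
    "https://example.com/blog/": [
        "https://example.com/blog/old-post",
        "https://example.com/blog/new-post",
    ],
}
known = {"https://example.com/blog/", "https://example.com/blog/old-post"}
fresh = deque()

found = recrawl_and_discover(
    "https://example.com/blog/",
    lambda url: site_links.get(url, []),
    known,
    fresh,
)
print(found)  # ['https://example.com/blog/new-post']
```

In this model, the new post is only discovered once a page that links to it comes up for re-crawling, which is why linking to new pages from frequently crawled pages helps them get found sooner.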
As far as I can tell, Googlebot has at least three crawl modes:
- Fresh mode, where Googlebot quickly crawls hundreds or even thousands of URLs that it has never crawled before.
- Recrawl mode, where Googlebot comes back and re-crawls pages on a schedule based on their reputation and how likely they are to change. This is Googlebot's "normal" mode, and it does the majority of the crawling.
- Stale mode, where Googlebot quickly re-crawls old, non-indexed URLs that have no current links pointing to them and that it may not have crawled in a long time. In this mode, Googlebot tends to crawl URLs sorted by length, with the shortest URLs crawled first.
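The shortest-first ordering described for stale mode is simple to reproduce if you want to predict which of your old URLs would be hit first. This is a minimal sketch based only on the observation above. The function name and sample URLs are made up, and it assumes "length" means the character count of the full URL.

```python
def stale_crawl_order(urls):
    """Order URLs shortest first; ties keep their original order (stable sort)."""
    return sorted(urls, key=len)


stale_urls = [
    "https://example.com/products/widgets?color=blue&page=7",
    "https://example.com/a",
    "https://example.com/old/about-us.html",
]

for url in stale_crawl_order(stale_urls):
    print(len(url), url)
```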
This article was written as part of a series about SEO myths.