If you disallow a page in robots.txt, Google may choose to index the page anyway. Google claims to honor robots.txt, so how is that possible?
Consider this robots.txt file:
User-agent: *
Disallow: /some-page.html
Google won't crawl some-page.html, but it may index it anyway. Robots.txt only prevents crawling; it doesn't prevent indexing. Google may index the URL with keywords:
- From the URL itself ("some" and "page")
- From the anchor text of links to the page
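The crawl-blocking behavior described above can be checked with Python's standard-library robots.txt parser. This is a minimal sketch, assuming the example.com host and the two-line robots.txt shown earlier:

```python
from urllib import robotparser

# Parse the same rules as the robots.txt example above.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /some-page.html",
])

# The disallowed page may not be fetched by a compliant crawler...
print(rp.can_fetch("*", "https://example.com/some-page.html"))  # False

# ...but any other page on the site is fair game.
print(rp.can_fetch("*", "https://example.com/other.html"))  # True
```

Note that the parser only answers "may I fetch this URL?" — nothing in the protocol says anything about indexing, which is exactly the gap this article describes.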
When Google chooses to index a page blocked by robots.txt, it will show up in the search results with the message "A description for this result is not available because of this site's robots.txt — learn more."
It is important to note that this happens rarely. It isn't a great user experience when users see that message in the search results, so Google usually chooses not to index pages that are disallowed by robots.txt. Google typically only indexes blocked URLs when:
- The URL has several external links.
- Google thinks people may be searching for the resource.
- There are no other unblocked pages on your site for the same keywords.
Google usually uses this to index the home page when the entire site is blocked by robots.txt. Google does this because some important sites don't allow any crawling. At one point the California Department of Motor Vehicles website was completely blocked by robots.txt, and Google used this technique to allow the site to be found anyway.
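For reference, a robots.txt that blocks an entire site looks like this (a lone slash matches every path):

```
User-agent: *
Disallow: /
```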
If you have a page that you never want to appear in search results, you need to do something other than disallowing in robots.txt:
- Password protect the page
- Use the robots meta noindex tag:
<meta name="robots" content="noindex">
- Use a noindex HTTP header:
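The header in question is X-Robots-Tag, which Google documents as the HTTP equivalent of the robots meta tag:

```
X-Robots-Tag: noindex
```

How you send it depends on your server; in nginx, for example, it could be added with `add_header X-Robots-Tag "noindex";` in the relevant location block.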
If you do any of these, you must allow the URL to be crawled in robots.txt. If the URL is disallowed, Google won't crawl it to discover that it shouldn't index it.
For further reading
This article was written as part of a series about SEO myths.