Block Crawlers from HTTPS with Robots.txt

If you allow Googlebot and other search engine crawlers to index both the HTTPS and HTTP versions of your website, you're going to run into some SEO problems. Some search engines treat https://example.com/test.html and http://example.com/test.html as different pages, which creates duplicate content issues and lowers each page's worth because incoming link strength and similar signals are, in effect, split between the two versions.

SEO consultants I was working with on a site recently advised me to simply block crawler access to the HTTPS version of the site to clear this up. The problem is that robots.txt has no way to specify different rules depending on whether the connection is secure. And on my server, both https://example.com/robots.txt and http://example.com/robots.txt resolved to the same file, /export/example.com/robots.txt. So, I had to get creative...

First, I created a file called robots.php in /export/example.com/ with the following contents:

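<?php
// Serve robots.txt dynamically so the rules can differ between HTTP and HTTPS.
header("Content-Type: text/plain; charset=utf-8");
if ($_SERVER['SERVER_PORT'] == 443) {
	// Request arrived on the standard HTTPS port: block all crawlers.
	echo "User-agent: *\n";
	echo "Disallow: /\n";
} else {
	// Plain HTTP: an empty Disallow allows everything to be crawled.
	echo "User-agent: *\n";
	echo "Disallow: \n";
}
?>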

Then, I modified the .htaccess file so that robots.php would be called instead of robots.txt:

RewriteEngine On
RewriteBase /

RewriteRule ^robots\.txt$ /robots.php [L]

I was leery about putting the disallow all in there, but I confirmed in Google's help documentation that this was the appropriate way to do it. And, after implementing it, we saw great results: the HTTPS versions disappeared from the rankings and their HTTP counterparts rose quickly.

Note: Another way of doing this would be to use .htaccess to serve a different text file (like robots_https.txt) when the site is accessed via HTTPS. I like the PHP approach, though, because it offers more flexibility for other needs (like disallowing crawling on the dev server while allowing it on the live server).
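
For reference, the .htaccess alternative might look something like the sketch below. It assumes mod_rewrite is enabled, that Apache sets the HTTPS variable to "on" for secure requests, and that a robots_https.txt file (the example name from the note above) sits in the site root and contains the disallow-all rules.

```apache
RewriteEngine On
RewriteBase /

# Assumption: robots_https.txt exists in the site root and contains
#   User-agent: *
#   Disallow: /
# Only rewrite when the request came in over HTTPS.
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_https.txt [L]
```

Requests over plain HTTP don't match the condition, so they fall through to the regular robots.txt file.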


This post was published on December 3rd, 2010 by Robert James Reese in the following categories: .htaccess, PHP, robots.txt, and SEO. Before using any of the code or other content in this post, you must read and agree to our terms of use.