Use Robots.txt Wisely To Prevent Duplicate Content

Oct 11th, 2008 | By CJ | Category: article writing

There are so many differing opinions about duplicate content that I thought I would revisit the subject and suggest a way to avoid the problem altogether, at least on your own website.
By using a robots.txt file in your root directory and robots meta tags on your pages, you can keep the search engine robots from crawling or indexing certain pages.

According to Tony Murphy,

Search engine robots want your Wordpress blog content. They are programmed to crawl your site, look at everything and report back to the Master Indexer with their findings. The Master Indexer then makes sure that your content can be found. However there are some things that robots in their relentless content crunching march should not have access to. For example the indexing of duplicate content on your blog can lead to the dilution of your blogs authority.

As you know, every category you assign a blog post to counts as a separate page as far as the robots are concerned. If you assign a post to more than one category, the robots think you’ve got that many pages that say the same thing. The first one they find will be given more weight than the others, and that’s about as far as it goes for being “penalized.” When your blog is fairly new, this isn’t as big a problem as it will be when the blog gets older and bigger. Duplicate content pages just tend to slow things down, as far as your page ranking goes.

If you block certain pages, categories and even directories, the robots never see the duplicate content and you won’t be “penalized.” Any other sites out there that have your content are a different story, of course. This advice only concerns your own website, where your articles should be posted first, before you send them out to article directories and other places. Remember, the first instance of a page that the robots find will be given the most weight. I have said all along: post your articles to your own site first, wait until the spiders have indexed your page, then submit it to other sites.

To create a robots.txt file, all you need is a text editor (like Notepad). Open a new file, copy and paste the code below and save the file as robots.txt. Then upload the file to the root directory of your website (the root directory is usually the public_html directory, where www.your_website.com resides).

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Allow: /wp-content/uploads

Let’s look at each of those lines.

  • User-agent is the robot/spider. The “*” means “any” spider from any search engine. You can specify particular search engines, but, for this purpose, you will want to tell ALL spider/robots to do the same thing.
  • Disallow: - Tells the robot/spider where it is NOT allowed to go.
  • /wp-admin, /wp-includes, /wp-content/plugins, /wp-content/cache, /wp-content/themes - these are directories that the robot/spider is not allowed to visit. While some of these have nothing to do with the duplicate content issue, nevertheless, they are some places that the robots/spiders just don’t need to go. No one needs to know what plugins you are using, for instance.
  • Allow: /wp-content/uploads - this line lets the robots spider and index your uploads directory. If there is something in there you do not want indexed (download pages, for instance), change Allow to Disallow.
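
If you want to double-check rules like these before you upload them, here is a minimal sketch using Python's standard urllib.robotparser module. The sample paths and the helper names (load_rules, is_allowed) are made up for illustration, and Python's parser is only an approximation of how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rules from the article, copied as-is.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Allow: /wp-content/uploads
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, path, agent="*"):
    """Return True if the given user agent may fetch the given path."""
    return parser.can_fetch(agent, path)


rules = load_rules(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each kind of rule.
SAMPLE_PATHS = [
    "/wp-admin/index.php",
    "/wp-content/plugins/some-plugin/readme.txt",
    "/wp-content/uploads/photo.jpg",
    "/2008/10/my-post/",
]

for path in SAMPLE_PATHS:
    verdict = "allowed" if is_allowed(rules, path) else "blocked"
    print(f"{path}: {verdict}")
```

Any path that matches none of the Disallow lines, such as an ordinary post URL, is allowed by default.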

As for individual pages, you can include a meta tag in the page’s <head> section, written specifically for the robots to find and pay attention to. These are your options:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    This tells the robots not to index the particular page they find this instruction on and not to follow any links leading away from the page.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
    This tells the robots not to index the page, but to go ahead and follow the links out of the page.
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
    This tells the robots to index the page, but NOT to follow links out of the page.
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
    And lastly, this tells the robots to both index the page AND follow links out of it.
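
To see which of these instructions a page actually carries, you can read the tag back out of the HTML. The following is a rough sketch using Python's standard html.parser module. The sample page, the class name RobotsMetaParser and the helper functions are all hypothetical, and a real robot may handle odd markup differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from a <meta name="robots"> tag, if one exists."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works too.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives = [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def robots_directives(html):
    """Return the robots directives in lowercase, or an empty list if there is no tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html):
    return "noindex" not in robots_directives(html)


def may_follow(html):
    return "nofollow" not in robots_directives(html)


# Hypothetical page that should stay out of the index while its links are followed.
SAMPLE_PAGE = """<html><head>
<title>Category archive</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head><body><a href="/some-post/">A post</a></body></html>"""

print(robots_directives(SAMPLE_PAGE), may_index(SAMPLE_PAGE), may_follow(SAMPLE_PAGE))
```

A page with no robots meta tag is treated here as index, follow, which matches the default behavior of the major search engines.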

You see, the issue of duplicate content is NOT about how many websites out there have the same content. No, no, no! It’s about duplicate content on pages within the same website! And if you blog, and assign your posts to more than one category, then you are creating duplicate pages (the single post pages) and Google will “penalize” you for that by giving the extra copies less weight.
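
One way to keep the robots away from category pages is a Disallow line for the category path. The sketch below assumes your category archives live under /category/, which is the usual WordPress layout with pretty permalinks turned on, so check your own URLs first. The two sample URLs and the can_crawl helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: category archives are served under /category/. Adjust to your site.
CATEGORY_RULES = """\
User-agent: *
Disallow: /category/
"""


def can_crawl(rules_text, path, agent="*"):
    """Return True if the rules let the given user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical URLs: one category archive and one single post page.
archive_blocked = not can_crawl(CATEGORY_RULES, "/category/article-writing/")
post_allowed = can_crawl(CATEGORY_RULES, "/2008/10/11/use-robots-txt-wisely/")

print("category archive blocked:", archive_blocked)
print("single post allowed:", post_allowed)
```

The single post page stays open to the robots, so the article itself can still be indexed once.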

It’s still a good idea to post an article to your website before you send it out to the article directories, but that’s not as big an issue, and this is according to Google!

