How to prevent unwanted search bots from indexing your site

Indexing a site with a large number of pages can take a long time and put a heavy load on the server.
While indexing, search bots can send a very large number of requests to your site at the same time, and this is what causes the load. There are many search robots (Google, Yahoo, Yandex, Mail.RU, ...), and blocking them from the site completely is usually the wrong approach, since they benefit your resource.

Solution:
Create a file named "robots.txt" in the root directory of your site and add the following parameters to it:

User-agent: *
Crawl-delay: 10

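(User-agent specifies which search robot the rules that follow apply to; * means all robots. Crawl-delay sets the minimum interval, in seconds, between successive page requests from that robot. Not every crawler honors Crawl-delay; Google's crawler, for example, ignores it.)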
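One way to sanity-check these two lines before uploading the file is to feed them to a robots.txt parser. The sketch below uses Python's standard urllib.robotparser module. The bot name ExampleBot is a made-up placeholder, and real crawlers may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same two directives as in the robots.txt above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; the * group applies to it.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

If the parser reports no delay (None), the directive was most likely not attached to a User-agent line.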

We also recommend disabling indexing of unnecessary directories, such as directories with images, caches, etc.

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /logs/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/

You can block an unwanted bot from indexing the site in robots.txt:

User-agent: bingbot
Disallow: /
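The effect of Disallow rules can be checked offline in the same way. The following is a minimal sketch that assumes a shortened rule set (two directories plus the bingbot block) and uses a hypothetical crawler name, ExampleBot, to stand for "any other robot". It shows how Python's urllib.robotparser reads the rules, which is not necessarily how every search engine does.

```python
from urllib.robotparser import RobotFileParser

# Shortened, assumed version of the rules above: two directories and one bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cache/
Disallow: /tmp/

User-agent: bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


# ExampleBot is hypothetical; it falls under the "User-agent: *" group.
for agent, path in [
    ("ExampleBot", "/index.php"),
    ("ExampleBot", "/cache/page.html"),
    ("bingbot", "/index.php"),
]:
    print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

Keep in mind that robots.txt is only a request: well-behaved robots follow it, but nothing on the server enforces it.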

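You can also block a bot in .htaccess. First mark its requests with an environment variable. The pattern is an unanchored regular expression, so it must not start with ^ (the User-Agent string begins with "Mozilla/5.0", not with the bot name):

SetEnvIfNoCase User-Agent "bingbot" search_bot

The variable only marks the request; a rule such as Deny from env=search_bot, as in the complete example further down, is what actually refuses it.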

You can find the correct name of the robot for which you want to use this or that rule in the access logs. As an example, here is an excerpt from the access logs:

125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 93488 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 110513 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Correct robot name: bingbot/2.0
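On a busy site it is easier to pull the robot names out of the log with a script than to read them by eye. The sketch below assumes the log uses the combined format shown in the excerpt (User-Agent as the last quoted field) and that the bot announces itself with a name/version token after "compatible;". Both are assumptions, and not every robot follows that convention.

```python
import re
from collections import Counter

# The two sample lines from the excerpt above.
LOG_LINES = [
    '125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 93488 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 110513 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]

# In the combined log format the User-Agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
# Assumed convention: "compatible; <name>/<version>" inside the User-Agent.
BOT_TOKEN = re.compile(r'compatible;\s*([A-Za-z0-9_.-]+/[0-9.]+)')


def bot_names(lines):
    """Count name/version tokens found in the User-Agent field of each line."""
    counts = Counter()
    for line in lines:
        ua = UA_FIELD.search(line)
        if ua is None:
            continue
        token = BOT_TOKEN.search(ua.group(1))
        if token:
            counts[token.group(1)] += 1
    return counts


for name, hits in bot_names(LOG_LINES).most_common():
    print(name, hits)
```

To process a real log, pass an open file object instead of LOG_LINES, since the function only needs an iterable of lines.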
If we wanted to block it in .htaccess, the rule would look like this:

# Apache 2.2 syntax; on Apache 2.4 these directives require mod_access_compat
SetEnvIfNoCase User-Agent "bingbot/2.0" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
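Before relying on such a rule, it helps to confirm that the pattern really matches the User-Agent string from the log. SetEnvIfNoCase treats its pattern as a case-insensitive regular expression that may match anywhere in the header. The sketch below approximates that with Python's re module; Apache uses a different regex engine, so treat it as a rough check rather than an exact emulation. The helper name would_match is invented for this example.

```python
import re

# User-Agent string taken from the access log excerpt above.
USER_AGENT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


def would_match(pattern, user_agent=USER_AGENT):
    """Approximate SetEnvIfNoCase: unanchored, case-insensitive regex search."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


print(would_match("bingbot/2.0"))  # True: found in the middle of the string
print(would_match("BingBot"))      # True: letter case is ignored
print(would_match("^bingbot"))     # False: the string starts with "Mozilla/5.0"
```

The last line shows why a pattern anchored with ^ does not catch this bot: the robot name is not at the beginning of the header.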

If the blocking options above do not solve the problem, you can block the bot by IP address by adding the following to .htaccess:

Deny from 125.40.77.104

where 125.40.77.104 is the bingbot IP address that we found in the access logs.
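To decide which addresses are worth blocking, you can tally requests per client IP. This is a minimal sketch that assumes the client address is the first whitespace-separated field of each log line, as in the excerpt above. The third sample line (address 192.0.2.10, robot ExampleBot) is invented for illustration only.

```python
from collections import Counter

SAMPLE_LOG = [
    '125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 93488 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '125.40.77.104 - - [08/Feb/2017:12:05:01 +0200] "GET your_site/ HTTP/1.0" 200 110513 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    # Hypothetical line, not from the original log.
    '192.0.2.10 - - [08/Feb/2017:12:05:02 +0200] "GET your_site/ HTTP/1.0" 200 512 "-" '
    '"ExampleBot/1.0"',
]


def requests_per_ip(lines):
    """Count requests per client IP (the first field of each log line)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Busiest addresses first.
for ip, hits in requests_per_ip(SAMPLE_LOG).most_common():
    print(ip, hits)
```

Large crawlers usually work from many addresses that change over time, so an IP-based block tends to need regular review.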

