Allow or Deny Search Engines by Using Robots.txt File

Saturday, 28 September 2013

What is Robots.txt?
Website owners use the /robots.txt file to give web robots instructions about their site; this convention is called the Robots Exclusion Protocol. Website administrators use the robots.txt file to allow or deny search engines. If you have portions of a website that you do not want search engines to crawl, you can list them in a robots.txt file that dictates which search engines are allowed or disallowed from visiting specific folders or files.

A robots.txt file supports several options that explicitly deny or allow specific search bots access to certain folders or files. The simplest robots.txt file uses two rules:

User-agent: the robot the following rule applies to
Disallow: the URL you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
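
As a rough illustration of the entry format described above, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module and asks which URLs each robot may fetch. The file content, the paths, the example.com domain, and the helper name is_allowed are all assumptions made up for this example, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: two entries separated by a blank line.
# The first entry names two user-agents and carries two Disallow lines.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /drafts/
Disallow: /tmp/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "https://www.example.com"  # placeholder domain
    checks = [
        ("Googlebot", "/drafts/a.html"),
        ("Bingbot", "/tmp/cache.html"),
        ("Googlebot", "/index.html"),
        ("SomeOtherBot", "/private/data.html"),
        ("SomeOtherBot", "/drafts/a.html"),
    ]
    for agent, path in checks:
        print(agent, path, is_allowed(agent, site + path))
```

With this parser, a robot that matches a named entry follows that entry only, and the "*" entry applies to robots that match no named entry.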

Two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

Can I block the Bad Robots?
In theory, yes; in practice, no. If the bad robot obeys /robots.txt and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.
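
The exclusion section mentioned above might look like the following sketch, again checked with Python's standard urllib.robotparser. The robot name BadBot, the example.com URL, and the helper bot_may_fetch are hypothetical. The check only shows what a well-behaved parser concludes from the file; it does nothing to stop a robot that never reads it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named robot, allow everyone else.
# "BadBot" is a made-up name; a real entry needs the robot's actual name.
BLOCK_BAD_BOT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""

rules = RobotFileParser()
rules.parse(BLOCK_BAD_BOT.splitlines())


def bot_may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/index.html"  # placeholder URL
    print("BadBot allowed:", bot_may_fetch("BadBot", page))
    print("Googlebot allowed:", bot_may_fetch("Googlebot", page))
```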

DDoS Attack via Bad Robots
If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall. In a DDoS situation, the robots operate from many different IP addresses (hijacked PCs that are part of a large botnet) and generate attack load on your server simply by scanning your website. This can slow down or stop the entire web server even though little bandwidth is involved. It is considered a Layer 7 DDoS attack, that is, an application-layer attack.

The easiest solution is to use an advanced firewall that automatically blocks all IP addresses that make many connections. To learn more about our DDoS protection solution, please visit http://www.everworks.com/Services/DDoS
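
To make the idea of blocking addresses that make many connections more concrete, here is a minimal sketch of a sliding-window counter per IP address. It is not how any particular firewall works; the class name ConnectionRateTracker, the thresholds, and the sample addresses (taken from the documentation-only range 203.0.113.0/24) are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class ConnectionRateTracker:
    """Flag IPs that exceed max_connections within window_seconds."""

    def __init__(self, max_connections: int, window_seconds: float) -> None:
        self.max_connections = max_connections
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent hits
        self.blocked = set()

    def record(self, ip: str, now: float) -> bool:
        """Record one connection at time now; return True if ip is blocked."""
        if ip in self.blocked:
            return True
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_connections:
            self.blocked.add(ip)
            del self._hits[ip]
            return True
        return False


if __name__ == "__main__":
    # Illustrative thresholds: more than 3 connections in 10 seconds.
    tracker = ConnectionRateTracker(max_connections=3, window_seconds=10.0)
    for second in range(5):
        print(second, tracker.record("203.0.113.5", float(second)))
```

A real deployment would feed such a counter from server logs or firewall events and would also need to expire old blocks, which this sketch leaves out.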

Below are some other articles that explain how a robots.txt file works, as well as how to configure it for your website.