Another spider/crawler season

‘Crawler and spider rules? I’ll work on that later,’ I said a few weeks ago. Bad idea.

The crawler rules should have been set before connecting up the domain name, but I thought I could get something in there later on.

We’ve just had a whole day (Monday night through most of this Tuesday) of a botnet attempting to overload the XML-RPC endpoint with login attempts. Those got blocked automatically. Then odd-looking user agents of bots I’ve never heard of started appearing all over the logs. BaiduSpider turns up every hour or so, sees it’s not allowed in robots.txt, then clears off again. So much for ‘we obey that rule’ from their help pages.
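
For reference, this is roughly the sort of rule that shuts that endpoint off at the web server; a minimal sketch, assuming Apache 2.4+ and a WordPress-style xmlrpc.php that nothing legitimate still needs:

    # Deny every request to xmlrpc.php (Apache 2.4+ syntax)
    <Files "xmlrpc.php">
        Require all denied
    </Files>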

Since then we’ve been hardening the backend, the frontend and the technical setup, and putting additional protection in place.

There is no data to get at; my main concern is someone trying to use the site as a launchpad for an attack elsewhere. If that did happen, I’d probably look at capturing as much evidence as I could and sabotaging it.

Our robots.txt currently consists of:

    User-agent: *
    Disallow: /

Which basically tells every user agent to keep out of the whole site. Obviously I’m adding exceptions along the way for Googlebot and more disallows for Baidu, along the lines of the sketch below.
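
For illustration, a sketch of where the file is heading; the specific crawler groups are my assumption, and note that Baiduspider and Googlebot follow their own group and ignore the * block:

    User-agent: Googlebot
    Disallow:

    User-agent: Baiduspider
    Disallow: /

    User-agent: *
    Disallow: /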

List of spiders that ignore robots.txt and .htaccess methods (a sample user-agent block follows the list):

  • Baidu (so far it has ignored 7 variations of blocks)
  • random unheard-of bots with a ‘mission statement’ of checking SEO/links and security
  • random unheard-of search engines, probably used to profile sites for exploits
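
Since Baidu won’t take the hint from robots.txt, the fallback is refusing its requests at the server. A minimal sketch of one such .htaccess variation, assuming Apache with mod_rewrite enabled and matching on the Baiduspider user-agent string (a bot that fakes its user agent walks straight past this, which is why these blocks only go so far):

    <IfModule mod_rewrite.c>
        RewriteEngine On
        # Return 403 Forbidden to anything identifying itself as Baiduspider
        RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
        RewriteRule .* - [F,L]
    </IfModule>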

And this week’s bonus!

Hackers were also trying to sneak in. Annoyingly, they still seem to be trying, wasting my time combing through traffic logs to block them (which does nothing in the end). I did not buy a bigger server for some horrible kids to practice their ‘hax0ring’; if it continues, they can pay for it.
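
For what it’s worth, the manual blocks that come out of those log sessions look something like this; a sketch, assuming Apache 2.4 and using a documentation address range in place of the real offenders:

    <RequireAll>
        Require all granted
        # Substitute the addresses pulled from the traffic logs
        Require not ip 203.0.113.0/24
    </RequireAll>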

Luckily, most of it got filtered and deflected, so additional bandwidth wasn’t needed.
