Published on Jul 31, 2012

Protecting a website against crawler attacks

Automated robot (or web spider) attacks are commonly used for sophisticated attacks against targeted websites. In a typical scenario, a malicious attacker hires virtual hosting with a dynamic IP address for a short period of time and periodically runs a programmed application that performs successive HTTP requests with the goal of exploring the portal and/or causing potential portal failures.

Our company runs an online portal that ranks among the top 3 online insurance calculators. Perhaps due to tightened competition, we noted a few unusual symptoms in the portal's behavior - such as unexpected database crashes during weekends (connection pool exhaustion), or an unusually high number of insurance calculations reported by Google Analytics (roughly 8-10 times more than usual).

After investigation we discovered a large number of HTTP requests coming periodically from the same IP address over 2-3 hours. Simply blocking the IP address did not help, because the next weekend the same attack occurred again - this time from a different IP address.

This article gives short notes on how we made the portal more resistant to such automated attacks by applying a few simple security rules.

1. Two consecutive HTTP requests < 2.5 secs per IP

Automated robots do not follow typical human-like behaviour. It would take considerable programming effort to emulate fairly complex human behavior - e.g. submitting pages at a frequency that depends on the content: a human may take 10 seconds to scan through a page, or 30 seconds to enter the requested data into a form.

Robots, on the other hand, are programmed to send requests immediately (unless told to pause). So measuring the time between two consecutive HTTP requests from a particular IP address was the first rule we established to distinguish robots from human visitors. If a page request arrives earlier than the specified interval (e.g. 2.5 secs), the page is not processed and the previously generated HTML content is returned instead (a fast response, since the page is only loaded from a cache).

This rule helped us to identify about 3-4 attacks within a few weeks after implementation and helped to isolate the attacker's IP address.

We implemented all rules in a dedicated class named "Security". The following routine applies the first rule:


class Security {

	/**
	* Minimum seconds since the previous non-ajax request (float), default 2.5 sec
	*/
	const SECONDS_SINCE_PREV_REQUEST = 2.5;

	/**
	* Return true if the minimum time since the previous request has passed
	*/
	protected static function isSessionRequestFrequencyOK(){
		$lastRequest = !empty($_SESSION['security']['lastRequest']) ?
							$_SESSION['security']['lastRequest'] : 0;
		$t = microtime(true);
		if($lastRequest){
			$d = $t - $lastRequest;
			if($d < self::SECONDS_SINCE_PREV_REQUEST){
				mylog('Rejected too early request ['.($t - $lastRequest).'] < ['
					.self::SECONDS_SINCE_PREV_REQUEST.'] (remote IP: ['
					.$_SERVER['REMOTE_ADDR'].'])');
				return false;
			}
		}
		$_SESSION['security']['lastRequest'] = $t;
		return true;
	}
	...
}
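
The listing above does not show how the cached HTML mentioned earlier is returned when the rule fails. Below is a minimal sketch of one possible approach, assuming a simple file-based page cache keyed by the session ID (the method names storeCachedResponse/sendCachedResponse and the cache location are our own assumptions, not part of the original code):


class Security {

	/**
	* Sketch only: store the generated page so it can be replayed for rejected requests
	* (hypothetical file-based cache keyed by session ID)
	*/
	protected static function storeCachedResponse($html){
		file_put_contents(sys_get_temp_dir().'/page_'.session_id().'.html', $html);
	}

	/**
	* Sketch only: answer a rejected request with the previously generated HTML
	* (fast response, no application logic is executed)
	*/
	protected static function sendCachedResponse(){
		$file = sys_get_temp_dir().'/page_'.session_id().'.html';
		header('Content-Type: text/html; charset=utf-8');
		if(is_file($file)){
			readfile($file);
		}
		exit;
	}
	...
}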

2. Maximum number of requests per session < 300

Based on the nature of our website and the number of CMS pages, we estimated that a common visitor will probably never need more than 300 requests. Our CMS holds up to 100 pages, and a single insurance calculation takes no more than 10-15 page submits (when stepping through the insurance wizard).

Just like the first rule, this second one has almost no overhead, since it uses only a single counter stored in the session. A method named "isSessionRequestCountOK" within class Security implements the rule:


class Security {

	/**
	* Maximum accepted number of HTTP requests per session
	*/
	const MAX_SESSION_REQUEST_COUNT = 300;

	/**
	* Return false if reached maximum number of requests per current session
	*/
	protected static function isSessionRequestCountOK(){
		$count = !empty($_SESSION['security']['requestCount']) ?
					intval($_SESSION['security']['requestCount']) : 0;
		if($count >= self::MAX_SESSION_REQUEST_COUNT){
			mylog('Reached maximum number of allowed session requests ['.$count.'] '
				.'(remote IP: ['.$_SERVER['REMOTE_ADDR'].'])');
			return false;
		}
		$_SESSION['security']['requestCount'] = ++$count;
		return true;
	}
	...
}

3. Maximum 100 requests per IP within the last 10 minutes

This rule is probably the most helpful in detecting robots, but it is evaluated last because it uses more resources (it queries the database). Unlike the first two rules, it does not rely on the session, which a robot can reset upon each HTTP request to constantly emulate a new visitor. Therefore this rule effectively filters out robots that bypass the session-based rules.

According to Google Analytics, a common visitor rarely spends more than 1-30 minutes on our website, making on average 10-40 page submits. So we set the rule that no human visitor needs to make more than 100 requests within 10 minutes. Unlike a human, a robot can easily make 100 requests within 3-4 minutes or even less.

This rule was triggered only once within 3-4 months after implementation. After discovering the attack we blocked the remote IP on the firewall for a certain time.

A simple SQL table (MySQL) stores the remote IP address along with the time of the HTTP request.

Note that the table uses the InnoDB engine, which allows record-level locks, unlike MyISAM, which locks at the table level and therefore slows down database responses under concurrent requests.


CREATE TABLE `request` (
	`time` DECIMAL(14,4),
	`ip` VARCHAR(50),
	INDEX `time` (`time`),
	INDEX `ip` (`ip`)
)
ENGINE=InnoDB;

Method "isRequestByIpOK" applies the rule:


class Security{

	/**
	* How many past minutes should be checked per current IP
	*/
	const IP_LAST_MINUTES = 10;

	/**
	* Maximum number of requests per current IP within the last IP_LAST_MINUTES minutes
	* Note: ajax requests don't count
	*/
	const MAX_IP_LAST_MINUTES = 100;

	/**
	* Check the maximum number of requests for the current IP within the last IP_LAST_MINUTES minutes.
	*/
	protected static function isRequestByIpOK(){
		if(!empty($_SERVER['REMOTE_ADDR'])){
			$ip = $_SERVER['REMOTE_ADDR'];
			$sql = "INSERT INTO `request` (`time`, `ip`) "
					."VALUES (".microtime(true).", '{$ip}');";
			sql_query('MAIN', $sql);
			$t = (float)(microtime(true) - self::IP_LAST_MINUTES*60);
			$sql = "SELECT COUNT(*) as cnt FROM `request` "
					."WHERE time > ".$t." AND ip = '{$ip}'; ";
			$res = sql_query('MAIN', $sql);
			$row = sql_fetch($res);
			if($row['cnt'] >= self::MAX_IP_LAST_MINUTES){
				// log rejected request
				mylog('Suspicious IP ['.$_SERVER['REMOTE_ADDR'].']');
				return false;
			}
			// optionally do some clean up - delete outdated records
			if(mt_rand(1,10000)>9998){
				Security::cleanup();
			}
		}
		return true;
	}
	...
}
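
The cleanup() method referenced above is not shown in the original listing. A minimal sketch, assuming the same sql_query('MAIN', ...) wrapper, could simply delete records older than the checked window:


class Security{

	/**
	* Sketch only: delete request records older than the IP_LAST_MINUTES window,
	* so the table does not grow indefinitely
	*/
	protected static function cleanup(){
		$t = (float)(microtime(true) - self::IP_LAST_MINUTES*60);
		sql_query('MAIN', "DELETE FROM `request` WHERE `time` < ".$t.";");
	}
	...
}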

Wrapping up ...

A single HTTP request will only be processed when all three rules above return true. A public method "isRequestAllowed" within class Security checks all three rules and returns TRUE only if everything is OK:


class Security{

	/**
	* Return false if the current request does not conform to the security rules.
	* Ignores ajax requests and administrator's requests.
	*/
	public static function isRequestAllowed(){
		if(isAjax() || User::isAdministrator()){
			return true;
		}
		return	self::isSessionRequestFrequencyOK()
				&& self::isSessionRequestCountOK()
				&& self::isRequestByIpOK();
	}
	...
}
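
A minimal usage sketch of how the check might be wired into the front controller (e.g. index.php); the session must be started before the session-based rules are evaluated, and the 503 response shown here is just one possible reaction (rule 1, for example, returns the cached page instead):


// index.php - front controller (usage sketch)
session_start();

if(!Security::isRequestAllowed()){
	// reject the request cheaply, without running the application
	header('HTTP/1.1 503 Service Temporarily Unavailable');
	exit;
}

// ... continue with normal request dispatching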

4. Additional measures for increasing portal security

To keep up with basic website security, the following measures should always be applied to any website, regardless of whether it is protected against automated attacks or not:

4.1 Request a captcha after the 3rd failed login attempt within the last 2 hours

A captcha should not bother users. There is no reason to require a captcha on the first login attempt (as many sites do). However, a captcha should always be requested after the 2nd or 3rd unsuccessful login attempt.

The following is an example of an SQL table storing failed login attempts:


CREATE TABLE `user_login` (
	`time` INT NOT NULL,
	`username` VARCHAR(64) NOT NULL DEFAULT '',
	`ip` VARCHAR(50) NOT NULL,
	INDEX `time_username` (`time`, `username`),
	INDEX `time_ip` (`time`, `ip`)
)
ENGINE=InnoDB;

A method within class Security checks whether an additional captcha code should be requested:


class Security{

	/**
	* After how many failed login attempts a captcha is requested
	*/
	const MAX_FAILED_LOGINS = 3;

	/**
	* Return TRUE if the user exceeded the maximum number of allowed failed login attempts
	*/
	public static function isCaptchaRequired() {
		$uts = time() - 3600 * 2; // within the last 2 hours
		$res = sql_query("SELECT COUNT(*) AS `cnt` FROM `user_login` WHERE `ip` = '"
				.$_SERVER['REMOTE_ADDR']."' AND `time` > ".$uts);
		$row = sql_fetch($res);
		return $row['cnt'] >= self::MAX_FAILED_LOGINS;
	}
	}
	...
}
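
The listing above only reads from the user_login table; each failed login also has to be recorded there. A minimal sketch follows - the method name logFailedLogin and the sql_escape() helper are our own assumptions (use whatever escaping your database wrapper provides), not part of the original code:


class Security{

	/**
	* Sketch only: record a failed login attempt; called from the login controller
	* whenever authentication fails ($username must be escaped, it is user input)
	*/
	public static function logFailedLogin($username){
		sql_query("INSERT INTO `user_login` (`time`, `username`, `ip`) VALUES ("
			.time().", '".sql_escape($username)."', '".$_SERVER['REMOTE_ADDR']."');");
	}
	...
}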

4.2 Always use SSL for login section

The login section requires sending both the login password and the user name over the internet. Capturing plain unencrypted packets for a given URL is a pretty easy task these days - a malicious attacker sitting anywhere on the network path (e.g. on the same Wi-Fi network) can use any free network analyzer (packet sniffer).

However, even an invalid SSL certificate (e.g. self-signed) will fulfill the purpose of encrypting the data sent over the internet. A free class 1 SSL certificate valid for 1 year can be obtained from the StartCom website. Make sure to set up the correct certificate chain when installing the certificate on your hosting HTTP server.
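
A minimal sketch of forcing HTTPS for the login section in PHP (the exact host and path handling will differ per deployment; this snippet is not from the original code):


// redirect the login section to HTTPS (sketch)
if(empty($_SERVER['HTTPS']) || $_SERVER['HTTPS'] === 'off'){
	header('Location: https://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'], true, 301);
	exit;
}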

4.3 Measure password strength when the user changes a password

Users are the most dangerous element on the internet :-) They are heavily learning-resistant, always using simple, guessable, vocabulary-based passwords like "marry" or "bigboy". They are rarely aware that a break-in into their own account may endanger perhaps hundreds of other site users. That's why it is (unfortunately) necessary to force users to follow at least basic security rules - one of them is visually displaying the password strength (a password strength meter).

There are a few JavaScript snippets that measure the strength of a password. The following is an example of a user-friendly way of telling the user to "provide a stronger password":

Password strength meter - Demo

There are many JavaScript implementations; for example, my own implementation, simple-js-password-checker, has no external dependencies.

The following minimum safety rules should be enforced when setting a password (a server-side check is sketched after the list):

Common website users:

  • minimum 6 characters
  • combined letters and numbers

Website administrators:

  • minimum 8 characters
  • combined letters, numbers, upper and lower case characters and special characters (*!_-=%$#...)
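
Since client-side JavaScript checks can be bypassed, the same minimum rules should be re-checked on the server. Below is a minimal server-side sketch of the rules listed above (the method name isPasswordStrongEnough and the $isAdmin switch are our own assumptions):


class Security{

	/**
	* Sketch only: return true if the password fulfills the minimum rules listed above;
	* $isAdmin switches between the common-user and administrator rules
	*/
	public static function isPasswordStrongEnough($password, $isAdmin = false){
		if(!$isAdmin){
			// common users: at least 6 characters, letters and numbers combined
			return strlen($password) >= 6
				&& preg_match('/[a-zA-Z]/', $password)
				&& preg_match('/[0-9]/', $password);
		}
		// administrators: at least 8 characters, upper and lower case letters,
		// numbers and at least one special character
		return strlen($password) >= 8
			&& preg_match('/[a-z]/', $password)
			&& preg_match('/[A-Z]/', $password)
			&& preg_match('/[0-9]/', $password)
			&& preg_match('/[^a-zA-Z0-9]/', $password);
	}
	...
}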

4.4 Never send the login password and the login user name in the same email

Let the user choose the desired login user name while registering, but only the password should be sent to their email. Again, sending an unencrypted plain-text email containing both the login password and the username is a high security risk.

4.5 Implement RBAC or at least some kind of User::isAdministrator() function

Role-based access control (RBAC) is a must for any corporate website. Permissions should always be checked before the requested functionality is executed. Since implementing RBAC is quite a challenging task, a good solution might be to use an established PHP framework - most frameworks implement some kind of access permission checks (Yii, Zend, CodeIgniter, CakePHP, Kohana, etc.).
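
For illustration, a trivial sketch of the User::isAdministrator() check used in isRequestAllowed() above, assuming the logged-in user's role is stored in the session at login time (the session key and the role name are assumptions; a real RBAC layer would be considerably more elaborate):


class User{

	/**
	* Sketch only: return true if the currently logged-in user has the administrator role
	*/
	public static function isAdministrator(){
		return !empty($_SESSION['user']['role'])
			&& $_SESSION['user']['role'] === 'administrator';
	}
	...
}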


Conclusion

With sophisticated tools available on the internet today, it is becoming more difficult to protect websites against automated malicious attacks. Attacks performed by programmed robots are a relatively new way of probing competitors' sites. The three rules described above, however, helped us to increase portal stability and to isolate the IP addresses from which the attacks were executed.

Additionally, some basic security measures should be applied to any website - these include using a secure connection for the login section, enforcing minimum password strength, and requesting a captcha after the third failed login attempt within the last 2 hours for a particular user.


Synet.sk

Professional development of web applications and custom solutions. Consultancy services.



https://synet.sk