Tips and Examples for how to use your .htaccess file

.htaccess file explained with examples for eliminating referrer spam and deep linking

Someone recently asked how to block a particular IP address from their
web site. The .htaccess file is a good way to do this. I’m assuming
that you can Google for information about where to put your .htaccess
file, so I'm only going to cover some ways to populate it
to accomplish some routine access control and URL rewriting.

For more information on URL rewriting do a google search for:
+mod_rewrite +documentation
or
+mod_rewrite +examples

Topics covered are:


* A brief introduction to Regular Expressions
* Denying access based on IP address of browser
* Denying access based on HTTP_REFERER
* Denying linking or deep linking of images on your site
* Sending naughty users to “good” content elsewhere on the web
* Rewriting URLs in order to remedy an incorrect URL posting to email lists

The patterns used inside the directives in my .htaccess file are what are known as "regular expressions" (regexes for short). This site seems to be comprehensive, and was at the top of the Google search for "regular expressions".

Here are a couple of quick explanations for what's being done in my .htaccess file.

* ^ refers to the beginning of a string. So any regex that starts with ^ means "starts with".
* $ refers to the end of the string. So any regex that ends with $ means "ends with" whatever comes just before the $.
* ! is negation. It isn't part of the regex itself: in mod_rewrite directives (RewriteCond and RewriteRule) a ! (aka "bang") in front of a pattern means "not". So !string matches everything that does not match string.
* . means any single character.
* * means zero or more of the preceding character. Hence .* is like the UNIX or DOS * wildcard.
* | is a logical OR. I explain more about it below.
* \ escapes the "special characters" just described. So if I wanted to match a literal . I would write the regex \.

So to put a couple of these together:

* ^$ refers to a string that starts and stops with nothing in between, also known as an empty string, as there's nothing in it.
* ^stevem.* refers to a string that starts with stevem and then may contain any number of any characters.
* ^shooter\.net$ refers to a string that contains only "shooter.net"
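
As a rough sketch of how patterns like these look once they sit inside a directive, here are two SetEnvIf lines. The variable names NoReferer and FromShooter are made up for illustration and are not part of the file shown later.

## Sets NoReferer when the Referer header is empty (^$ matches only an empty string)
SetEnvIf Referer "^$" NoReferer
## Sets FromShooter when the Referer starts with this exact host (note the escaped dots)
SetEnvIf Referer "^http://www\.shooter\.net/" FromShooter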

I describe how I've set up these regexes below using the characters described above...

Use Google for more info.

Another thing to note: in a couple of the directives in my .htaccess file I block some rather large chunks of the Internet from getting to my web site. I'm doing this because I want to be sure that I've blocked irritating visitors (usually referrer spammers -- SCUM) from hitting my site. I've used a whois server to find out what the netblock boundaries are, and blocked the entire netblock. I also keep a close eye on my access log to see who is being denied access to my site. If you don't know what you're doing, I strongly suggest that you don't take such extreme measures, and that you block single IP addresses instead of whole networks.

And most of all, be careful and make sure you understand what you are doing in your .htaccess file. Then TEST TEST TEST to make sure it's accomplishing what you want and that you haven't crippled your site.

.htaccess files are powerful tools for fine tuning access to your web site. To paraphrase a famous comic book uncle, "With great power comes the ability to really screw things up."

Here is a copy of my .htaccess file with inline comments attempting to describe what I'm doing.

## The next several statements set a variable (BadAgent, BadReferer, BadAddy)
## if the condition is met. The three deny lines below cause a
## 403 Forbidden to be issued when one of those variables is set

## This identifies a couple of irritating User-Agents
SetEnvIfNoCase User-Agent ".*(AdultGods|php/perl).*" BadAgent

## This identifies a couple irritating Referers
SetEnvIfNoCase Referer ".*(x-stories|plattendreher|pureteen|stormfront).*" BadReferer

## This blocks some IP addresses that I don't like. This is a rather complex regular expression.
## The () symbols contain a group of expressions separated by the pipe symbol |
## The | acts as a logical OR.
## So we match a string that is exactly one of the three IP addresses
## listed in the ()s. The ^ and $ anchors matter here: without them,
## 193.194.84.1 would also match addresses such as 193.194.84.10 or 193.194.84.123.
## The dots are escaped with \ so that they only match a literal dot.
## Do some searching on Google if you really want to dig this deeply into regexes
##
SetEnvIfNoCase REMOTE_ADDR "^(193\.194\.84\.1|69\.31\.86\.133|205\.252\.49\.146)$" BadAddy

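## This blocks an entire /16 netblock (the size of an old class B)... It's all owned by one ISP, and the
## referrer spam was coming from a dozen or more networks contained in this netblock.
## The ^ anchor and the escaped dots keep this from also matching addresses like 166.154.1.1
SetEnvIfNoCase REMOTE_ADDR "^66\.154\." BadAddy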

## This blocks two /16 networks. Same story as above. However,
## the \. after the group below makes sure that we only block two networks,
## 198.25.0.0 and 198.26.0.0. If we had left out the \. we would also have matched
## other networks, like 198.251.0.0 and 198.255.0.0 etc...
##
SetEnvIfNoCase REMOTE_ADDR "^198\.2(5|6)\." BadAddy
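
## A possible alternative, shown commented out as a sketch: the deny directive
## used further down also accepts plain addresses and CIDR ranges, which
## sidesteps the regex pitfalls described above. This assumes the same
## order deny,allow setup that follows.
#deny from 193.194.84.1 69.31.86.133 205.252.49.146
#deny from 66.154.0.0/16
#deny from 198.25.0.0/16 198.26.0.0/16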

## Here's where we set up the denials if there were matches above
order deny,allow
deny from env=BadAgent
deny from env=BadReferer
deny from env=BadAddy
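
## A sketch of the same denials for Apache 2.4, where Require replaces
## order/deny. This assumes a 2.4 server with mod_authz_core and an
## AllowOverride setting that permits it. It is commented out because it
## would replace the four lines above rather than sit alongside them.
#<RequireAll>
#    Require all granted
#    Require not env BadAgent
#    Require not env BadReferer
#    Require not env BadAddy
#</RequireAll>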


## This part of the .htaccess file is where I do some URL rewriting


## Gotta turn on the engine if we're going to get anywhere
RewriteEngine on

## These two rules cause requests for the Atom and RSS feeds at the
## docroot of my site to be redirected to the ExpressionEngine system
## so they are properly served. The dot in each pattern is escaped so
## that it only matches a literal dot.
##
RewriteRule atom\.xml$ http://www.shooter.net/index.php/weblog/rss_atom/ [R]
RewriteRule rss\.xml$ http://www.shooter.net/index.php/weblog/rss_2.0/ [R]
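
## A hedged variation, commented out: if the feeds have moved for good, a
## permanent redirect can be requested with R=301, and the L flag stops
## later rules in this file from also being applied to the request.
## Whether 301 is appropriate for these feeds is an assumption.
#RewriteRule atom\.xml$ http://www.shooter.net/index.php/weblog/rss_atom/ [R=301,L]
#RewriteRule rss\.xml$ http://www.shooter.net/index.php/weblog/rss_2.0/ [R=301,L]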

## This commented out rule causes all requests for index.html to be
## redirected to ExpressionEngine. It was in place as I was migrating
## and I wanted everything directed into EE instead of using my old static pages.
## In a .htaccess file the leading slash is stripped before the pattern is
## tested, so the pattern must not require one.
##
#RewriteRule (^|/)index\.html$ http://www.shooter.net/index.php [R]

## I posted an email to a mailing list announcing this article, but I
## somehow managed to screw it up. This entry fixes it so I didn't
## have to post an "I'm a dummy" followup
## see if you can figure out the incorrect URL that I posted
##
RewriteRule ^using-a-polarizer/ http://www.shooter.net/index.php/weblog/Item/using-a-polarizer/ [R]


## This is how I prevent other sites from embedding my images in their pages,
## a practice also known as "deep linking" or "hotlinking". It pisses me off
## that in spite of my copyright notice forbidding this practice, some people
## think it's okay to steal my work, and steal my bandwidth too...
##
##
## This line says: only continue if the HTTP_REFERER variable is
## not empty. If there is no HTTP_REFERER set, we drop out of this
## series of conditionals, which means that we don't execute the
## RewriteRule at the end of the block.
##
RewriteCond %{HTTP_REFERER} !^$

## This is a case insensitive (because of the [NC] flag) expression
## that causes any referrals from shooter.net to exit from this series
## of conditionals
##
RewriteCond %{HTTP_REFERER} !shooter\.net [NC]

## dicelady.com is my mom's web site. Since we're sharing the installation
## of ExpressionEngine, I want to serve images for her... So if the
## HTTP_REFERER is not from her site we stay in this series of
## conditionals. If it is from her site, the request drops out of this
## series of conditionals.
##
RewriteCond %{HTTP_REFERER} !^http://www\.dicelady\.com/ [NC]

## Here's the action item. If we have a referrer, and it's not from
## shooter.net or dicelady.com and the request ends in jpg, then no matter
## what image they requested, we're going to serve them up a gif file that
## is a reminder to not steal my images...
##
RewriteRule .*\.jpg$ http://shooter.net/ah.ha.1.gif [R]
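
## A sketch of a stricter variant, commented out. Instead of redirecting to a
## reminder image it answers 403 Forbidden (the F flag), covers a few more
## image extensions, and only accepts referrers whose host is shooter.net,
## a subdomain of it, or www.dicelady.com. The extension list is an
## assumption, so adjust it to taste.
#RewriteCond %{HTTP_REFERER} !^$
#RewriteCond %{HTTP_REFERER} !^https?://([^/]+\.)?shooter\.net(/|$) [NC]
#RewriteCond %{HTTP_REFERER} !^https?://www\.dicelady\.com/ [NC]
#RewriteRule \.(jpe?g|gif|png)$ - [F,NC]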

## The website name has been changed to prevent the guilty
## from knowing how I did what I did...
##
## This one is kinda fun. There is a website whose domain
## name contains "hatemongersite". They are a white supremacist
## website that was linking to my images of same-sex weddings
## and saying derogatory, ugly things about the people that I
## took pictures of... So, ANY request for ANYTHING from my
## site that originates from this website gets redirected to
## a TOLERANCE site, and is served a nice PDF pamphlet about
## "promoting tolerance". The nice thing about sending them a
## PDF is that most web browsers will launch a helper app, Adobe Acrobat,
## to display the PDF... So they are browsing along, and suddenly they
## have a "promoting tolerance" PDF being displayed...
##
RewriteCond %{HTTP_REFERER} ^http://.*hatemongersite.*$ [NC]
RewriteRule .* http://www.tolerance.org/101_tools/101_tools.pdf [R]

## This one is simple. Requests from a site that irritated me
## are redirected to a site with a very large pipe
## and an 8 MB thermal image of Tokyo!
##
RewriteCond %{HTTP_REFERER} ^http://.*popmiranda.*$ [NC]
RewriteRule .* http://vvvvvvv.vv.vvv.vvv/gallery/images/tokyo.jpg [R]

## Here's how you can redirect traffic from a certain address or range
## of addresses to another site... This was the network that included
## several users from the hatemongersite (IP changed to protect the guilty)
## I point them at the "promoting tolerance" pdf.
##
RewriteCond %{REMOTE_ADDR} ^192\.168\.10\.
RewriteRule .*\.html$ http://www.tolerance.org/101_tools/101_tools.pdf [R]
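
## On Apache 2.4 the same network can be matched as a CIDR range rather than
## with a regex. This is a sketch, commented out, and assumes a 2.4 server
## where RewriteCond accepts "expr" as its test string.
#RewriteCond expr "-R '192.168.10.0/24'"
#RewriteRule \.html$ http://www.tolerance.org/101_tools/101_tools.pdf [R]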


So that's about it... I hope you've enjoyed this little jaunt through my .htaccess file.
