10:23 PM EST, Sunday Nov 29, 2009
how to only allow major search engines to access site and hu
I have a serious problem with bad bots and want to allow only Yahoo, Google, Ask and Bing to crawl my site, blocking all other bots / crawlers.
I still want humans to be able to browse my site as well, and I want to do it only via .htaccess because most bad bots ignore robots.txt.
Any ideas?
07:08 PM EST, Tuesday Dec 01, 2009
Easy, provided you know the IPs of the bots.
Sample .htaccess:
--
order allow,deny
deny from 123.45.6.7
deny from 012.34.5.
allow from all
--
This allows everyone except the addresses listed in the deny lines.
You can also append the anti-bot rules below and add new bots as you find them, provided they identify themselves in their User-Agent header.
--
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
--
The above will return a 403 Forbidden response to the listed bots.
Apache rewrite conditions (mod_rewrite) are easy to search for on Google, and you will find plenty of information on the various options this module offers.
The only hitch is that you have to be able to identify a bot (or its IP) before you can block it.
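Before deploying a list like the one above, it can help to check which User-Agent strings it would actually catch. The following is a minimal Python sketch, not part of the original rules: it copies four of the patterns and mimics the difference between anchored (`^`) and unanchored conditions. The names `BLOCKED` and `is_blocked` are made up for illustration, and Python's `re` is only assumed to behave like Apache's regex engine for patterns this simple.

```python
import re

# A few patterns copied from the rules above as (regex, case_insensitive).
# In .htaccess the space is written "\ " because a bare space separates
# arguments; in a Python string a plain space is enough.
BLOCKED = [
    (r"^BlackWidow", False),
    (r"HTTrack", True),        # unanchored, [NC]
    (r"Indy Library", True),   # unanchored, [NC]
    (r"^Wget", False),
]


def is_blocked(user_agent):
    """Return True if any copied pattern matches the User-Agent string."""
    for pattern, nocase in BLOCKED:
        flags = re.IGNORECASE if nocase else 0
        if re.search(pattern, user_agent, flags):
            return True
    return False
```

Note that most of the rules are anchored with `^`, so they only match when the bot's name is at the very start of the User-Agent. A string such as `Mozilla/5.0 (compatible; Wget)` would slip past the `^Wget` rule, while the unanchored `HTTrack` rule matches anywhere in the string.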
_________________
Johnny Jones has ordered that all people who don't meet his personal standards of having a right to speak on YNOT be silent.
10:13 PM EST, Tuesday Dec 01, 2009
This works well for Referrers:
--
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(www\.)?quicklex\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?kpnvandaag\.nl/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?dailykeys\.com/ [NC]
RewriteRule .* - [F,L]
--
Just thought I'd add it to the above.
07:15 PM EST, Monday Dec 07, 2009
Maybe this is what I could do:
I could allow the IP ranges of the major search engines and also put a robots.txt file on my server with a test rule that disallows crawling a specific directory. If a bot follows this rule it's a good bot; if it doesn't, I don't want it crawling.
The problem with the IPs is that Google keeps changing its IP addresses, so if one of them isn't in the .htaccess list, that would hurt me.
Thanks guys, that was helpful.
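On the changing-IP concern: one commonly described alternative to a fixed IP list is a reverse DNS lookup followed by a forward confirmation. This is a rough Python sketch under stated assumptions, not a drop-in solution. The hostname suffixes in `CRAWLER_SUFFIXES` are assumptions to verify against each search engine's own documentation (Ask is left out because its suffix is not known here). `is_verified_crawler` and its optional lookup parameters are hypothetical names.

```python
import socket

# ASSUMED hostname suffixes for the crawlers named in this thread; check
# each engine's documentation before relying on them.
CRAWLER_SUFFIXES = (
    ".googlebot.com",
    ".google.com",
    ".search.msn.com",
    ".crawl.yahoo.net",
)


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the host suffix, then confirm the host
    resolves back to the same ip. The lookups can be replaced for testing."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
    except OSError:
        return False  # no PTR record or lookup failure
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

The forward confirmation matters because anyone who controls the reverse DNS for their own IP range can make it claim to be a crawler hostname. Only the real owner of the domain can make that hostname resolve back to the same IP.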
03:21 AM EST, Wednesday Dec 09, 2009
You can get really clever: when a bot ignores robots.txt, you can capture its IP while it crawls that rogue directory and add it to a deny list.
It takes a little bit of scripting, but there is nothing like automation.
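The scripting step could look something like the Python sketch below. It is only an illustration and makes several assumptions: the trap directory is the hypothetical `/bot-trap/` (which would also need a `Disallow: /bot-trap/` line in robots.txt), the access log uses Apache's common or combined format, and the function names are made up.

```python
import re

# Hypothetical trap directory; it must also be disallowed in robots.txt.
TRAP_PREFIX = "/bot-trap/"

# Start of a common/combined log line:
# client-ip ident user [timestamp] "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)')


def trapped_ips(log_lines, trap_prefix=TRAP_PREFIX):
    """Return the sorted, de-duplicated IPs that requested the trap path."""
    ips = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(trap_prefix):
            ips.add(match.group(1))
    return sorted(ips)


def deny_lines(ips):
    """Format IPs as 'deny from' lines in the style used in this thread."""
    return ["deny from " + ip for ip in ips]
```

A cron job could feed the access log through `trapped_ips` and append the output of `deny_lines` to the .htaccess deny list. It is worth reviewing the list before applying it, so that a legitimate crawler or a curious human who wandered into the trap is not locked out.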