ClaudeBot causing collection#show performance problems #4130

Closed
benwbrum opened this issue May 27, 2024 · 6 comments · Fixed by #4131, #4133 or #4138

@benwbrum
Owner

The collection#show action is getting hammered, possibly by bots. This seems to have brought performance to a standstill at different times of day:

[Screenshot from 2024-05-27 07-12-15]

@benwbrum
Owner Author

benwbrum commented May 27, 2024

This looks like another ill-behaved spider problem.

During a 3-hour window (the first half of the graph above), we saw:

  • 141950 requests to collection#show.
  • 140818 requests were to {"collection_id"=>25000140, "collection_title"=>"Indy Parks and Recreation "}
  • 94% of requests to collection#show during this window were by ClaudeBot, distributed across a range of IP addresses
2.7.3 :010 > sample = show_events.sample(100)
 => [#<Ahoy::Event id: 298399469, visit_id: 209692590, user_id: nil, name: "collection#show", properties: {"collecti... 
2.7.3 :011 > sample.map{|event| event.visit.browser}.tally
 => {"ClaudeBot"=>96, "Chrome Mobile"=>4} 
2.7.3 :012 > sample.map{|event| event.visit.user_agent}.tally
 => {"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"=>96, "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.65 Mobile Safari/537.36 (compatible; GoogleOther)"=>4} 
2.7.3 :014 > pp sample.map{|event| event.visit.ip}.tally
{"3.138.172.82"=>1,
 "3.142.166.143"=>2,
 "3.145.49.160"=>1,
 "3.147.68.236"=>1,
 "18.220.148.149"=>1,
 "3.21.46.181"=>1,
 "3.21.234.238"=>2,
 "3.137.163.44"=>1,
 "3.129.67.59"=>1,
 "3.129.21.37"=>1,
 "3.146.255.113"=>1,
 "66.249.66.3"=>1,
 "18.219.244.12"=>1,
 "18.191.225.36"=>1,
 "66.249.66.22"=>2,
 "3.128.226.23"=>1,
 "3.144.2.77"=>1,
 "18.118.210.41"=>1,
 "18.224.31.50"=>1,
 "3.142.196.223"=>1,
 "3.142.42.136"=>1,
 "18.118.19.223"=>1,
 "3.145.41.49"=>1,
 "3.147.238.61"=>1,
 "3.12.102.118"=>1,
 "18.188.180.136"=>1,
 "3.138.69.172"=>1,
 "3.145.34.109"=>1,
 "52.14.239.105"=>1,
 "3.144.151.46"=>1,
 "3.23.126.48"=>1,
 "18.118.159.232"=>1,
 "18.223.188.202"=>1,
 "3.22.77.79"=>1,
 "3.147.43.233"=>1,
 "18.219.190.220"=>1,
 "3.147.140.206"=>1,
 "18.217.232.226"=>1,
 "3.141.47.178"=>1,
 "18.117.154.219"=>1,
 "52.14.6.128"=>1,
 "18.223.205.116"=>1,
 "3.17.181.181"=>1,
 "18.221.117.51"=>1,
 "18.119.119.137"=>1,
 "18.220.164.222"=>1,
 "18.224.19.167"=>1,
 "18.222.12.201"=>1,
 "52.15.137.232"=>1,
 "3.14.64.41"=>1,
 "52.14.196.241"=>1,
 "3.17.29.195"=>2,
 "3.148.145.236"=>1,
 "18.118.135.213"=>1,
 "18.219.73.146"=>1,
 "13.58.53.247"=>1,
 "18.223.211.51"=>1,
 "13.58.242.200"=>1,
 "18.191.216.97"=>1,
 "3.142.250.95"=>1,
 "3.14.149.176"=>1,
 "52.14.105.188"=>1,
 "18.188.90.132"=>1,
 "3.12.152.21"=>1,
 "3.22.61.226"=>1,
 "3.22.118.21"=>1,
 "52.15.233.135"=>1,
 "3.144.237.77"=>2,
 "3.133.131.32"=>1,
 "18.119.133.72"=>1,
 "3.16.25.91"=>1,
 "18.216.27.23"=>1,
 "3.16.67.33"=>1,
 "3.128.197.221"=>1,
 "52.15.48.176"=>1,
 "3.144.211.45"=>1,
 "18.118.28.34"=>1,
 "3.128.201.209"=>1,
 "18.117.231.6"=>1,
 "66.249.66.23"=>1,
 "3.145.204.201"=>1,
 "13.58.82.135"=>1,
 "3.129.18.221"=>1,
 "18.117.102.235"=>1,
 "18.117.172.160"=>1,
 "18.191.193.231"=>1,
 "3.146.176.193"=>1,
 "3.133.103.100"=>1,
 "3.15.212.91"=>1,
 "3.134.85.72"=>1,
 "18.223.206.160"=>1,
 "3.145.78.155"=>1,
 "3.144.187.55"=>1,
 "18.219.116.162"=>1,
 "3.138.246.227"=>1}

It looks like ClaudeBot is getting confused by facets, since each visit in the sample has a unique landing page URL that looks something like this: https://fromthepage.com/digitalindy/ipr?search%5Bs2%5D%5B%5D=IPR-Box001_077.jpg&search%5Bs2%5D%5B%5D=IPR-Box016_230.jpg&search%5Bs2%5D%5B%5D=IPR-Box021_193&search%5Bs2%5D%5B%5D=IPR-Box026_863.jpg&search%5Bwork-collection_id%5D=25000140
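
To make the facet problem concrete, here is a minimal sketch that decodes the example URL above with Ruby's standard `uri` library. The helper name `facet_summary` is hypothetical and not part of FromThePage; it only shows that the query string is a pile of repeated `search[s2][]` facet values on top of a single collection path, so a crawler following facet links can produce a very large number of distinct URLs for the same page.

```ruby
require "uri"

# Hypothetical helper: split a landing-page URL into its path and its decoded
# query parameters, grouping repeated keys such as "search[s2][]".
def facet_summary(url)
  uri = URI.parse(url)
  pairs = URI.decode_www_form(uri.query.to_s)
  grouped = pairs.group_by(&:first).transform_values { |kv| kv.map(&:last) }
  { path: uri.path, params: grouped }
end

# Example landing-page URL copied from the comment above.
url = "https://fromthepage.com/digitalindy/ipr?search%5Bs2%5D%5B%5D=IPR-Box001_077.jpg&search%5Bs2%5D%5B%5D=IPR-Box016_230.jpg&search%5Bs2%5D%5B%5D=IPR-Box021_193&search%5Bs2%5D%5B%5D=IPR-Box026_863.jpg&search%5Bwork-collection_id%5D=25000140"

summary = facet_summary(url)
puts summary[:path]
summary[:params].each { |key, values| puts "#{key}: #{values.inspect}" }
```

Grouping visits by `summary[:path]` instead of the full landing page URL would collapse all of these facet variations into one entry per collection.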

@benwbrum benwbrum changed the title collection#show performance problems ClaudeBot causing collection#show performance problems May 27, 2024
@benwbrum benwbrum self-assigned this May 27, 2024
@benwbrum
Owner Author

Several other bots are also crawling, including Bytespider, which, however, does not identify itself in the browser field:

Visit.where(started_at: [(Time.now - 1.hour)..]).pluck(:browser).tally
 => {"Chrome Mobile"=>2861, "ClaudeBot"=>7553, "Amazonbot"=>891, "Mobile Safari"=>11, "CFNetwork"=>9, "PetalBot"=>162, "Chrome"=>75, "GPTBot"=>133, "Firefox"=>8, "Other"=>7, "Site24x7"=>6, "Safari"=>6, "HeadlessChrome"=>3, "IE"=>4, "Edge"=>5, "Samsung Internet"=>1, "Apple Mail"=>1} 
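
As a rough sketch, the tally above can be turned into percentage shares to see how much of the last hour's visits each browser label accounts for. The hash is copied from the console output; note that the shares are based on the browser field only, so crawlers that do not identify themselves there (such as Bytespider) would be hidden inside other labels.

```ruby
# Browser tally copied from the console output above.
tally = {
  "Chrome Mobile" => 2861, "ClaudeBot" => 7553, "Amazonbot" => 891,
  "Mobile Safari" => 11, "CFNetwork" => 9, "PetalBot" => 162,
  "Chrome" => 75, "GPTBot" => 133, "Firefox" => 8, "Other" => 7,
  "Site24x7" => 6, "Safari" => 6, "HeadlessChrome" => 3, "IE" => 4,
  "Edge" => 5, "Samsung Internet" => 1, "Apple Mail" => 1
}

total = tally.values.sum

# Sort by descending count and convert each count to a percentage of all visits.
shares = tally.sort_by { |_browser, count| -count }.map do |browser, count|
  [browser, (100.0 * count / total).round(1)]
end

shares.first(5).each { |browser, pct| puts format("%-15s %5.1f%%", browser, pct) }
```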

@benwbrum
Owner Author

Added additional agents based on this:

fromthepage@fromthepage:~/deployment/current$ sudo grep -E "bot|spider|crawl|slurp|mediapartners|Feedfetcher|bingpreview|facebookexternalhit|monitoring" /var/log/apache2/other_vhosts_access.log | awk -F\" '{print $6}' | sort | uniq -c | sort -nr
 492313 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
  28721 Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
  25740 Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)
  25305 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
  19757 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  14054 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
  13638 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  11455 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
   5640 Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
   4092 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
   3564 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
   3035 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.65 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   1815 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
    480 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
    392 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
    272 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.66 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    230 Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)
    208 Googlebot-Image/1.0
    192 CCBot/2.0 (https://commoncrawl.org/faq/)
    180 Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)
    161 Mozilla/5.0 (compatible; MojeekBot/0.11; +https://www.mojeek.com/bot.html)
    148 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
     51 Turnitin (https://bit.ly/2UvnfoQ)
     28 DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)
     24 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/112.0.0.0 Safari/537.36
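
For illustration only, here is a sketch of how a user-agent list like this might feed a Rack::Attack rule. The list of tokens in `BAD_BOT_PATTERN` is an assumption picked from the log output above, and `bad_bot?` is a hypothetical helper; the rules actually deployed for FromThePage may differ. Matching on the full user agent string rather than the parsed browser name also catches crawlers such as Bytespider.

```ruby
# Assumed list of crawler tokens, chosen from the user agents in the log
# output above. The real blocklist may contain different entries.
BAD_BOT_PATTERN = /ClaudeBot|Bytespider|DataForSeoBot|SemrushBot|AhrefsBot|GPTBot/i

# Hypothetical helper: true when the user agent names one of the listed crawlers.
def bad_bot?(user_agent)
  BAD_BOT_PATTERN.match?(user_agent.to_s)
end

# When the rack-attack gem is loaded (for example in a Rails initializer),
# register a blocklist rule. This branch is skipped if Rack::Attack is absent.
if defined?(Rack::Attack)
  Rack::Attack.blocklist("block bad bots") do |request|
    bad_bot?(request.user_agent)
  end
end

claude_ua = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
google_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

puts bad_bot?(claude_ua)
puts bad_bot?(google_ua)
```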

@benwbrum
Owner Author

The throttling does not appear to be working.

To test, execute this from a laptop and watch for 529 response codes:

60.times do
  # Request the page with ClaudeBot's user agent and print only the HTTP status code
  print `curl -s -o /dev/null -w "%{http_code}" -A 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)' https://fromthepage.com/benwbrum`
  print "\n\n"
end
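
A variant of the same check that tallies the status codes, instead of printing one per line, may make it easier to see when throttling kicks in. This is a sketch using only Ruby's standard library; `probe_statuses` and its `fetch:` parameter are hypothetical names, and the `fetch:` hook exists only so the tally logic can be exercised without touching the network.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: request the URL repeatedly with the given user agent
# and return a tally of the HTTP status codes seen, e.g. {"200"=>58, ...}.
# A custom `fetch` callable can be injected to avoid real network traffic.
def probe_statuses(url, user_agent, attempts: 60, fetch: nil)
  fetch ||= lambda do |uri, agent|
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.get(uri.request_uri, "User-Agent" => agent).code
    end
  end
  uri = URI(url)
  Array.new(attempts) { fetch.call(uri, user_agent) }.tally
end

# Offline demonstration with a stub that starts refusing after three requests.
seen = 0
stub = lambda do |_uri, _agent|
  seen += 1
  seen <= 3 ? "200" : "429"
end
demo = probe_statuses("https://example.com/", "ExampleBot/1.0", attempts: 5, fetch: stub)
p demo

# Real usage (performs 60 live requests):
#   probe_statuses("https://fromthepage.com/benwbrum",
#                  "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)")
```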

@benwbrum
Owner Author

This is still not working, even after adding a blocklist. It looks like our addition of Rails.middleware.use Rack::Attack is re-loading the rack-attack gem and overwriting the configuration from the initializers. Or perhaps the initializer isn't getting executed at all?

Production

2.7.3 :006 > Rack::Attack.blocklists
 => {} 
2.7.3 :007 > Rack::Attack.safelists
 => {} 
fromthepage@fromthepage:~/deployment/current$ RAILS_ENV=production bundle exec rails middleware | grep Attack
use Rack::Attack

Development

2.7.3 :005 > Rack::Attack.blocklists
 => {"block bad bots"=>#<Rack::Attack::Blocklist:0x000056104a6d08e8 @name="block bad bots", @block=#<Proc:0x000056104a6d0898 /home/benwbrum/dev/products/fromthepage/fromthepage/config/initializers/rack_attack.rb:43>, @type=:blocklist>} 
2.7.3 :006 > Rack::Attack.safelists
 => {"allow from localhost"=>#<Rack::Attack::Safelist:0x000056104a6d2990 @name="allow from localhost", @block=#<Proc:0x000056104a6d2918 /home/benwbrum/dev/products/fromthepage/fromthepage/config/initializers/rack_attack.rb:19>, @type=:safelist>} 
2.7.3 :007 > exit
benwbrum@sparckjones:~/dev/products/fromthepage/fromthepage$ rails middleware | grep Attack
use Rack::Attack
use Rack::Attack
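
To compare the two environments side by side, the checks above can be folded into one small report. This is only a sketch: `rack_attack_report` is a hypothetical helper, and the inputs below are transcribed by hand from the console sessions above (with the rule objects replaced by `true`), not read from a running app.

```ruby
# Hypothetical helper: summarize how Rack::Attack is wired in one environment.
# middleware_output is the text printed by `rails middleware`; blocklists and
# safelists are the hashes returned by Rack::Attack.blocklists and .safelists.
def rack_attack_report(middleware_output, blocklists, safelists)
  {
    mounts: middleware_output.lines.count { |line| line.strip == "use Rack::Attack" },
    blocklists: blocklists.keys,
    safelists: safelists.keys
  }
end

# Values transcribed from the production and development sessions above.
production  = rack_attack_report("use Rack::Attack\n", {}, {})
development = rack_attack_report("use Rack::Attack\nuse Rack::Attack\n",
                                 { "block bad bots" => true },
                                 { "allow from localhost" => true })

p production
p development
```

In this thread the two reports differ in both respects: production mounts the middleware once but has no rules registered, while development has the rules but mounts the middleware twice.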

@benwbrum benwbrum reopened this May 29, 2024
benwbrum added a commit that referenced this issue May 29, 2024
benwbrum added a commit that referenced this issue May 30, 2024
@benwbrum
Owner Author

Success!

$ curl -A 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)' https://fromthepage.com/benwbrum
Forbidden

saracarl added a commit that referenced this issue Nov 18, 2024
…nitializer

Externalize rack attack initializer #4130