# Bot Detection

## Overview
Bot detection is the foundation of RndrKit's pre-rendering service. Every incoming request is analyzed at the Nginx level to determine whether the visitor is a search engine bot or a human user. Bots receive pre-rendered HTML, while humans are proxied to the origin server.
## How Detection Works
Nginx inspects the `User-Agent` header of every incoming request and matches it against a list of known bot patterns. Because this matching happens inside Nginx, before the request ever reaches the application layer, it adds negligible latency to each request.
```nginx
map $http_user_agent $is_bot {
    default 0;
    ~*googlebot 1;
    ~*bingbot 1;
    ~*facebookexternalhit 1;
    # ... 100+ patterns
}
```
The $is_bot variable is set to 1 for bots and 0 for humans. This value is passed to the Express API via the X-Is-Bot header, which determines the request handling path.
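On the application side, all the Express API has to do is branch on that header. A minimal sketch (not RndrKit's actual code; the function name and return values are illustrative):

```typescript
// Illustrative sketch: choose the handling path from the X-Is-Bot header
// that Nginx sets upstream. "render" = serve pre-rendered HTML,
// "proxy" = pass the request through to the origin server.
type HandlingPath = "render" | "proxy";

function handlingPath(headers: Record<string, string | undefined>): HandlingPath {
  // Node lowercases incoming header names, so the header is read as "x-is-bot".
  return headers["x-is-bot"] === "1" ? "render" : "proxy";
}
```

In a real Express handler this branch would live in middleware, with `req.headers["x-is-bot"]` as the input; anything other than `"1"` (including a missing header) falls through to the proxy path.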
## Detected Bot Categories
RndrKit detects bots across several categories:
### Search Engine Crawlers
The primary target for pre-rendering. These bots index your content for search results.
| Bot | Service |
|---|---|
| Googlebot | Google Search |
| Bingbot | Microsoft Bing |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu (China) |
| YandexBot | Yandex (Russia) |
| Applebot | Apple/Siri |
| Slurp | Yahoo |
| SeznamBot | Seznam (Czech) |
| NaverBot / Yeti | Naver (Korea) |
| Sogou | Sogou (China) |
| Qwantify | Qwant (EU) |
Google also uses several specialized crawlers that are detected separately: Google-InspectionTool, AdsBot-Google, Mediapartners-Google, and FeedFetcher-Google.
### Social Media Crawlers
These bots fetch your pages to generate link previews when someone shares your URL.
| Bot | Service |
|---|---|
| facebookexternalhit | Facebook/Meta |
| Twitterbot | Twitter/X |
| LinkedInBot | LinkedIn |
| PinterestBot | Pinterest |
| Slackbot | Slack |
| TelegramBot | Telegram |
| DiscordBot | Discord |
| RedditBot | Reddit |
| Mastodon | Mastodon |
Pre-rendering for social media crawlers ensures your Open Graph tags, Twitter Cards, and preview images display correctly in link previews.
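For reference, this is the kind of markup those crawlers look for in the rendered `<head>` (the values here are placeholders; use your page's real title, description, and image URL):

```html
<!-- Placeholder values -- substitute your page's actual metadata -->
<meta property="og:title" content="Example Page Title" />
<meta property="og:description" content="A short description shown in link previews." />
<meta property="og:image" content="https://www.example.com/preview.png" />
<meta name="twitter:card" content="summary_large_image" />
```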
### AI and LLM Crawlers
Modern AI companies crawl the web for training data and search augmentation.
| Bot | Company |
|---|---|
| GPTBot | OpenAI |
| ChatGPT-User | OpenAI |
| OAI-SearchBot | OpenAI |
| ClaudeBot | Anthropic |
| Anthropic-AI | Anthropic |
| CCBot | Common Crawl |
| Google-Extended | Google AI |
| PerplexityBot | Perplexity |
| Bytespider | ByteDance |
| AmazonBot | Amazon |
| Meta-ExternalAgent | Meta |
You can block AI crawlers from accessing your content using robots.txt rules.
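Because these crawlers identify themselves by name, a robots.txt fragment like the following blocks several of the AI bots listed above (these vendors state that their crawlers honor robots.txt; check each vendor's current documentation for the exact token):

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```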
### SEO Tool Crawlers
Professional SEO tools crawl your site for analysis.
| Bot | Tool |
|---|---|
| SemrushBot | Semrush |
| AhrefsBot | Ahrefs |
| MJ12bot | Majestic |
| DotBot | Moz |
| Screaming | Screaming Frog |
| DataForSeo | DataForSEO |
## Excluded Bots
Some bots are explicitly excluded from pre-rendering even though their user-agent strings would normally match generic bot patterns. These bots get proxied straight to your origin just like human visitors, so they do not consume any of your monthly renders:
| Bot | Service | Why Excluded |
|---|---|---|
| DotBot | Moz | Research crawler -- wastes renders without SEO benefit |
| UptimeRobot | UptimeRobot | Uptime monitoring -- just checks if your site is up |
This is handled using Nginx's map directive, which uses a first-match-wins approach. The exclusion rules for DotBot and UptimeRobot are placed before the generic ~*bot pattern so they match first and get set to 0 (not a bot), even though the generic pattern would otherwise catch them.
```nginx
map $http_user_agent $is_bot {
    default 0;
    ~*dotbot 0;        # Exclude DotBot -- must come BEFORE generic ~*bot
    ~*uptimerobot 0;   # Exclude UptimeRobot
    ~*googlebot 1;
    ~*bot[^a-z] 1;     # Generic fallback -- catches remaining bots
}
```
If you ever need to exclude additional bots, the key thing to remember is that the exclusion rule must appear before the generic pattern that would otherwise match it.
## Generic Patterns
As a fallback, RndrKit detects bots using generic patterns that catch crawlers not explicitly named:
- User-agents containing `bot`, `spider`, `crawl`, or `preview`
- Headless browsers (`HeadlessChrome`, `PhantomJS`)
- Performance tools (`Lighthouse`, `PageSpeed`, `GTmetrix`)
- Monitoring services (`Pingdom`, `StatusCake`)
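As a rough illustration, this fallback behaves like a single case-insensitive regex (abbreviated here; the real pattern list is longer, and the production Nginx rules use boundaries such as `bot[^a-z]` to reduce false matches):

```typescript
// Rough, abbreviated approximation of the generic fallback patterns.
// Not the production rules -- those live in Nginx and are more careful.
const genericBotPattern =
  /bot|spider|crawl|preview|headlesschrome|phantomjs|lighthouse|pagespeed|gtmetrix|pingdom|statuscake/i;

const isGenericBot = (userAgent: string): boolean =>
  genericBotPattern.test(userAgent);
```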
## Request Flow After Detection
```
Incoming Request
        |
        v
Nginx: Check User-Agent
        |
        +--> is_bot = 1
        |       |
        |       v
        |   Set X-Is-Bot: 1 header
        |       |
        |       v
        |   Apply bot rate limit (2 req/sec per IP)
        |       |
        |       v
        |   Forward to Express API -> Render/Cache path
        |
        +--> is_bot = 0
                |
                v
        Set X-Is-Bot: 0 header
                |
                v
        Forward to Express API -> Proxy to origin
```
## Rate Limiting
Bot traffic is rate-limited separately from human traffic to prevent abuse:
- Human traffic: 10 requests per second per IP with burst of 20
- Bot traffic: 2 requests per second per IP with burst of 5
This prevents aggressive crawlers from overwhelming the rendering pipeline while still allowing legitimate crawling.
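Tiered limits like these can be expressed with Nginx's `limit_req` module along the following lines (a sketch, not RndrKit's actual configuration; zone names and sizes are illustrative). The trick is that `limit_req` ignores requests whose zone key is empty, so each request counts against exactly one zone:

```nginx
# Map each request's IP into exactly one zone key; the other key stays empty.
map $is_bot $human_key { 0 $binary_remote_addr; default ""; }
map $is_bot $bot_key   { 1 $binary_remote_addr; default ""; }

limit_req_zone $human_key zone=humans:10m rate=10r/s;
limit_req_zone $bot_key   zone=bots:10m   rate=2r/s;

server {
    location / {
        limit_req zone=humans burst=20 nodelay;
        limit_req zone=bots   burst=5  nodelay;
        # ... hand off to the Express API here ...
    }
}
```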
## False Positives and Negatives

### False Positives
Occasionally, a human user might have a user-agent string that triggers bot detection. This is rare because the patterns are specific to known bot identifiers. If it does happen, the user would receive pre-rendered HTML instead of being proxied -- the content is the same, so the impact is minimal.
### False Negatives
Some bots use custom or generic user-agent strings that do not match any pattern. These bots will be proxied to the origin like human users. RndrKit's pattern list is regularly updated to cover new bots as they appear.
## Verifying Bot Detection
You can test whether bot detection is working for your domain:
```bash
# Should receive pre-rendered HTML (bot)
curl -s -A "Googlebot/2.1" "https://www.example.com/" | head -20

# Should receive origin response (human)
curl -s -A "Mozilla/5.0" "https://www.example.com/" | head -20
```
The pre-rendered response will contain fully rendered HTML with all content, while the origin response will typically be a minimal SPA shell.
## Next Steps
- Rendering Pipeline -- What happens after a bot is detected
- Origin Proxy -- How human requests are handled
- Robots.txt Editor -- Control which bots can access your content