Bot Detection

Overview

Bot detection is the foundation of RndrKit's pre-rendering service. Every incoming request is analyzed at the Nginx level to determine whether the visitor is a search engine bot or a human user. Bots receive pre-rendered HTML, while humans are proxied to the origin server.

How Detection Works

Nginx inspects the User-Agent header of every incoming request and matches it against a list of known bot patterns. The match happens before the request reaches the application layer, so detection adds negligible latency.

map $http_user_agent $is_bot {
    default 0;
    ~*googlebot 1;
    ~*bingbot 1;
    ~*facebookexternalhit 1;
    # ... 100+ patterns
}

The $is_bot variable is set to 1 for bots and 0 for humans. This value is passed to the Express API via the X-Is-Bot header, which determines the request handling path.
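On the Express side, the branch can be sketched as a small middleware that reads the header. This is an illustrative sketch only — the handler wiring and names below are assumptions, not RndrKit's actual API:

```javascript
// Sketch: dispatch on the X-Is-Bot header set by Nginx.
// Node lower-cases incoming header names, so we read "x-is-bot".
function chooseHandlingPath(req) {
  return req.headers["x-is-bot"] === "1" ? "render" : "proxy";
}

// Hypothetical Express middleware: tag the request so downstream
// middleware can pick the render/cache path or the origin proxy.
function botDispatch(req, res, next) {
  req.handlingPath = chooseHandlingPath(req);
  next();
}
```

Treating a missing header as "proxy" is the safe default: if Nginx ever fails to set it, the visitor is handled like a human and no render is consumed.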

Detected Bot Categories

RndrKit detects bots across several categories:

Search Engine Crawlers

The primary target for pre-rendering. These bots index your content for search results.

Bot                Service
---------------    ----------------
Googlebot          Google Search
Bingbot            Microsoft Bing
DuckDuckBot        DuckDuckGo
Baiduspider        Baidu (China)
YandexBot          Yandex (Russia)
Applebot           Apple/Siri
Slurp              Yahoo
SeznamBot          Seznam (Czech)
NaverBot / Yeti    Naver (Korea)
Sogou              Sogou (China)
Qwantify           Qwant (EU)

Google also uses several specialized crawlers that are detected separately: Google-InspectionTool, AdsBot-Google, Mediapartners-Google, and FeedFetcher-Google.

Social Media Crawlers

These bots fetch your pages to generate link previews when someone shares your URL.

Bot                    Service
-------------------    ------------
facebookexternalhit    Facebook
Twitterbot             Twitter/X
LinkedInBot            LinkedIn
PinterestBot           Pinterest
WhatsApp               WhatsApp
Slackbot               Slack
TelegramBot            Telegram
DiscordBot             Discord
RedditBot              Reddit
Mastodon               Mastodon

Pre-rendering for social media crawlers ensures your Open Graph tags, Twitter Cards, and preview images display correctly in link previews.

AI and LLM Crawlers

Modern AI companies crawl the web for training data and search augmentation.

Bot                   Company
------------------    --------------
GPTBot                OpenAI
ChatGPT-User          OpenAI
OAI-SearchBot         OpenAI
ClaudeBot             Anthropic
Anthropic-AI          Anthropic
CCBot                 Common Crawl
Google-Extended       Google AI
PerplexityBot         Perplexity
Bytespider            ByteDance
AmazonBot             Amazon
Meta-ExternalAgent    Meta

You can block AI crawlers from accessing your content using robots.txt rules.
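For example, a robots.txt that opts specific AI crawlers out while leaving search engines untouched might look like this (adjust the user-agent list to the crawlers you care about):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Note that robots.txt is advisory: well-behaved crawlers honor it, but it is not enforcement.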

SEO Tool Crawlers

Professional SEO tools crawl your site for analysis.

Bot           Tool
----------    --------------
SemrushBot    Semrush
AhrefsBot     Ahrefs
MJ12bot       Majestic
DotBot        Moz
Screaming     Screaming Frog
DataForSeo    DataForSEO

Excluded Bots

Some bots are explicitly excluded from pre-rendering even though their user-agent strings would normally match generic bot patterns. These bots get proxied straight to your origin just like human visitors, so they do not consume any of your monthly renders:

Bot            Service        Why Excluded
-----------    -----------    ------------------------------------------------------
DotBot         Moz            Research crawler -- wastes renders without SEO benefit
UptimeRobot    UptimeRobot    Uptime monitoring -- just checks if your site is up

This is handled by Nginx's map directive, which evaluates regular-expression patterns in source order and stops at the first match. The exclusion rules for DotBot and UptimeRobot are placed before the generic ~*bot pattern, so they match first and set the variable to 0 (not a bot), even though the generic pattern would otherwise catch them.

map $http_user_agent $is_bot {
    default 0;
    ~*dotbot 0;         # Exclude DotBot -- must come BEFORE generic ~*bot
    ~*uptimerobot 0;    # Exclude UptimeRobot
    ~*googlebot 1;
    ~*bot[^a-z] 1;      # Generic fallback -- catches remaining bots
}

If you need to exclude additional bots, remember that the exclusion rule must appear before any generic pattern that would otherwise match the same user-agent.
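The ordering rule can be illustrated outside Nginx with a small first-match-wins matcher. This is a simplified model of the map above, not RndrKit's actual pattern list:

```javascript
// Ordered pattern list: the first matching regex wins, mirroring how
// Nginx's map evaluates regex entries in source order.
const patterns = [
  [/dotbot/i, 0],      // exclusion -- must precede the generic rule
  [/uptimerobot/i, 0], // exclusion
  [/googlebot/i, 1],
  [/bot[^a-z]/i, 1],   // generic fallback -- catches remaining bots
];

function isBot(userAgent) {
  for (const [re, value] of patterns) {
    if (re.test(userAgent)) return value;
  }
  return 0; // default: not a bot
}
```

Moving the /dotbot/i entry below the generic /bot[^a-z]/i rule would flip DotBot's classification to 1, which is exactly the mistake the ordering guards against.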

Generic Patterns

As a fallback, RndrKit detects bots using generic patterns that catch crawlers not explicitly named:

  • User-agents containing bot, spider, crawl, or preview
  • Headless browsers (HeadlessChrome, PhantomJS)
  • Performance tools (Lighthouse, PageSpeed, GTmetrix)
  • Monitoring services (Pingdom, StatusCake)
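A rough single-regex approximation of these fallback categories looks like the following. This is illustrative only; the production list is more precise than a flat substring match:

```javascript
// Simplified catch-all covering the generic categories above:
// bot/spider/crawl/preview keywords, headless browsers,
// performance tools, and monitoring services.
const genericBotPattern =
  /bot|spider|crawl|preview|headlesschrome|phantomjs|lighthouse|pagespeed|gtmetrix|pingdom|statuscake/i;

function matchesGenericPattern(userAgent) {
  return genericBotPattern.test(userAgent);
}
```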

Request Flow After Detection

Incoming Request
    |
    v
Nginx: Check User-Agent
    |
    +--> is_bot = 1
    |       |
    |       v
    |    Set X-Is-Bot: 1 header
    |       |
    |       v
    |    Apply bot rate limit (2 req/sec per IP)
    |       |
    |       v
    |    Forward to Express API -> Render/Cache path
    |
    +--> is_bot = 0
            |
            v
         Set X-Is-Bot: 0 header
            |
            v
         Forward to Express API -> Proxy to origin

Rate Limiting

Bot traffic is rate-limited separately from human traffic to prevent abuse:

  • Human traffic: 10 requests per second per IP with burst of 20
  • Bot traffic: 2 requests per second per IP with burst of 5

This prevents aggressive crawlers from overwhelming the rendering pipeline while still allowing legitimate crawling.
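For intuition, the bot-tier numbers (2 req/sec, burst of 5) behave roughly like the token-bucket sketch below. Note this is an approximation: Nginx's limit_req actually implements a leaky-bucket algorithm, and the class here is not RndrKit code.

```javascript
// Token-bucket approximation of per-IP rate limiting.
class RateLimiter {
  constructor(ratePerSec, burst) {
    this.rate = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full: an idle IP can burst immediately
    this.lastTime = 0;
  }

  // nowSeconds: monotonic timestamp of the incoming request.
  allow(nowSeconds) {
    const elapsed = nowSeconds - this.lastTime;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.rate);
    this.lastTime = nowSeconds;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request proceeds
    }
    return false; // request rejected (over limit)
  }
}

const botLimiter = new RateLimiter(2, 5); // bot tier: 2 req/sec, burst 5
```

An idle crawler can fire 5 requests at once, then is throttled to 2 per second until it backs off.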

False Positives and Negatives

False Positives

Occasionally, a human user might have a user-agent string that triggers bot detection. This is rare because the patterns are specific to known bot identifiers. If it does happen, the user would receive pre-rendered HTML instead of being proxied -- the content is the same, so the impact is minimal.

False Negatives

Some bots use custom or generic user-agent strings that do not match any pattern. These bots will be proxied to the origin like human users. RndrKit's pattern list is regularly updated to cover new bots as they appear.

Verifying Bot Detection

You can test whether bot detection is working for your domain:

# Should receive pre-rendered HTML (bot)
curl -s -A "Googlebot/2.1" "https://www.example.com/" | head -20

# Should receive origin response (human)
curl -s -A "Mozilla/5.0" "https://www.example.com/" | head -20

The pre-rendered response will contain fully rendered HTML with all content, while the origin response will typically be a minimal SPA shell.

Next Steps