Bot Detection

Overview

Bot detection is the foundation of RndrKit's pre-rendering service. Every incoming request is analyzed at the Nginx level to determine whether the visitor is a search engine bot or a human user. Bots receive pre-rendered HTML, while humans are proxied to the origin server.

How Detection Works

Nginx inspects the User-Agent header of every incoming request and matches it against a list of known bot patterns. The match happens before the request reaches the application layer, so detection adds negligible latency.

map $http_user_agent $is_bot {
    default 0;
    ~*googlebot 1;
    ~*bingbot 1;
    ~*facebookexternalhit 1;
    # ... 100+ patterns
}

The $is_bot variable is set to 1 for bots and 0 for humans. This value is passed to the Express API via the X-Is-Bot header, which determines the request handling path.
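On the Express side, the branch can be sketched as a small middleware that reads the header. This is an illustrative sketch only — the handler wiring and names below are assumptions, not RndrKit's actual API:

```javascript
// Sketch: dispatch on the X-Is-Bot header set by Nginx.
// Node lower-cases incoming header names, so we read "x-is-bot".
function chooseHandlingPath(req) {
  return req.headers["x-is-bot"] === "1" ? "render" : "proxy";
}

// Hypothetical Express middleware: tag the request so downstream
// middleware can pick the render/cache path or the origin proxy.
function botDispatch(req, res, next) {
  req.handlingPath = chooseHandlingPath(req);
  next();
}
```

Treating a missing header as "proxy" is the safe default: if Nginx ever fails to set it, the visitor is handled like a human and no render is consumed.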

Detected Bot Categories

RndrKit detects bots across several categories:

Search Engine Crawlers

The primary target for pre-rendering. These bots index your content for search results.

Bot                Service
---------------    ----------------
Googlebot          Google Search
Bingbot            Microsoft Bing
DuckDuckBot        DuckDuckGo
Baiduspider        Baidu (China)
YandexBot          Yandex (Russia)
Applebot           Apple/Siri
Slurp              Yahoo
SeznamBot          Seznam (Czech)
NaverBot / Yeti    Naver (Korea)
Sogou              Sogou (China)
Qwantify           Qwant (EU)

Google also uses several specialized crawlers that are detected separately: Google-InspectionTool, AdsBot-Google, Mediapartners-Google, and FeedFetcher-Google.

Social Media Crawlers

These bots fetch your pages to generate link previews when someone shares your URL.

Bot                    Service
-------------------    ------------
facebookexternalhit    Facebook
Twitterbot             Twitter/X
LinkedInBot            LinkedIn
PinterestBot           Pinterest
WhatsApp               WhatsApp
Slackbot               Slack
TelegramBot            Telegram
DiscordBot             Discord
RedditBot              Reddit
Mastodon               Mastodon

Pre-rendering for social media crawlers ensures your Open Graph tags, Twitter Cards, and preview images display correctly in link previews.

AI and LLM Crawlers

Modern AI companies crawl the web for training data and search augmentation.

Bot                   Company
------------------    --------------
GPTBot                OpenAI
ChatGPT-User          OpenAI
OAI-SearchBot         OpenAI
ClaudeBot             Anthropic
Anthropic-AI          Anthropic
CCBot                 Common Crawl
Google-Extended       Google AI
PerplexityBot         Perplexity
Bytespider            ByteDance
AmazonBot             Amazon
Meta-ExternalAgent    Meta

You can block AI crawlers from accessing your content using robots.txt rules.
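For example, a robots.txt that opts specific AI crawlers out while leaving search engines untouched might look like this (adjust the user-agent list to the crawlers you care about):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Note that robots.txt is advisory: well-behaved crawlers honor it, but it is not enforcement.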

SEO Tool Crawlers

Professional SEO tools crawl your site for analysis.

Bot           Tool
----------    --------------
SemrushBot    Semrush
AhrefsBot     Ahrefs
MJ12bot       Majestic
DotBot        Moz
Screaming     Screaming Frog
DataForSeo    DataForSEO

Excluded Bots

Some bots are explicitly excluded from pre-rendering even though their user-agent strings would normally match generic bot patterns. These bots get proxied straight to your origin just like human visitors, so they do not consume any of your monthly renders:

Bot            Service        Why Excluded
-----------    -----------    ------------------------------------------------------
DotBot         Moz            Research crawler -- wastes renders without SEO benefit
UptimeRobot    UptimeRobot    Uptime monitoring -- just checks if your site is up

This is handled by Nginx's map directive, which evaluates regular-expression patterns in source order and stops at the first match. The exclusion rules for DotBot and UptimeRobot are placed before the generic ~*bot pattern, so they match first and set the variable to 0 (not a bot), even though the generic pattern would otherwise catch them.

map $http_user_agent $is_bot {
    default 0;
    ~*dotbot 0;         # Exclude DotBot -- must come BEFORE generic ~*bot
    ~*uptimerobot 0;    # Exclude UptimeRobot
    ~*googlebot 1;
    ~*bot[^a-z] 1;      # Generic fallback -- catches remaining bots
}

If you need to exclude additional bots, remember that the exclusion rule must appear before any generic pattern that would otherwise match the same user-agent.
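The ordering rule can be illustrated outside Nginx with a small first-match-wins matcher. This is a simplified model of the map above, not RndrKit's actual pattern list:

```javascript
// Ordered pattern list: the first matching regex wins, mirroring how
// Nginx's map evaluates regex entries in source order.
const patterns = [
  [/dotbot/i, 0],      // exclusion -- must precede the generic rule
  [/uptimerobot/i, 0], // exclusion
  [/googlebot/i, 1],
  [/bot[^a-z]/i, 1],   // generic fallback -- catches remaining bots
];

function isBot(userAgent) {
  for (const [re, value] of patterns) {
    if (re.test(userAgent)) return value;
  }
  return 0; // default: not a bot
}
```

Moving the /dotbot/i entry below the generic /bot[^a-z]/i rule would flip DotBot's classification to 1, which is exactly the mistake the ordering guards against.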

Generic Patterns

As a fallback, RndrKit detects bots using generic patterns that catch crawlers not explicitly named:

  • User-agents containing bot, spider, crawl, or preview
  • Headless browsers (HeadlessChrome, PhantomJS)
  • Performance tools (Lighthouse, PageSpeed, GTmetrix)
  • Monitoring services (Pingdom, StatusCake)
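A rough single-regex approximation of these fallback categories looks like the following. This is illustrative only; the production list is more precise than a flat substring match:

```javascript
// Simplified catch-all covering the generic categories above:
// bot/spider/crawl/preview keywords, headless browsers,
// performance tools, and monitoring services.
const genericBotPattern =
  /bot|spider|crawl|preview|headlesschrome|phantomjs|lighthouse|pagespeed|gtmetrix|pingdom|statuscake/i;

function matchesGenericPattern(userAgent) {
  return genericBotPattern.test(userAgent);
}
```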

Request Flow After Detection

Incoming Request
    |
    v
Nginx: Check User-Agent
    |
    +--> is_bot = 1
    |       |
    |       v
    |    Set X-Is-Bot: 1 header
    |       |
    |       v
    |    Apply bot rate limit (2 req/sec per IP)
    |       |
    |       v
    |    Forward to Express API -> Render/Cache path
    |
    +--> is_bot = 0
            |
            v
         Set X-Is-Bot: 0 header
            |
            v
         Forward to Express API -> Proxy to origin

Rate Limiting

Bot traffic is rate-limited separately from human traffic to prevent abuse:

  • Human traffic: 10 requests per second per IP with burst of 20
  • Bot traffic: 2 requests per second per IP with burst of 5

This prevents aggressive crawlers from overwhelming the rendering pipeline while still allowing legitimate crawling.
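For intuition, the bot-tier numbers (2 req/sec, burst of 5) behave roughly like the token-bucket sketch below. Note this is an approximation: Nginx's limit_req actually implements a leaky-bucket algorithm, and the class here is not RndrKit code.

```javascript
// Token-bucket approximation of per-IP rate limiting.
class RateLimiter {
  constructor(ratePerSec, burst) {
    this.rate = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full: an idle IP can burst immediately
    this.lastTime = 0;
  }

  // nowSeconds: monotonic timestamp of the incoming request.
  allow(nowSeconds) {
    const elapsed = nowSeconds - this.lastTime;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.rate);
    this.lastTime = nowSeconds;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request proceeds
    }
    return false; // request rejected (over limit)
  }
}

const botLimiter = new RateLimiter(2, 5); // bot tier: 2 req/sec, burst 5
```

An idle crawler can fire 5 requests at once, then is throttled to 2 per second until it backs off.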

False Positives and Negatives

False Positives

Occasionally, a human user might have a user-agent string that triggers bot detection. This is rare because the patterns are specific to known bot identifiers. If it does happen, the user would receive pre-rendered HTML instead of being proxied -- the content is the same, so the impact is minimal.

False Negatives

Some bots use custom or generic user-agent strings that do not match any pattern. These bots will be proxied to the origin like human users. RndrKit's pattern list is regularly updated to cover new bots as they appear.

Verifying Bot Detection

You can test whether bot detection is working for your domain:

# Should receive pre-rendered HTML (bot)
curl -s -A "Googlebot/2.1" "https://www.example.com/" | head -20

# Should receive origin response (human)
curl -s -A "Mozilla/5.0" "https://www.example.com/" | head -20

The pre-rendered response will contain fully rendered HTML with all content, while the origin response will typically be a minimal SPA shell.

Next Steps