Blocking Fake Googlebot (and Bingbot) Traffic With Nginx

Take a look at these access log lines. Every one of them claims to be Googlebot:

206.206.73.82 - - [15/Jun/2026:07:58:15 +0000] "GET /shop/category/outdoor/?colour=green&sort=price HTTP/2.0" 504 176 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
45.43.93.112 - - [15/Jun/2026:07:58:16 +0000] "GET /shop/category/outdoor/?brand=acme&page=14 HTTP/2.0" 200 46421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
104.252.28.99 - - [15/Jun/2026:07:58:16 +0000] "GET /shop/category/footwear/?size=10&colour=black HTTP/2.0" 504 176 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
23.27.236.132 - - [15/Jun/2026:07:58:17 +0000] "GET /shop/category/footwear/?brand=acme&sort=newest HTTP/2.0" 200 44512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Not one of them is Googlebot. They are scrapers and load-testing nuisances wearing Googlebot’s User-Agent so they don’t get blocked, crawling expensive faceted URLs at machine speed and pushing the origin into 504s. The real Googlebot does not behave like this, and it does not come from a scattergun of random consumer ISP ranges.

The obvious reaction - block anything with “Googlebot” in the User-Agent - is exactly the wrong move. That string is also how the real Googlebot identifies itself, so a blanket block would shut Google out of your site and take your rankings with it. You need to separate the genuine crawler from the impostors, and the User-Agent alone cannot do that.

The good news is that Google and Bing both tell you precisely which IP addresses their crawlers use. This post wires that into nginx: verify the claim against the official ranges, and silently drop anything that fails. Genuine crawlers are never touched.

TL;DR: the whole thing is on GitHub at robwent/nginx-block-fake-googlebot. This is a companion to my earlier piece on integrating AbuseIPDB with the Nginx Bad Bot Blocker, which handles the IPs that don’t bother to pretend.

How Google and Bing let you verify their crawlers

Google documents two ways to confirm a request is really from one of its crawlers. The first is reverse DNS: do a reverse lookup on the IP, check the hostname ends in googlebot.com, google.com or googleusercontent.com, then do a forward lookup on that hostname to confirm it resolves back to the same IP. It is accurate, but it means a pair of DNS lookups per request, which is a poor fit for a hot nginx path.
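For a one-off check of a single suspicious IP from the log, the reverse-DNS method can be sketched in shell. This is a minimal sketch, not part of the project: the function names are made up for illustration, it assumes dig is installed, and it compares addresses as plain strings, so an IPv6 address has to be written the same way dig prints it.

```shell
#!/usr/bin/env bash
# Forward-confirmed reverse DNS, as a manual diagnostic (not for the hot path).
# Function names are hypothetical; requires dig.

# Succeeds if the hostname sits under one of the domains Google documents.
is_google_crawler_host() {
    local host="${1%.}"          # strip the trailing dot dig returns
    case "$host" in
        *.googlebot.com|*.google.com|*.googleusercontent.com) return 0 ;;
        *) return 1 ;;
    esac
}

# verify_googlebot_ip <ip>: reverse lookup, domain check, then forward lookup.
verify_googlebot_ip() {
    local ip="$1" host
    host="$(dig +short -x "$ip" | head -n 1)"
    [ -n "$host" ] || return 1
    is_google_crawler_host "$host" || return 1
    # The forward lookup must return the address we started with.
    { dig +short "$host" A; dig +short "$host" AAAA; } | grep -qxF "$ip"
}

# Usage:
#   verify_googlebot_ip 206.206.73.82 && echo genuine || echo "not verified"
```

The domain check alone is not enough, because anyone can set a reverse record on their own IP to any name they like; it is the forward lookup that closes that hole.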

The second is the one we want: Google and Bing each publish the full list of IP ranges their crawlers use as a JSON file, in CIDR notation.

Google (common crawlers, including Googlebot): common-crawlers.json
Bing (bingbot / msnbot): bingbot.json

One thing worth flagging if you go looking for older guides: in March 2026 Google moved these files from the legacy /search/apis/ipranges/ path to /crawling/ipranges/, and renamed the Googlebot file to common-crawlers.json. A lot of scripts online still point at the dead googlebot.json URL. The one above is current.

Both files share the same simple shape - a prefixes array of ipv4Prefix and ipv6Prefix entries:

{
  "creationTime": "2026-05-21T14:46:08.000000",
  "prefixes": [
    { "ipv6Prefix": "2001:4860:4801:10::/64" },
    { "ipv4Prefix": "66.249.64.0/27" },
    { "ipv4Prefix": "192.178.4.0/27" }
  ]
}

The plan: turn those prefixes into an nginx allowlist, check whether the client IP is on it, and cross-reference that against what the User-Agent claims to be.

The nginx approach

Three building blocks do the work, all in the http{} context.

A geo block flags whether the client IP is a verified crawler address. geo compiles its ranges into a radix tree at config-load time, so the per-request lookup is effectively constant-time no matter how many hundreds of ranges are in the list. One block for Google, one for Bing.

A map reads the User-Agent and decides which crawler, if any, it is claiming to be.

A final map combines the two into a single $fake_bot flag: it is set to 1 only when the claim and the IP disagree - the User-Agent says Googlebot but the IP is not a Google address, or it says Bingbot but the IP is not a Bing address. Real crawlers (right UA, right IP) and ordinary human traffic (no crawler UA) both fall through to 0.

Then a one-line if in each server block drops the fakes.
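The IP half of that check can be reproduced offline, which is handy when deciding by hand whether an address from the log is on a list. A minimal sketch in bash, IPv4 only: ip_to_int, ip_in_cidr and ip_on_list are hypothetical helper names, not part of the project, and the list file is assumed to hold one "<cidr> 1;" entry per line. It walks the list linearly, so it illustrates the match itself rather than the radix-tree lookup nginx performs.

```shell
#!/usr/bin/env bash
# Offline IPv4 membership test against a geo-style list file.
# Helper names are hypothetical; IPv6 entries are skipped.

# ip_to_int <a.b.c.d>: prints the address as a 32-bit integer.
ip_to_int() {
    local IFS=. a b c d
    read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# ip_in_cidr <ipv4> <cidr>: succeeds if the address falls inside the range.
ip_in_cidr() {
    local ip="$1" net="${2%/*}" bits="${2#*/}" mask
    mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# ip_on_list <ipv4> <file>: succeeds if any IPv4 range in the file matches.
ip_on_list() {
    local cidr _
    while read -r cidr _; do
        case "$cidr" in ''|'#'*|*:*) continue ;; esac   # blank, comment, IPv6
        ip_in_cidr "$1" "$cidr" && return 0
    done < "$2"
    return 1
}

# Usage:
#   ip_on_list 206.206.73.82 /etc/nginx/fake-googlebot/googlebot-ips.conf \
#       && echo "listed" || echo "not listed"
```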

The files

There are two static config files plus a script. First, verified-bots.conf - the definitions, included once in http{}. The two geo blocks pull in the generated IP lists; the maps do the logic:

# verified-bots.conf
#
# Detects spoofed search-engine crawlers. A request whose User-Agent claims to
# be Googlebot or Bingbot but whose IP is NOT in that provider's official
# published ranges is flagged as fake ($fake_bot = 1).

# 1 if the client IP is a verified Googlebot (common-crawler) address.
geo $is_googlebot_ip {
    default 0;
    include /etc/nginx/fake-googlebot/googlebot-ips.conf;
}

# 1 if the client IP is a verified Bingbot address.
geo $is_bingbot_ip {
    default 0;
    include /etc/nginx/fake-googlebot/bingbot-ips.conf;
}

# Which crawler (if any) the User-Agent claims to be. Case-insensitive.
# Googlebot-Image/-Video/-News share the common-crawler ranges, so ~*googlebot
# covers them. AdsBot-Google / Mediapartners-Google / Google-InspectionTool are
# deliberately NOT matched: they use different ranges (special-crawlers /
# user-triggered fetchers) and would false-positive against this list.
map $http_user_agent $claimed_bot {
    default         "";
    "~*googlebot"   google;
    "~*bingbot"     bing;
    "~*msnbot"      bing;
    "~*bingpreview" bing;
    "~*adidxbot"    bing;
}

# Composite decision. Key is "claim:isGoogleIP:isBingIP".
# Only the spoofed combinations are listed; everything else (real verified bots,
# ordinary human traffic) falls through to the default 0.
map "$claimed_bot:$is_googlebot_ip:$is_bingbot_ip" $fake_bot {
    default 0;

    # Claims Googlebot but is not on a Google IP -> fake
    "google:0:0" 1;
    "google:0:1" 1;

    # Claims Bingbot but is not on a Bing IP -> fake
    "bing:0:0"   1;
    "bing:1:0"   1;
}

That composite key is the heart of it. $claimed_bot holds exactly one value per request, so for a request claiming Googlebot the key is either google:1:... (on a Google IP, allowed) or google:0:... (not on a Google IP, fake). Listing only the fake combinations keeps the intent obvious and lets everything else default to safe.
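The same decision table can be mirrored in a few lines of shell to sanity-check the logic away from nginx. This is an illustrative sketch only: claimed_bot and fake_bot are hypothetical function names that imitate the two maps, and the case patterns approximate the case-insensitive regexes with lowercase substring matches.

```shell
#!/usr/bin/env bash
# Shell imitation of the two nginx maps, for reasoning about the truth table.
# Function names are hypothetical.

# claimed_bot <user-agent>: prints google, bing, or nothing.
claimed_bot() {
    local ua
    ua="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    case "$ua" in
        *googlebot*)                                  echo google ;;
        *bingbot*|*msnbot*|*bingpreview*|*adidxbot*)  echo bing ;;
        *)                                            echo "" ;;
    esac
}

# fake_bot <user-agent> <is_google_ip> <is_bing_ip>: prints 1 for a spoof, else 0.
fake_bot() {
    case "$(claimed_bot "$1"):$2:$3" in
        google:0:*|bing:*:0) echo 1 ;;   # claim and IP disagree
        *)                   echo 0 ;;   # verified crawler or ordinary traffic
    esac
}

# Usage:
#   fake_bot "Mozilla/5.0 (compatible; Googlebot/2.1)" 0 0
```

The two wildcard patterns cover the same four keys the nginx map lists explicitly; the map spells them out because exact-string entries are easier to read in a config file.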

Second, fake-bot-block.conf - the enforcement, included in each server block you want protected:

# fake-bot-block.conf
#
# Drops requests that spoof Googlebot/Bingbot. 444 closes the connection with
# no response. Depends on the $fake_bot variable defined in verified-bots.conf.

if ($fake_bot) {
    return 444;
}

A bare if that contains only a return is one of the documented-safe uses of if in nginx, so it is fine at server level. 444 is nginx’s special “close the connection with no response” code.

Keeping the lists up to date

The two *-ips.conf files referenced by the geo blocks are generated by this script. It fetches both providers’ JSON, converts each prefix into a geo value line, and reloads nginx - but only if a list actually changed and nginx -t passes first, so a bad fetch can never wedge a running server. Each provider is isolated, so a Bing outage will not wipe your Google list, and it refuses to write a suspiciously short list. It is also distro-agnostic: paths are configurable and the reload falls back across systemctl, service and nginx -s reload.

#!/usr/bin/env bash
#
# update-verified-bot-ips.sh
#
# Fetches the official Googlebot and Bingbot IP ranges and writes them as nginx
# `geo` value files, consumed by verified-bots.conf.
#
# Usage: ./update-verified-bot-ips.sh [install_dir]
#   install_dir   Default: /etc/nginx/fake-googlebot
#                 Must match the include paths inside verified-bots.conf.
#
# Environment overrides:
#   INSTALL_DIR   Same as the positional argument.
#   RELOAD_CMD    Full reload command (e.g. "systemctl reload nginx").
#
# Dependencies: bash, curl, jq, nginx
#
set -euo pipefail

INSTALL_DIR="${INSTALL_DIR:-${1:-/etc/nginx/fake-googlebot}}"

# Official published sources (verified June 2026).
# Google moved these from /search/apis/ipranges/ to /crawling/ipranges/ in
# March 2026 - the old googlebot.json path is dead.
GOOGLE_URL="https://developers.google.com/static/crawling/ipranges/common-crawlers.json"
BING_URL="https://www.bing.com/toolbox/bingbot.json"

# Refuse to write a list shorter than this many prefixes (guards against a
# truncated / partial response silently shrinking the allowlist).
MIN_PREFIXES=20

RELOAD_CMD="${RELOAD_CMD:-}"

CURL_OPTS="--fail --silent --show-error --location --max-time 30 --retry 3 --retry-delay 5"

log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*"; }
err() { log "ERROR: $*" >&2; }

for bin in curl jq nginx; do
    command -v "$bin" >/dev/null 2>&1 || { err "required command '$bin' not found in PATH"; exit 1; }
done

mkdir -p "$INSTALL_DIR"

TMP_DIR="$(mktemp -d)"
# shellcheck disable=SC2064
trap "rm -rf '$TMP_DIR'" EXIT

CHANGED=0

# build_list <name> <url> <output_file>
build_list() {
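    # Keep per-call state local so one provider's values never leak into the next.
    local name="$1" url="$2" out="$3" json staged count
    json="${TMP_DIR}/${name}.json"
    staged="${out}.new"   # same directory as $out => atomic rename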

    # shellcheck disable=SC2086
    if ! curl $CURL_OPTS -A "verified-bot-ip-updater/1.0 (+nginx)" -o "$json" "$url"; then
        err "fetch failed for ${name} (${url}) - keeping existing list"
        return 1
    fi

    if ! jq -e '.prefixes | length > 0' "$json" >/dev/null 2>&1; then
        err "${name}: response is not valid JSON or has no prefixes - keeping existing list"
        return 1
    fi

    {
        echo "# Auto-generated $(date -u +%FT%TZ)"
        echo "# Source: ${url}"
        echo "# Managed by update-verified-bot-ips.sh - DO NOT EDIT BY HAND"
        jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix) | select(. != null) | "\(.) 1;"' "$json"
    } > "$staged"

    count="$(grep -c ' 1;$' "$staged" || true)"
    if [ "$count" -lt "$MIN_PREFIXES" ]; then
        err "${name}: only ${count} prefixes (< ${MIN_PREFIXES}) - refusing to install, keeping existing list"
        rm -f "$staged"
        return 1
    fi

    if [ -f "$out" ] && cmp -s "$staged" "$out"; then
        rm -f "$staged"
        log "${name}: unchanged (${count} prefixes)"
        return 0
    fi

    mv "$staged" "$out"
    chmod 0644 "$out"
    CHANGED=1
    log "${name}: updated -> ${out} (${count} prefixes)"
    return 0
}

# Portable reload: explicit override wins, otherwise try the common options.
reload_nginx() {
    if [ -n "$RELOAD_CMD" ]; then
        eval "$RELOAD_CMD"
        return $?
    fi
    if command -v systemctl >/dev/null 2>&1; then
        systemctl reload nginx && return 0
    fi
    if command -v service >/dev/null 2>&1; then
        service nginx reload && return 0
    fi
    nginx -s reload
}

build_list "googlebot" "$GOOGLE_URL" "${INSTALL_DIR}/googlebot-ips.conf" || true
build_list "bingbot"   "$BING_URL"   "${INSTALL_DIR}/bingbot-ips.conf"   || true

if [ "$CHANGED" -eq 0 ]; then
    log "No changes; not reloading nginx."
    exit 0
fi

if nginx -t >/dev/null 2>&1; then
    reload_nginx
    log "Lists changed and config valid; nginx reloaded."
else
    err "nginx -t FAILED after updating lists; NOT reloading. Output below:"
    nginx -t || true
    exit 1
fi

The generated googlebot-ips.conf ends up as a plain list of ranges, each marked with the value 1:

# Auto-generated 2026-06-15T00:00:00Z
# Source: https://developers.google.com/static/crawling/ipranges/common-crawlers.json
# Managed by update-verified-bot-ips.sh - DO NOT EDIT BY HAND
2001:4860:4801:10::/64 1;
192.178.4.0/27 1;
66.249.64.0/27 1;
66.249.64.128/27 1;
35.247.243.240/28 1;
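Because these lines are pulled into the nginx config verbatim, one optional extra guard is to check that every generated line really is a CIDR followed by 1; before installing it. The sketch below is not part of the published script: geo_file_is_clean is a hypothetical helper, and the pattern is deliberately loose (it checks the shape of each line, not that octets or prefix lengths are in range).

```shell
#!/usr/bin/env bash
# geo_file_is_clean <file>: succeeds only if every line is a comment or
# "<IPv4-or-IPv6 CIDR> 1;". Hypothetical helper; shape check only.
geo_file_is_clean() {
    local ipv4='([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}'
    local ipv6='[0-9A-Fa-f:]+/[0-9]{1,3}'
    [ -r "$1" ] || return 1
    # grep -v selects lines that do NOT match; any such line means failure.
    ! grep -Evq "^#|^(${ipv4}|${ipv6}) 1;\$" "$1"
}

# Usage:
#   geo_file_is_clean /etc/nginx/fake-googlebot/googlebot-ips.conf || echo "unexpected line"
```

Inside build_list, a call such as geo_file_is_clean "$staged" || { rm -f "$staged"; return 1; } would sit naturally next to the minimum-prefix check, before the mv.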

Installing it

You will need jq (the one dependency that isn’t usually already present): apt install jq, dnf install jq, apk add jq or pkg install jq depending on your system.

Create the directory and the files. Rather than cloning the repo onto the server, the simplest path is to create each file with an editor and paste the contents in from GitHub:

# Create the directories
mkdir -p /etc/nginx/fake-googlebot /opt/scripts

# Create each file and paste in its contents from GitHub
nano /etc/nginx/fake-googlebot/verified-bots.conf
nano /etc/nginx/fake-googlebot/fake-bot-block.conf
nano /opt/scripts/update-verified-bot-ips.sh

Set ownership and permissions. Everything is owned by root; the configs are world-readable, which is harmless because they hold only public IP ranges (nginx itself reads its config as root on load and reload), and the script is not world-accessible:

chown -R root:root /etc/nginx/fake-googlebot
chown root:root /opt/scripts/update-verified-bot-ips.sh
chmod 755 /etc/nginx/fake-googlebot
chmod 644 /etc/nginx/fake-googlebot/*.conf
chmod 750 /opt/scripts/update-verified-bot-ips.sh

Run the script once to generate the IP lists. Do this before adding the includes, or nginx -t will fail on the missing files:

/opt/scripts/update-verified-bot-ips.sh

On this first run the script generates the lists and then reloads nginx, but since the includes are not in place yet, that reload changes nothing as far as bot blocking goes: nginx does not read the list files until a config references them. The blocking only takes effect once you add the includes below.

Load the definitions once, inside the http{} block of your nginx.conf:

include /etc/nginx/fake-googlebot/verified-bots.conf;

Then add the enforcement to each server{} block you want protected:

include /etc/nginx/fake-googlebot/fake-bot-block.conf;

Test and reload:

nginx -t && nginx -s reload

Finally, a cron job to keep the lists current. Twice daily is plenty, since both providers refresh roughly daily, and nginx only reloads when something actually changes:

17 */12 * * * /opt/scripts/update-verified-bot-ips.sh >> /var/log/verified-bot-ips.log 2>&1

If your origin is behind Cloudflare or any other proxy, there is an important prerequisite: geo matches against $remote_addr by default, so you must restore the real client IP first, or every request will look like it comes from the proxy and the check will misfire, dropping genuine crawlers along with the fakes. The nginx-cloudflare-real-ip project handles that.

Testing before and after

Send a spoofed Googlebot request from a machine that is not on a Google IP - your own machine, or the server itself. Before the block is in place, it is served normally:

curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/
# HTTP/2 200

After adding the include and reloading, the same request is dropped:

curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/
# Direct origin:     curl: (52) Empty reply from server   <- the 444
# Behind Cloudflare: HTTP/2 520                            <- see note below

A control request with no crawler User-Agent must still return 200:

curl -I https://example.com/
# HTTP/2 200

Either way, the drop is recorded in the access log as 444, even though the client sees nothing:

203.0.113.10 - - [15/Jun/2026:19:33:33 +0000] "HEAD / HTTP/2.0" 444 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
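To see how much spoofed traffic is being dropped, the 444 lines can be tallied per client IP. A rough sketch, assuming the combined log format shown above (status in the ninth whitespace-separated field); dropped_fakes_by_ip is a hypothetical name and the log path in the usage line is only an example.

```shell
#!/usr/bin/env bash
# dropped_fakes_by_ip <access-log>: prints "<count> <ip>", busiest first, for
# requests answered with 444 that carried a Googlebot/Bingbot-style User-Agent.
# Hypothetical helper; assumes the combined log format.
dropped_fakes_by_ip() {
    awk '$9 == 444 && tolower($0) ~ /googlebot|bingbot|msnbot/ { n[$1]++ }
         END { for (ip in n) print n[ip], ip }' "$1" | sort -rn
}

# Usage (path is an assumption; match it to your access_log directive):
#   dropped_fakes_by_ip /var/log/nginx/access.log | head
```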

The Cloudflare 444 / 520 gotcha

If your site sits behind Cloudflare, the test returns an HTTP 520 rather than an empty reply, and it is worth understanding why, because it looks like an error when it is actually success.

return 444 closes the connection with no response. Talking to nginx directly, the client just sees an empty reply. But with Cloudflare in front, Cloudflare is the client: it makes the request to your origin, the origin closes the connection with nothing, and Cloudflare interprets “origin returned nothing” as a 520 (Web server returned an unknown error) and serves the visitor its own error page. The block worked perfectly: the request never reached your application. Cloudflare simply has no way to represent “nothing” to the browser.

The thing to know is that those 520s land in your Cloudflare analytics as origin 5xx errors, which can muddy real origin-health monitoring. If you would rather fake bots receive a clean response, change 444 to 403 in fake-bot-block.conf. Cloudflare passes an origin 403 straight through unchanged, so the bot gets a clean Forbidden and your analytics stay clean. Behind a proxy, 444 was never truly silent anyway.
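Since a successful block can show up as an empty reply, a 520 or a 403 depending on the setup, a smoke test has to treat all three as success. A minimal sketch with hypothetical helper names (probe, verdict); it relies on curl printing 000 for %{http_code} when no HTTP response was received, and it cannot tell a fake-bot 403 from a 403 issued for some other reason.

```shell
#!/usr/bin/env bash
# Smoke-test helpers for the fake-bot block. Names are hypothetical.

# probe <url> [user-agent]: prints the HTTP status, or 000 if the connection
# was closed without a response (how a 444 looks from the client side).
probe() {
    local url="$1" ua="${2:-curl-smoke-test}"
    curl --silent --output /dev/null --max-time 10 \
         --user-agent "$ua" --write-out '%{http_code}' "$url" || true
}

# verdict <status>: interprets a probe result for a spoofed-crawler request.
verdict() {
    case "$1" in
        000|403|520) echo blocked ;;      # direct 444, origin 403, or Cloudflare 520
        2??|3??)     echo served ;;
        *)           echo unexpected ;;
    esac
}

# Usage:
#   UA="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
#   verdict "$(probe https://example.com/ "$UA")"    # expect: blocked
#   verdict "$(probe https://example.com/)"          # expect: served
```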

Performance

The per-request overhead is negligible, well under 10 microseconds. The geo lookups are radix-tree matches compiled at config-load time, so they are effectively constant-time regardless of list size. The User-Agent map runs a handful of short regexes once per request (nginx caches the result), and the remaining map and if are simple comparisons. For context, that is on the order of one ten-thousandth of a typical dynamic request, dwarfed by TLS, application processing and network round-trips. The only measurable cost is at config load, where parsing a few hundred extra CIDR lines adds a millisecond or two to nginx -t and to each reload - a cost paid per reload, not per request.

What this does and doesn’t catch

A couple of deliberate boundaries are worth being explicit about.

The Googlebot match is ~*googlebot, which covers Googlebot and its Image, Video and News variants (they share the common-crawler ranges). It deliberately does not match AdsBot-Google, Mediapartners-Google or Google-InspectionTool, because those originate from different ranges - the special-crawlers and user-triggered-fetcher lists - and would generate false positives against this allowlist. If you need to cover them, add more geo blocks fed from those JSON files.

And this verifies origin, not behaviour. Traffic from a genuinely-listed provider IP will always pass, even if that particular IP is misbehaving (Bing has a known issue where listed IPs send junk query strings). That is a job for rate-limiting and a WAF, not for IP verification.

The full project, with the README and install notes, is on GitHub at robwent/nginx-block-fake-googlebot. Pair it with an IP reputation blocklist like the Nginx Bad Bot Blocker and the bots that don’t bother to disguise themselves get caught too.