parallax-agent

Last updated: 2026-04-24T00:06:50Z | Log entries analyzed: 7546 | Model: glm-5.1:cloud

Site Observatory: ai.rud.is

Observation window: 2026-03-10 09:10:30.393 — 2026-04-23 23:47:29.906 (44 days)
Total requests: 7,546 across 2,052 unique IPs
Median response: 1.08 ms | p95: 29.61 ms | Mean: 150.35 ms (some serious tails on that distribution)


Traffic Summary

7,546 requests in 44 days. That’s the pace of a personal blog — not a media property, not a SaaS frontend. The interesting story is who those requests come from.

Legitimate traffic (visitors + owner + RSS) accounts for ~39.8% of requests. Automated traffic of all stripes — AI crawlers, search crawlers, other crawlers, and scanners — makes up the remaining ~60.2%. The signal-to-noise ratio is roughly 2:3 against you. For a personal blog, this is unremarkable. For the internet as a whole, it’s a quiet indictment.

The 404 rate sits at 23.2% (1,751 requests), almost entirely driven by scanners hitting phantom endpoints. The 10.7% 308 redirect rate is Caddy doing its job pushing HTTP→HTTPS or canonicalizing paths.

Resource Consumption

Traffic Class     Bytes Served    % of Bandwidth    Requests
visitor             40,368,793             35.4%       2,256
ai_crawler          30,316,411             26.6%       1,862
owner               17,198,247             15.1%         660
other_crawler       14,090,494             12.3%         802
search_crawler      11,697,774             10.2%         439
scanner                342,768              0.3%       1,433
rss_reader             129,257              0.1%          94

AI crawlers consume 26.6% of outbound bandwidth — second only to real visitors — while delivering zero human readership. Scanners, despite generating 19% of requests, consume 0.3% of bandwidth because nearly everything they hit returns a 404 with a minimal response body. The machines are eating well.

Note: these figures reflect bytes actually transferred on the wire. Conditional 304 responses and cached 308 redirects register 0 bytes, so logical content served is higher than the 114,143,744-byte total suggests.
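The class shares in the table reduce to a group-by over bytes served. A sketch, with invented (class, bytes) tuples standing in for the real log records:

```python
# Sketch: percentage of total bandwidth per traffic class.
from collections import defaultdict

def bandwidth_share(records):
    totals = defaultdict(int)
    for traffic_class, size in records:
        totals[traffic_class] += size
    grand = sum(totals.values())
    return {cls: round(100 * b / grand, 1) for cls, b in totals.items()}

# Invented records; in the real pipeline these come from the access log
records = [("visitor", 700), ("ai_crawler", 200), ("scanner", 50), ("visitor", 50)]
print(bandwidth_share(records))  # {'visitor': 75.0, 'ai_crawler': 20.0, 'scanner': 5.0}
```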


Temporal Patterns

The hourly distribution tells a clean story about who operates when.

Humans read during waking hours. Visitor traffic peaks at hours 10–11 (170–171 requests), hour 16 (156), and hour 18 (191). There’s a secondary bump at hour 23 (135) — the bedtime scroll is real.

Scanners operate in sharp, concentrated bursts. Hour 12 saw 673 scanner requests — that’s a single campaign hitting in one window (the 185.177.72.38 blast on 2026-03-11). Hours 1 and 5 show spikes of 251 and 282 respectively, also campaign-driven. Outside these bursts, scanner traffic is nearly zero. They don’t browse; they raid.

AI crawlers are the most evenly distributed class, with a modest peak at hour 15 (199 requests). They don’t sleep, but they also don’t rush. This is consistent with distributed crawling infrastructure — no single bursty campaign, just steady ingestion around the clock.

Search crawlers show a slight afternoon preference, peaking at hour 15 (82 requests), which aligns with typical crawl scheduling from the major engines.

Day-of-week patterns reinforce this. Scanner traffic concentrates on Wednesday (725 requests — again, that single campaign), Friday (253), and Saturday (310). AI crawlers favor Sunday (417) and are quietest Saturday (156). Visitors are remarkably flat across the week, with Wednesday slightly leading (428). The owner class spikes on Saturday (172) and Monday (190) — weekend tinkering and Monday morning check-ins.
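The hourly histograms above are just timestamps bucketed by hour of day. A sketch, assuming ISO-8601 timestamp strings:

```python
# Sketch: count requests per hour of day from ISO-8601 timestamps.
from collections import Counter
from datetime import datetime

def hourly_histogram(timestamps):
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

hist = hourly_histogram([
    "2026-03-11T12:37:31",
    "2026-03-11T12:38:45",
    "2026-03-10T09:11:48",
])
print(hist.most_common(1))  # [(12, 2)]
```

The same `Counter` pattern with `.date().weekday()` instead of `.hour` yields the day-of-week breakdown.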


Content & Visitors

What People Actually Read

Page                                                             Hits    Unique Visitor IPs    Avg ms
/                                                                 393                   235       5.8
/posts/                                                            76                    43      10.7
/tags/                                                             53                    43     123.4
/posts/observatory/                                                53                    36      31.0
/about/                                                            50                    40       4.7
/search/                                                           37                    32       3.1
/projects/                                                         38                    31       5.6
/posts/2026-04-05-unprompted-orbie/                                28                    25      18.4
/posts/2026-04-04-starlog-and-the-case-of-the-missing-llm-tag/     27                    24      13.1
/posts/2026-04-04-outline-bookmark-ext/                            28                    23      23.4

The homepage dominates, as expected. The /tags/ page is suspiciously slow at 123.4 ms average, roughly 21x the homepage's 5.8 ms. That's worth investigating; tag aggregation on a static site shouldn't be that expensive unless it's dynamically generated or the page is unusually large.

Recent posts from early April are getting read, which is good — the blog is active and people are finding new content. The observatory post itself drew 36 unique visitor IPs, which for this site counts as a hit.

Referrers

Google search sends the most legitimate referral traffic: 11 hits from https://www.google.com/search?q=rud and 9 from https://www.google.com/. Modest but real.

The Baidu mobile referrals are almost certainly referrer spam: http://m.baidu.com/s?wd=sheep708, coffeew6i, zoojpg, eara1r, suit6am, hugeit4. These are garbage keywords with no semantic connection to the site’s content. Classic referrer spam — the kind that shows up in analytics dashboards hoping you’ll click through. Ignore it.
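Filtering this out is straightforward once you know the pattern. A sketch that flags Baidu-mobile referrers whose search term is letters-plus-digits gibberish; the `SPAM_HOSTS` set and the gibberish heuristic are assumptions tuned to the spam seen above, not a general detector:

```python
# Sketch: flag referrer spam of the m.baidu.com/?wd=<gibberish> variety.
from urllib.parse import urlparse, parse_qs

SPAM_HOSTS = {"m.baidu.com"}  # assumption: grow this as new spam shows up

def looks_gibberish(term):
    # the spam keywords above ("sheep708", "coffeew6i") mix letters and digits
    return any(c.isdigit() for c in term) and any(c.isalpha() for c in term)

def is_referrer_spam(referrer):
    parts = urlparse(referrer)
    if parts.hostname not in SPAM_HOSTS:
        return False
    terms = parse_qs(parts.query).get("wd", [])
    return bool(terms) and all(looks_gibberish(t) for t in terms)

print(is_referrer_spam("http://m.baidu.com/s?wd=sheep708"))    # True
print(is_referrer_spam("https://www.google.com/search?q=rud")) # False
```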

HTTP/3 Adoption

Among real visitors, HTTP/3 usage sits at 111 out of 2,256 requests (~4.9%). The owner class is heavily on HTTP/3 (507 of 660 requests — 76.8%), which makes sense; if you run the site, you’re probably using a browser that supports it and have the connection primed.

Scanners are almost entirely on HTTP/1.1 (1,425 of 1,433). Their tooling doesn’t bother with ALPN negotiation. AI crawlers prefer HTTP/2.0 (1,468 of 1,862) — their infrastructure is modern enough for h2 but not h3. Search crawlers similarly favor HTTP/2.0 (319 of 439).

The TLS negotiation data corroborates this: 42.2% of connections have no negotiated protocol (raw HTTP/1.1 without ALPN), 34.1% negotiate h2, 15.5% negotiate http/1.1 via ALPN, and 8.2% negotiate h3.
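The per-class protocol split is another simple group-by. A sketch, assuming records carry Caddy-style `proto` values like "HTTP/1.1", "HTTP/2.0", and "HTTP/3.0" (the record schema here is invented):

```python
# Sketch: HTTP version mix for one traffic class, as percentages.
from collections import Counter

def proto_mix(records, traffic_class):
    protos = Counter(r["proto"] for r in records if r["class"] == traffic_class)
    total = sum(protos.values())
    return {p: round(100 * c / total, 1) for p, c in protos.items()}

# Invented records illustrating the split described above
records = [
    {"class": "scanner", "proto": "HTTP/1.1"},
    {"class": "scanner", "proto": "HTTP/1.1"},
    {"class": "ai_crawler", "proto": "HTTP/2.0"},
    {"class": "owner", "proto": "HTTP/3.0"},
]
print(proto_mix(records, "scanner"))  # {'HTTP/1.1': 100.0}
```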

Browser Families

Chrome leads at 1,091 requests (407 IPs), followed by the catch-all “Other” at 596 (144 IPs — mostly bot UA strings that don’t parse cleanly). Safari at 281 (115 IPs), Firefox at 185 (102 IPs), Edge at 101 (52 IPs), and Opera barely registers at 2. The “Other” bucket being the second-largest family tells you how much non-browser traffic is in the visitor class.

RSS Activity

RSS reader traffic is thin: 94 requests from just 2 unique IPs, all appearing between 2026-04-20 and 2026-04-23. Two subscribers, both polling regularly. A handful of visitor-class hits on RSS endpoints (24 requests, 14 IPs) suggest some people checking feeds via browser. AI crawlers also poke at RSS (13 requests, 13 IPs) — they’re harvesting every content channel they can find.


AI Crawler Activity

AI crawlers account for 1,862 requests (24.7% of all traffic) from 700 unique IPs. For a small personal blog, this is disproportionate.

Bytespider (ByteDance) is the most aggressive: 609 requests from 297 IPs across 89 unique pages. Nearly a third of all AI crawler traffic, from hundreds of distinct addresses. ByteDance operates a distributed crawl infrastructure, and the IP sprawl reflects that. They’re not subtle.

ClaudeBot (Anthropic) comes in second at 499 requests but from only 13 IPs. That’s ~38 requests per IP — concentrated, efficient crawling from a small number of hosts. Anthropic runs a tighter ship.

GPTBot (OpenAI) made 233 requests from 11 IPs but hit 119 unique pages — the broadest page coverage of any AI crawler. They’re thorough, methodically walking the site’s content tree.

Applebot (177 requests, 142 IPs) and Amazonbot (175 requests, 141 IPs) both show the distributed-IP pattern similar to Bytespider, though less aggressively.

Meta Crawler hit 101 requests from 76 IPs across 77 pages — moderate and distributed.

OAI-SearchBot is a curious entry: 57 requests from 13 IPs hitting exactly 1 unique page. OpenAI’s search crawler is fixated on a single resource. Whether that’s a sitemap, a specific post, or an API endpoint would require deeper log inspection, but the single-page focus is unusual.

PerplexityBot barely shows up: 9 requests, 6 IPs, 3 pages. CCBot (Common Crawl) is even more minimal: 2 requests, 1 IP, 2 pages.

The takeaway: ByteDance, Anthropic, and OpenAI are the heavy consumers here. For a personal blog with no ad revenue, no analytics monetization, and no audience scale, 26.6% of your bandwidth going to AI training data harvesters is a pure cost — you pay for the egress, they get the content.
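The concentration numbers quoted above (requests per IP, unique pages) come from grouping hits by bot. A sketch with a hypothetical `(bot, ip, path)` schema and made-up hits:

```python
# Sketch: per-bot request/IP/page stats, the basis for contrasting
# Bytespider's IP sprawl with ClaudeBot's small host pool.
from collections import defaultdict

def bot_stats(hits):
    by_bot = defaultdict(lambda: {"requests": 0, "ips": set(), "pages": set()})
    for bot, ip, path in hits:
        s = by_bot[bot]
        s["requests"] += 1
        s["ips"].add(ip)
        s["pages"].add(path)
    return {
        bot: {
            "requests": s["requests"],
            "unique_ips": len(s["ips"]),
            "unique_pages": len(s["pages"]),
            "req_per_ip": round(s["requests"] / len(s["ips"]), 1),
        }
        for bot, s in by_bot.items()
    }

# Invented hits: one concentrated bot, one distributed bot
hits = [
    ("ClaudeBot", "1.2.3.4", "/"),
    ("ClaudeBot", "1.2.3.4", "/posts/"),
    ("Bytespider", "5.6.7.8", "/"),
    ("Bytespider", "9.9.9.9", "/"),
]
print(bot_stats(hits)["ClaudeBot"])
```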


Search Engine Crawlers

439 requests from 277 IPs. The distribution is… not what you’d expect for a Western-facing blog.

Baiduspider dominates: 237 requests from 171 IPs across 96 pages, last seen 2026-04-23 18:05:19.661. Baidu is indexing this site more aggressively than any other search engine. Whether that translates to actual Chinese-language traffic is another question.

Bingbot comes in second: 160 requests, 95 IPs, 43 pages, last seen 2026-04-23 23:18:47.633. Active and recent.

Googlebot is surprisingly sparse: 35 requests from 4 IPs across 9 pages, last seen 2026-04-22 13:28:11.011. For the dominant search engine in the West, Google is barely touching this site. The 4 IPs are the persistent 66.249.74.69, 66.249.74.70, and 66.249.74.71 (which appear in IP persistence data across 5–7 days) plus one more. Google knows about the site but isn’t crawling it deeply. If search traffic matters, this is worth investigating — check Google Search Console for indexing status.

YandexBot: 7 requests, 7 IPs, 2 pages. Token presence.


Scanner & Recon Activity

1,433 requests from 91 IPs. Most of it is blunt-force path enumeration — untargeted, automated, and largely fruitless.

The Heavy Hitters

185.177.72.38 fired 662 requests in 74 seconds (2026-03-11 12:37:31.928 to 12:38:45.218), hitting 662 unique URIs with 661 returning 404. Using curl/8.7.1. This is pure brute-force path enumeration — no stealth, no sophistication, just a wordlist and a for loop. It accounted for most of the 2026-03-11 scanner spike (686 total scanner requests that day).

172.94.9.253 is the persistent one: 529 requests over two weeks (2026-03-28 05:11:50.19 to 2026-04-11 20:34:19.074), 470 of which returned 404, across 30 unique URIs. Spoofing Firefox 124.0 on Windows. This is slower, more deliberate recon — probably an automated vulnerability scanner with rate-limiting to avoid detection. It didn’t work; it just took longer to find nothing.

134.209.25.199 hit 38 requests in 37 seconds (2026-03-10 09:11:48.416 to 09:12:25.246) with the UA Mozilla/5.0 (l9scan/2.0.731313e21353e23393e2237313; +https://leakix.net). This is LeakIX’s scanner — an internet-wide exposure scanning project. The encoded string in the UA is likely a campaign identifier. Professional, targeted, and honest about its identity.

104.244.74.39 made 24 requests using Python/3.13 aiohttp/3.11.18 — a custom script, no UA spoofing. Hit 4 unique URIs, all 404. Quick and dirty.

195.178.110.102 fired 17 requests in under a second (2026-04-20 19:22:52.346 to 19:22:53.33), all 404, 17 unique URIs. Spoofing Chrome 120. Fast scan, zero subtlety.

2a14:7c1:400::1 is the most interesting: 7 requests over a month (2026-03-20 to 2026-04-23), hitting 1 unique URI, all 404. This IPv6 address has been persistently probing the same endpoint across multiple days. It’s also classified as a scanner in the IP persistence table (6 days seen). Someone — or something — is specifically interested in one path on this server.
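Separating burst scanners like 185.177.72.38 from slow, legitimate crawlers comes down to request rate plus 404 ratio. A sketch with illustrative thresholds and synthetic records; this is not the pipeline's actual classification logic:

```python
# Sketch: flag IPs that fire many requests fast, almost all 404.
from collections import defaultdict

def burst_scanners(records, min_requests=100, min_rate=1.0, min_404=0.9):
    by_ip = defaultdict(list)
    for ip, epoch_seconds, status in records:
        by_ip[ip].append((epoch_seconds, status))
    flagged = []
    for ip, events in by_ip.items():
        n = len(events)
        if n < min_requests:
            continue
        times = [t for t, _ in events]
        span = max(times) - min(times) or 1  # avoid divide-by-zero
        rate_404 = sum(s == 404 for _, s in events) / n
        if n / span >= min_rate and rate_404 >= min_404:
            flagged.append(ip)
    return flagged

# Synthetic data: 120 requests in ~60 s, all 404, vs. a slow 200-only crawler
records = [("185.177.72.38", i // 2, 404) for i in range(120)]
records += [("66.249.74.69", i * 3600, 200) for i in range(120)]
print(burst_scanners(records))  # ['185.177.72.38']
```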

What They’re Looking For

The recon URI list reads like a checklist of common misconfigurations.

The /favicon.ico hits (41 from 41 IPs) are normal browser behavior, not recon.


IP Persistence

IP persistence is an imperfect proxy for repeat behavior — IPv6 privacy extensions rotate addresses, NATs collapse individuals, and CDN egress IPs aggregate users. Treat these as “this address showed up on multiple days,” not “this person came back.”

The most persistent IPs are Googlebot’s: 66.249.74.69 (7 days, 26 requests), 66.249.74.70 (6 days, 23 requests), and 66.249.74.71 (5 days, 17 requests). All three are Google’s canonical crawl addresses, active from 2026-04-06 through 2026-04-19.

2a14:7c1:400::1 appeared across 6 days with 6 total requests — the persistent single-target scanner noted above. One URI, one agenda, over a month.

178.22.106.230 showed up on 5 days with 122 requests (2026-03-10 to 2026-04-07). High volume, long persistence — likely a crawler or automated tool rather than a human.

73.61.103.2 appeared on 5 days with 30 requests (2026-04-04 to 2026-04-19). Could be a genuine reader; the volume is human-scale.

A cluster of Google Cloud IPs (34.55.179.213, 34.55.208.73, 34.70.54.47, 34.16.67.187, 34.9.161.79, 35.238.21.122, 35.223.96.82, 34.41.200.76, 34.152.6.29) each appear on 3–4 days, all with first-seen dates clustered around 2026-04-12. This looks like a single distributed crawl job running across multiple compute instances — likely an AI crawler or research project using Google Cloud egress.
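Days-seen per IP is the whole computation behind these persistence figures. A sketch, assuming `(ip, iso_timestamp)` records:

```python
# Sketch: count distinct calendar days on which each IP appeared.
from collections import defaultdict
from datetime import datetime

def days_seen(records):
    days = defaultdict(set)
    for ip, ts in records:
        days[ip].add(datetime.fromisoformat(ts).date())
    return {ip: len(d) for ip, d in days.items()}

records = [
    ("66.249.74.69", "2026-04-06T10:00:00"),
    ("66.249.74.69", "2026-04-06T18:00:00"),  # same day, counts once
    ("66.249.74.69", "2026-04-07T09:00:00"),
]
print(days_seen(records))  # {'66.249.74.69': 2}
```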


Observations

1. Google is ghosting you. 35 requests from 4 IPs in 44 days. Baiduspider sent 237 requests from 171 IPs. For a blog that presumably targets English-speaking readers, having Baidu index you 7x harder than Google is a structural problem, not a traffic problem. Check Google Search Console. If the site isn’t indexed well, the fix is probably sitemap submission and patience — but the asymmetry is stark and worth understanding.

2. AI crawlers are your second-largest bandwidth consumer, and they don’t pay your hosting bill. 30.3 MB of egress for AI training data, versus 40.4 MB for actual humans. Bytespider alone accounts for more requests than all search crawlers combined. If you’re on Hetzner with metered egress or bandwidth limits, this is a direct cost. Consider robots.txt restrictions or Caddy rate-limiting for known AI crawler user agents — not for ideological reasons, but because you’re subsidizing their training data pipeline.
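A hypothetical robots.txt fragment targeting the three heaviest AI crawlers named above. Compliance is voluntary; Bytespider in particular has been widely reported to ignore robots.txt, so enforcement for it would have to happen at the Caddy layer.

```
# Hypothetical robots.txt fragment: one group disallowing the
# heaviest AI crawlers seen in this log window.
User-agent: Bytespider
User-agent: ClaudeBot
User-agent: GPTBot
Disallow: /
```

Per the robots.txt convention, consecutive User-agent lines form one group, so the single Disallow applies to all three.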

3. The scanner traffic is noise, but the persistent scanner is signal. 99% of scanner hits are drive-by path enumeration from botnets and vulnerability scanners that will never return. They cost you nearly nothing in bandwidth (0.3% of total). But 2a14:7c1:400::1 — hitting the same URI once every few days for a month — is a different pattern. That’s targeted reconnaissance, not spray-and-pray. Worth watching, even if the target URI returns 404.


Generated by observatory.sh — Caddy logs → DuckDB → Ollama → Astro


