Making ai.rud.is Legible To Machines

The blog already had robots.txt, an /llms.txt endpoint, RSS, and per-post .md renditions for LLM retrieval. But a conversation with a couple of agents about 2026 best practices for site discoverability turned into a proper audit, and the audit turned into a co-working session with some additional agents. Here’s what got added and why.

The New Root-Level Files

ai.txt

The ai.txt specification defines a structured plain-text file that declares how AI systems should interact with your content. It’s the behavioral complement to llms.txt (which is about identity and content). Where llms.txt says “here’s what’s on my site,” ai.txt says “here’s what you can and can’t do with it.”

The file lives at /ai.txt and declares permissions (summarize, quote with attribution, include in search results, use for inference-time retrieval), restrictions (no fabricated attribution, no full reproduction, no training without permission), and attribution preferences (cite as hrbrmstr / ai.rud.is, include post title and date, link to the original).

The [training] section draws an explicit line: inference-time retrieval and citation with attribution are permitted, but content is not licensed for model training. This is advisory, not enforceable, but it establishes clear intent.
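To make the sectioned layout concrete, here is a rough sketch of rendering such a file from data. Only the [training] section name comes from the description above; the other section names, every key, and the `key: value` line syntax are placeholders I made up for illustration, so check the actual ai.txt specification before copying any of it.

```typescript
// Hypothetical layout: only the [training] section name is grounded in the
// post. All other section names, keys, and values are illustrative.
interface AiTxtSection {
  name: string;
  entries: Array<[string, string]>;
}

// Render "[section]" headers followed by "key: value" lines,
// with a blank line between sections and a trailing newline.
function renderAiTxt(sections: AiTxtSection[]): string {
  const blocks = sections.map((section) => {
    const lines = section.entries.map(([key, value]) => `${key}: ${value}`);
    return [`[${section.name}]`].concat(lines).join("\n");
  });
  return blocks.join("\n\n") + "\n";
}

const aiTxt = renderAiTxt([
  { name: "permissions", entries: [["summarize", "yes"], ["quote-with-attribution", "yes"]] },
  { name: "restrictions", entries: [["full-reproduction", "no"]] },
  { name: "training", entries: [["model-training", "not-licensed"]] },
  { name: "attribution", entries: [["cite-as", "hrbrmstr / ai.rud.is"]] },
]);
```

Keeping the sections as an ordered array (rather than an object) makes the output order explicit, which matters for a file people will read by eye.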

llm.txt, llms.html, ai.json, identity.json

The ai-visibility.org.uk specifications define a broader set of AI Discovery Files beyond llms.txt. I added four more that can be auto-generated from existing site data:

/llm.txt — A compatibility copy of llms.txt. Some AI systems request the singular filename variant; this ensures they find the same content.
/llms.html — An HTML presentation of the llms.txt content with Schema.org Organization structured data. Includes a <meta name="robots" content="noindex"> to avoid duplicate content issues and a <link rel="canonical"> pointing to llms.txt.
/ai.json — Machine-parseable permissions and restrictions. Declares what AI systems can do (summarise, quote with attribution, answer questions) and can’t do (fabricate quotes, imply endorsement, reproduce full articles). Includes attribution preferences and a link to the published JSON Schema.
/identity.json — Structured identity data aligned with Schema.org. Provides the canonical name, URL, description, and sameAs links to Mastodon, Bluesky, GitHub, and Sourcehut.

All four are generated at build time by the existing llmsTxt.ts Astro integration, alongside llms.txt and per-post .md files. No manual updates needed when content changes.
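As a sketch of what "generated from existing site data" can look like, the function below builds an identity document from a site config. The field names follow Schema.org's Person type, which is my assumption; the real /identity.json follows the ai-visibility.org.uk schema and may carry different or additional keys. `SiteIdentity` and `buildIdentity` are hypothetical names, and the description string is a placeholder.

```typescript
// Hypothetical config shape; the real SITE object has more fields.
interface SiteIdentity {
  name: string;
  website: string;
  description: string;
  sameAs: readonly string[];
}

// Build a Schema.org-style identity object (assumed shape, not the
// actual ai-visibility.org.uk schema).
function buildIdentity(site: SiteIdentity) {
  return {
    "@context": "https://schema.org",
    "@type": "Person",
    name: site.name,
    url: site.website,
    description: site.description,
    sameAs: site.sameAs.slice(), // copy so callers can't mutate the config
  };
}

const identityJson = JSON.stringify(
  buildIdentity({
    name: "hrbrmstr",
    website: "https://ai.rud.is/",
    description: "Placeholder description",
    sameAs: ["https://mastodon.social/@hrbrmstr", "https://github.com/hrbrmstr"],
  }),
  null,
  2,
);
```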

security.txt

RFC 9116 defines a machine-parseable file for vulnerability disclosure contact info. For a security practitioner’s blog, this is table stakes. The file lives at /.well-known/security.txt with Contact, Expires (one year out), Preferred-Languages, and Canonical fields.

NOTE: Caddy needs to serve it as text/plain; charset=utf-8.
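Because Expires has to be refreshed, it helps to generate the file at build time rather than hand-edit it. A minimal sketch, assuming a hypothetical `renderSecurityTxt` helper and a placeholder contact address (the real one isn't shown here):

```typescript
interface SecurityTxtOptions {
  contact: string; // a mailto: or https: URI
  canonical: string;
  preferredLanguages: string[];
  now?: Date; // injectable for testing; defaults to the current time
}

// Render the four RFC 9116 fields mentioned above, with Expires
// set one year after `now`.
function renderSecurityTxt(opts: SecurityTxtOptions): string {
  const expires = new Date((opts.now ?? new Date()).getTime());
  expires.setUTCFullYear(expires.getUTCFullYear() + 1);
  return [
    `Contact: ${opts.contact}`,
    `Expires: ${expires.toISOString()}`,
    `Preferred-Languages: ${opts.preferredLanguages.join(", ")}`,
    `Canonical: ${opts.canonical}`,
  ].join("\n") + "\n";
}

const securityTxt = renderSecurityTxt({
  contact: "mailto:security@example.com", // placeholder address
  canonical: "https://ai.rud.is/.well-known/security.txt",
  preferredLanguages: ["en"],
  now: new Date("2026-01-15T00:00:00Z"),
});
```

Regenerating on every build means the expiry date rolls forward as long as the site keeps getting deployed.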

WebFinger

WebFinger lets someone look up @hrbrmstr@ai.rud.is in a Fediverse client and land on the actual Mastodon profile at mastodon.social. The domain becomes an identity alias.

For a static site, WebFinger is a static JSON file at /.well-known/webfinger.json with a Caddy rewrite that routes /.well-known/webfinger requests to it (ignoring the query parameter, since there’s only one identity on the domain):

handle /.well-known/webfinger {
  header Content-Type "application/jrd+json"
  rewrite * /.well-known/webfinger.json
  file_server
}

The JSON response contains subject, aliases, and links pointing to the Mastodon profile. Combined with the <a rel="me"> tag already in the HTML <head> and the sameAs array in the JSON-LD, this closes the identity verification loop across the domain, the Fediverse, and structured data.
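For reference, the static file can be produced by something like the sketch below. The `subject`, `aliases`, and `links` keys come from the description above; the specific URL patterns and link relations mirror what Mastodon instances typically return and are assumptions here, not a copy of the actual file. `buildWebfinger` is a hypothetical helper.

```typescript
interface JrdLink {
  rel: string;
  type?: string;
  href?: string;
}

interface Jrd {
  subject: string;
  aliases: string[];
  links: JrdLink[];
}

// Build a JRD that points an alias domain at a Mastodon account.
// URL patterns (/@user and /users/user) are assumed Mastodon conventions.
function buildWebfinger(user: string, instance: string): Jrd {
  const profile = `https://${instance}/@${user}`;
  const actor = `https://${instance}/users/${user}`;
  return {
    subject: `acct:${user}@${instance}`,
    aliases: [profile, actor],
    links: [
      { rel: "http://webfinger.net/rel/profile-page", type: "text/html", href: profile },
      { rel: "self", type: "application/activity+json", href: actor },
    ],
  };
}

const webfingerJson = JSON.stringify(buildWebfinger("hrbrmstr", "mastodon.social"), null, 2);
```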

JSON Feed

The site already had RSS at /rss.xml. JSON Feed 1.1 is the modern alternative, and since the data pipeline already existed in the RSS endpoint, adding /feed.json was nearly free.

The implementation reuses the same getCollection("blog") query, getSortedPosts filter/sort, and getPath URL generation. The only new code is the JSON Feed 1.1 shape:

const feed = {
  version: "https://jsonfeed.org/version/1.1",
  title: SITE.title,
  home_page_url: SITE.website,
  feed_url: new URL("/feed.json", SITE.website).href,
  description: SITE.desc,
  items: sortedPosts.map(({ data, id, filePath }) => {
    const url = new URL(getPath(id, filePath), SITE.website).href;
    return {
      id: url,
      url,
      title: data.title,
      summary: data.description,
      date_published: new Date(data.pubDatetime).toISOString(),
      authors: [{ name: data.author }],
    };
  }),
};

Auto-discovery is handled by a <link rel="alternate" type="application/feed+json"> tag in the <head>, alongside the existing RSS link.
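A quick sanity check on the generated feed is cheap to add to a build. The sketch below only verifies the top-level version, title, and items fields plus each item's id; it is a hypothetical `checkFeed` helper, not a full JSON Feed validator, and the example URL is a placeholder.

```typescript
// Return a list of human-readable problems; an empty list means the
// handful of fields checked here look sane.
function checkFeed(feed: unknown): string[] {
  if (typeof feed !== "object" || feed === null) {
    return ["feed is not an object"];
  }
  const f = feed as Record<string, unknown>;
  const problems: string[] = [];
  if (f["version"] !== "https://jsonfeed.org/version/1.1") {
    problems.push("version is not the JSON Feed 1.1 URL");
  }
  if (typeof f["title"] !== "string" || f["title"] === "") {
    problems.push("title is missing");
  }
  const items = f["items"];
  if (!Array.isArray(items)) {
    problems.push("items is not an array");
    return problems;
  }
  items.forEach((item: unknown, index: number) => {
    const id = (item as Record<string, unknown> | null)?.["id"];
    if (typeof id !== "string" || id === "") {
      problems.push(`items[${index}] has no id`);
    }
  });
  return problems;
}

const feedProblems = checkFeed({
  version: "https://jsonfeed.org/version/1.1",
  title: "Example",
  items: [{ id: "https://example.com/posts/one/", url: "https://example.com/posts/one/" }],
});
```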

Enriched JSON-LD

The site already had a BlogPosting JSON-LD block in the <head>, but it was minimal: @type, headline, image, datePublished, dateModified, and author with name/url.

The inline JSON-LD construction was extracted into a JsonLd.astro component and enriched on blog post pages with:

description from the post frontmatter
url (the canonical post URL)
inLanguage from SITE.lang
publisher (same Person as author — it’s a personal blog)
mainEntityOfPage referencing the WebPage
keywords joined from the post’s tag array
sameAs on the author object, linking Mastodon, Bluesky, GitHub, and Sourcehut

Non-post pages (home, projects, about) continue rendering the base schema without enrichment.

The sameAs URLs live in the SITE config object rather than being hardcoded in the component, so they’re easy to update when profiles change:

export const SITE = {
  // ...
  sameAs: [
    "https://mastodon.social/@hrbrmstr",
    "https://bsky.app/profile/hrbrmstr.bsky.social",
    "https://github.com/hrbrmstr",
    "https://sr.ht/~hrbrmstr",
  ],
} as const;
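Putting the enrichment list together, the BlogPosting object might be assembled along these lines. This is a sketch, not the contents of JsonLd.astro: `PostMeta`, `SiteMeta`, and `buildBlogPostingJsonLd` are hypothetical names, the image field is omitted, and joining keywords with a comma and space is my assumption about the separator.

```typescript
interface PostMeta {
  title: string;
  description: string;
  url: string; // canonical post URL
  pubDatetime: Date;
  modDatetime?: Date;
  tags: string[];
}

interface SiteMeta {
  author: string;
  website: string;
  lang: string;
  sameAs: readonly string[];
}

function buildBlogPostingJsonLd(post: PostMeta, site: SiteMeta) {
  // Author and publisher are the same Person on a personal blog.
  const person = {
    "@type": "Person",
    name: site.author,
    url: site.website,
    sameAs: site.sameAs.slice(),
  };
  return {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: post.title,
    description: post.description,
    url: post.url,
    inLanguage: site.lang,
    datePublished: post.pubDatetime.toISOString(),
    // Only emit dateModified when the post has actually been modified.
    ...(post.modDatetime ? { dateModified: post.modDatetime.toISOString() } : {}),
    author: person,
    publisher: person,
    mainEntityOfPage: { "@type": "WebPage", "@id": post.url },
    keywords: post.tags.join(", "),
  };
}
```

The result would then be serialized with JSON.stringify into a script tag of type application/ld+json.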

Open Graph Fixes

A site audit flagged several missing Open Graph meta tags. The fixes:

og:type now renders as article on blog post pages and website on everything else. Without this, the default is website, which means the article:published_time and article:modified_time tags that were already present were technically orphaned.

og:site_name (ai.rud.is), og:locale (en_US), article:author, and per-tag article:tag meta tags were added. A <link rel="author" href="/about"> establishes authorship at the HTML level.

The tags prop was threaded from PostDetails.astro through Layout.astro to both the JsonLd component and the article:tag meta tags. On non-post pages, tags is undefined and nothing renders.
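The branching described above is easier to see as data. In this sketch, `PageMeta`, its `isPost` flag, and `openGraphTags` are hypothetical; the real layout decides "is this a post" from its own props, and emitting article:author only on posts is my assumption.

```typescript
interface MetaTag {
  property: string;
  content: string;
}

interface PageMeta {
  isPost: boolean; // hypothetical flag; the real layout derives this from props
  siteName: string;
  locale: string;
  author: string;
  tags?: string[]; // undefined on non-post pages
}

// Produce the Open Graph properties for a page as plain data, leaving
// the actual <meta> rendering (and escaping) to the template.
function openGraphTags(page: PageMeta): MetaTag[] {
  const tags: MetaTag[] = [
    { property: "og:type", content: page.isPost ? "article" : "website" },
    { property: "og:site_name", content: page.siteName },
    { property: "og:locale", content: page.locale },
  ];
  if (page.isPost) {
    tags.push({ property: "article:author", content: page.author });
    for (const tag of page.tags ?? []) {
      tags.push({ property: "article:tag", content: tag });
    }
  }
  return tags;
}
```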

Markdown Link in Post Metadata

Each post’s metadata line (the calendar icon + author + date row) now includes a link to the .md rendition of the post. It shows up as a document icon followed by “MD”, separated from the date by a middot. The Datetime.astro component gained an optional markdownUrl prop that only PostDetails.astro passes — card previews on the home page are unaffected.

Caddy Configs for Scanner Entertainment

The recon traffic hitting this blog is what you’d expect: .env credential hunting, probes for a leaked .git/config, /admin and /wp-login.php probing, and /_next/data requests from someone who thinks this is a Next.js app.

A few Caddy handle blocks for this:

Fake .env response that wastes automated pipeline time:

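# Match dotenv and Git metadata probes (/.git/* already covers /.git/config)
@env_hunters {
  path /.env /.env.* /.git/*
}
# Answer with obviously fake credentials and a 200 so scanners keep parsing
handle @env_hunters {
  respond "DB_HOST=localhost
DB_USER=admin
DB_PASS=hunter2
AWS_ACCESS_KEY=AKIA3F7M9B2X4P8N1R6Q
AWS_SECRET_KEY=please_stop_scanning_my_blog
" 200
}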

Separate recon log for feeding into DuckDB analysis:

@recon_noise {
  path /.env /.env.* /.git/* /api /admin /login /signup /register
  path /dashboard /wp-admin /wp-login.php /wp-content/* /xmlrpc.php
  path /_next/* /actuator/* /solr/* /console /phpmyadmin/*
}
handle @recon_noise {
  log {
    output file /var/log/caddy/recon.log
    format json
  }
  respond 204
}

The recon log is the genuinely useful part — pipe it into DuckDB, correlate with JA4 fingerprints, and the data becomes blog content that writes itself.
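DuckDB is the right tool for the real analysis, but a first look at the log takes only a few lines. This sketch counts the most-probed paths from newline-delimited JSON; it assumes each entry carries the path under `request.uri`, as Caddy's default JSON access log does, and `topPaths` is a hypothetical helper.

```typescript
interface CaddyLogEntry {
  request?: { uri?: string; remote_ip?: string };
  status?: number;
}

// Count requests per path (query string stripped) and return the
// `limit` most frequent as [path, count] pairs, highest count first.
function topPaths(ndjson: string, limit = 10): Array<[string, number]> {
  const counts: Record<string, number> = {};
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    let entry: CaddyLogEntry;
    try {
      entry = JSON.parse(line) as CaddyLogEntry;
    } catch {
      continue; // skip truncated or non-JSON lines
    }
    const uri = entry.request?.uri;
    if (!uri) continue;
    const path = uri.split("?")[0] ?? uri;
    counts[path] = (counts[path] ?? 0) + 1;
  }
  return Object.keys(counts)
    .map((path): [string, number] => [path, counts[path] ?? 0])
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```

Feeding it the contents of recon.log gives a ranked list of what scanners ask for most, which is usually enough to decide which paths deserve their own handler.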

The Full Stack

After all the changes, the discoverability surface looks like this:

File	Purpose
`/robots.txt`	Crawl access control (unchanged)
`/llms.txt`	Curated content index for LLM retrieval (unchanged)
`/llm.txt`	Compatibility copy of llms.txt (new)
`/llms.html`	HTML presentation with Schema.org (new)
`/ai.json`	Machine-parseable permissions/restrictions (new)
`/identity.json`	Structured identity data (new)
`/ai.txt`	AI usage permissions and restrictions (new)
`/.well-known/security.txt`	Vulnerability disclosure contact (new)
`/.well-known/webfinger`	Fediverse identity alias (new)
`/rss.xml`	RSS feed (unchanged)
`/feed.json`	JSON Feed 1.1 (new)
`/sitemap-index.xml`	Sitemap (unchanged)
Per-post `.md` files	Markdown renditions for LLM retrieval (unchanged)

And in the HTML <head> of each blog post:

Tag	Purpose
`og:type=article`	Correct OG type for blog posts (fixed)
`og:site_name`, `og:locale`	Site identity (new)
`article:author`, `article:tag`	Article metadata (new)
JSON-LD `BlogPosting`	Enriched structured data with `sameAs`, `keywords`, `publisher`, `mainEntityOfPage` (enhanced)
`<link rel="alternate" type="application/feed+json">`	JSON Feed auto-discovery (new)
`<link rel="author" href="/about">`	Author link relation (new)

None of this is flashy. It’s plumbing: keeping the machines that read your site pointed at the same consistent identity chain, from your domain to your Fediverse handle to your structured data. The pieces are small and mostly boring to wire up. But a site that’s legible to crawlers, citation systems, and verification tools is just more useful than one that isn’t, and it doesn’t take much to get there.