
You Optimized for SEO. AI Search Engines Still Ignore You.
You have llms.txt. You have structured data. You even allowed GPTBot in your robots.txt. But when someone asks ChatGPT or Perplexity about your product, your site doesn't show up. The answer comes from some blog that paraphrased your content three months ago.
The problem is not visibility. The problem is that AI search engines are scraping your site the same way they scrape everyone else's. You are one of billions of pages in a parse queue. There is no signal that says "this site has a direct knowledge API you can query instead of guessing from HTML."
We built four features on pixelmojo.io that changed this. Together, they form what we call the AI Discoverability Stack: a set of machine-readable declarations and APIs that turn your site from a scraped source into a primary citation target.
This is not theory. We implemented it, tested it, and the responses came back with our URLs as sources. This is Part 6 of The AI Search Playbook, our series on moving from traditional SEO to AI-native discoverability.
The Core Problem: Scraped vs. Primary Source
Most websites have a passive relationship with AI search engines. A bot crawls your page, parses the HTML, extracts what it can, and moves on. If your content is clear enough, it might get paraphrased into an AI-generated answer. If you are lucky, there is a link back to you.
This is being scraped. You have no control over how your content is represented, no guarantee of attribution, and no way to correct misunderstandings.
Being a primary source is fundamentally different. The AI system discovers that you have a knowledge API, queries it with a structured request, and gets back a structured response with your answer, your source URLs, and your entity context. The citation is built into the response format.
Scraped vs. Primary Source

| Scraped | Primary Source |
| --- | --- |
| Parses raw HTML | Structured API response |
| AI paraphrases (or hallucinates) | You control the answer |
| Maybe a link, maybe not | Source URLs in every response |
| Depends on crawl luck | Declared in ai-plugin.json |
| Waits for next crawl | Real-time from knowledge graph |
The difference is not about content quality. It is about infrastructure. A site with a knowledge API, declared capabilities, and connected structured data tells AI systems: "You can query me directly. Here is exactly how." We covered the foundations in our knowledge graph implementation guide. The Discoverability Stack builds on that foundation.
Feature 1: Connected JSON-LD with @id References
Most sites have JSON-LD that works but doesn't connect. Your product page has SoftwareApplication schema. Your about page has Organization schema. Your blog posts have Article schema. But they don't reference each other.
The Problem with Isolated Schemas
When each page's structured data exists in isolation, AI systems treat them as separate data points. They cannot infer that Vector is a product made by Pixelmojo, that Hive is built on Vector, or that your blog posts are written by your organization's founder.
The Fix: @id URIs
Adding @id to your schemas creates anchors that other schemas can reference:
```json
{
  "@type": "SoftwareApplication",
  "@id": "https://www.pixelmojo.io/#vector",
  "url": "https://www.pixelmojo.io/vector",
  "name": "Vector by Pixelmojo",
  "provider": {
    "@type": "Organization",
    "@id": "https://www.pixelmojo.io/#organization"
  }
}
```
Now when your Hive product page references Vector, it uses the @id instead of just a URL:
```json
{
  "isBasedOn": {
    "@type": "SoftwareApplication",
    "@id": "https://www.pixelmojo.io/#vector"
  }
}
```
AI systems can now traverse your entire entity graph. Product references organization. Product B references Product A. Blog posts reference both. Everything connects.
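To make the traversal concrete, here is a minimal sketch of how a consumer could resolve @id references across your pages' JSON-LD blocks into one entity graph. The node shape is illustrative, not a full JSON-LD processor:

```typescript
// Illustrative node shape: any JSON-LD object that may carry an @id anchor.
type JsonLdNode = { '@id'?: string; [key: string]: unknown }

// Collect every node that declares an @id into a lookup table.
function buildEntityGraph(blocks: JsonLdNode[]): Map<string, JsonLdNode> {
  const graph = new Map<string, JsonLdNode>()
  for (const block of blocks) {
    if (block['@id']) graph.set(block['@id'], block)
  }
  return graph
}

// Follow a reference like { "@id": "https://www.pixelmojo.io/#vector" }
// back to the full node, if some page declared it.
function resolve(
  graph: Map<string, JsonLdNode>,
  ref: JsonLdNode
): JsonLdNode | undefined {
  return ref['@id'] ? graph.get(ref['@id']) : undefined
}
```

A crawler that sees `isBasedOn: { "@id": "…#vector" }` on the Hive page can call `resolve` and land on the full SoftwareApplication node, provider and all.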
BreadcrumbList for Hierarchy
We also added BreadcrumbList schema to product pages:
```json
{
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Home",
      "item": "https://www.pixelmojo.io"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "Products",
      "item": "https://www.pixelmojo.io/products"
    },
    {
      "@type": "ListItem",
      "position": 3,
      "name": "Vector",
      "item": "https://www.pixelmojo.io/vector"
    }
  ]
}
```
This tells both Google and AI systems exactly where each page sits in your site architecture. For a deeper dive on how BreadcrumbList and entity linking work together, see our knowledge graph and LLM visibility guide.
Feature 2: ai-plugin.json
AI systems currently have to discover your capabilities by accident. They might find your llms.txt if they know to check for it. They might find your API if they stumble across documentation. There is no standard way to declare "here is what I offer machines."
ai-plugin.json is an emerging convention that solves this. It lives at /.well-known/ai-plugin.json (following the RFC 8615 well-known URI standard) and declares your site's AI capabilities in a single machine-readable file.
What It Contains
```json
{
  "schema_version": "v1",
  "name": "Pixelmojo",
  "description": "AI-native product studio. Build revenue-generating AI products in 90 days.",
  "url": "https://www.pixelmojo.io",
  "logo_url": "https://www.pixelmojo.io/pixelmojo-branding.svg",
  "contact_email": "founders@pixelmojo.io",
  "ai_capabilities": {
    "llms_txt": "https://www.pixelmojo.io/llms.txt",
    "llms_full_txt": "https://www.pixelmojo.io/llms-full.txt",
    "ask_api": "https://www.pixelmojo.io/api/ask",
    "policy": "https://www.pixelmojo.io/.well-known/ai-policy.json"
  },
  "api": {
    "type": "openapi",
    "endpoints": [
      {
        "path": "/api/ask",
        "method": "POST",
        "description": "Ask a question about Pixelmojo's products, services, and expertise",
        "rate_limit": "10 requests per minute"
      }
    ]
  }
}
```
One file. An AI agent reads it and immediately knows: where your llms.txt is, that you have a knowledge API at /api/ask, what your rate limits are, and where your usage policy lives.
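As a sketch of the consumer side, here is how an agent might fetch and read this file. The `AiPlugin` shape mirrors the example above; since the convention is still emerging, treat the field names as an assumption rather than a standard:

```typescript
// Subset of the ai-plugin.json fields an agent cares about (assumed shape).
interface AiPlugin {
  name: string
  ai_capabilities?: { llms_txt?: string; ask_api?: string }
}

// Fetch the manifest from the well-known path; null if the site has none.
async function discoverCapabilities(origin: string): Promise<AiPlugin | null> {
  const res = await fetch(`${origin}/.well-known/ai-plugin.json`)
  if (!res.ok) return null
  return (await res.json()) as AiPlugin
}

// Pull out the knowledge-API endpoint, if one is declared.
function extractAskEndpoint(plugin: AiPlugin): string | null {
  return plugin.ai_capabilities?.ask_api ?? null
}
```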
Implementation
In Next.js, this is a static route handler:
```typescript
// src/app/.well-known/ai-plugin.json/route.ts
import { NextResponse } from 'next/server'

export const dynamic = 'force-static'
export const revalidate = 86400

// The manifest this route serves (abbreviated here; full example above).
const aiPlugin = {
  schema_version: 'v1',
  name: 'Pixelmojo',
  url: 'https://www.pixelmojo.io',
  // ...ai_capabilities and api declarations
}

export function GET() {
  return NextResponse.json(aiPlugin, {
    headers: { 'Cache-Control': 'public, max-age=86400' },
  })
}
```
Static generation, 24-hour cache. Zero runtime cost.
Feature 3: The Knowledge API (/api/ask)
This is the centerpiece of the stack. Instead of making AI systems parse your HTML, you give them a structured endpoint to query directly.
The AI Discovery Chain
1. ai-plugin.json: the AI system discovers your capabilities
2. llms.txt: it reads your structured site overview
3. /api/ask: it queries your knowledge directly
4. Citation: it cites your content with source URLs
How It Works
Input: A POST request with a question.
```bash
curl -X POST https://www.pixelmojo.io/api/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "What is Vector?"}'
```
Context retrieval: The API scores all entities in your knowledge graph and all blog posts by keyword overlap with the question. No vector database, no embeddings. Simple word matching against entity keywords, post titles, descriptions, and tags. Top 3 entities and top 5 posts become the context.
LLM call: The context goes to GPT-4o-mini with strict instructions: answer from context only, never fabricate, return structured JSON.
Output: A structured response with everything an AI system needs to cite you.
/api/ask Response Structure
Every response includes:
- answer: 1-3 sentence answer grounded in your actual content
- sources: Blog post URLs that the answer draws from
- relatedEntities: Products, methodologies, and services connected to the question
- confidence: How well the context matched the question (0 to 1)
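As a TypeScript type, the response shape above looks roughly like this (field names come from the list; the exact types are an assumption):

```typescript
// Assumed shape of a /api/ask response, matching the fields described above.
interface AskResponse {
  answer: string            // 1-3 sentence answer grounded in site content
  sources: string[]         // blog post URLs the answer draws from
  relatedEntities: string[] // connected products, methodologies, services
  confidence: number        // how well the context matched, 0 to 1
}
```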
Rate Limiting
The endpoint uses the same in-memory Map pattern as our other API routes: 10 requests per minute per IP. Enough for legitimate AI agent queries, restrictive enough to prevent abuse.
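A minimal sketch of that pattern, assuming a single-instance deployment (the Map resets on redeploy, which is acceptable here):

```typescript
// In-memory sliding-window rate limiter: Map of IP -> request timestamps.
const WINDOW_MS = 60_000
const MAX_REQUESTS = 10
const hits = new Map<string, number[]>()

function isRateLimited(ip: string, now: number = Date.now()): boolean {
  // Keep only timestamps inside the last minute.
  const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS)
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent)
    return true // over the limit: reject this request
  }
  recent.push(now)
  hits.set(ip, recent)
  return false
}
```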
No Vector Database Required
For sites with under 100 pages of content, keyword matching works well. We tokenize the question into words, match against entity keywords and post metadata, and rank by overlap count. The LLM handles the rest. When your content scales beyond this, you can add embeddings later. Start simple.
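The tokenize-match-rank step can be sketched in a few lines. The `Doc` shape is illustrative; in practice the keywords come from your entity graph and post metadata:

```typescript
// Illustrative document shape: anything with a slug and keyword list.
interface Doc { slug: string; keywords: string[] }

// Split a question into lowercase word tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? [])
}

// Rank documents by how many of their keywords appear in the question,
// dropping zero-overlap documents and keeping the top N.
function rank(question: string, docs: Doc[], topN: number): Doc[] {
  const words = tokenize(question)
  return docs
    .map(doc => ({
      doc,
      score: doc.keywords.filter(k => words.has(k.toLowerCase())).length,
    }))
    .filter(({ score }) => score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ doc }) => doc)
}
```

The top-ranked entities and posts become the LLM context; nothing here needs an embedding model or a vector store.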
Feature 4: Auto-Generated FAQ Schema at Scale
FAQPage structured data is one of the highest-value schema types for AI citations. In our own GEO implementation, FAQ schema was the single highest-ROI change we made. When someone asks Perplexity a question and your page has a FAQPage schema that directly answers it, the citation probability increases significantly.
The problem: manually writing FAQ schema for every blog post takes approximately 15 minutes per post. With 45 posts, that is over 11 hours of work. And every new post needs it too.
The Automation Pipeline
FAQ Schema Automation Pipeline: 45 blog posts parsed → posts with manual FAQPage skipped → 3-5 genuine questions extracted per post → FAQPage schema auto-rendered on every blog page.
We built a CLI script that automates this:
- Read posts from Contentlayer (the build-time CMS that processes our MDX files)
- Skip posts that already have manual FAQPage schema in their frontmatter
- Send content to GPT-4o-mini with instructions to extract 3-5 genuine questions the article answers
- Write results to generated/faqs.json, keyed by post slug
- Blog page template auto-renders FAQPage schema for any post that has generated FAQs but no manual ones
The Script
```typescript
// scripts/generate-faqs.ts
// Usage:
//   npx tsx scripts/generate-faqs.ts           (generate for all missing)
//   npx tsx scripts/generate-faqs.ts --force   (regenerate all)
//   npx tsx scripts/generate-faqs.ts --slug=X  (single post)
```
The script sends a truncated version of each post (first 3,000 characters) to GPT-4o-mini with a strict prompt: "Extract 3-5 genuine questions this article answers. Answers must be 1-2 sentences from the article content. Never fabricate."
Rate limited to 1 request per second to stay within API limits. Results are committed to git alongside the Contentlayer output.
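The core loop looks roughly like this. `generateFaqs` stands in for the GPT-4o-mini call and the post shape for the Contentlayer output; both are assumptions for the sketch:

```typescript
type Faq = { question: string; answer: string }

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// One LLM call per post that lacks a manual FAQ, paced at 1 request/second,
// with each post truncated to its first 3,000 characters.
async function generateAll(
  posts: { slug: string; body: string; hasManualFaq: boolean }[],
  generateFaqs: (excerpt: string) => Promise<Faq[]>
): Promise<Record<string, Faq[]>> {
  const out: Record<string, Faq[]> = {}
  for (const post of posts) {
    if (post.hasManualFaq) continue // manual FAQs always take priority
    out[post.slug] = await generateFaqs(post.body.slice(0, 3000))
    await sleep(1000) // stay within API rate limits
  }
  return out
}
```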
Blog Page Integration
The blog page template imports the generated FAQ JSON and checks two conditions before rendering:
- The post does NOT have manual FAQPage schema in frontmatter
- The post DOES have entries in generated/faqs.json
If both conditions are true, it renders a FAQPage schema from the auto-generated data. Manual FAQs always take priority.
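A sketch of the gate plus the FAQPage render, assuming illustrative post and FAQ shapes (the Question/Answer/acceptedAnswer nesting is standard schema.org FAQPage markup):

```typescript
type Faq = { question: string; answer: string }

// Render generated FAQs only when there is no manual FAQPage schema
// and the generated file actually has entries for this slug.
function shouldRenderGeneratedFaq(
  post: { slug: string; hasManualFaq: boolean },
  generated: Record<string, Faq[]>
): boolean {
  return !post.hasManualFaq && (generated[post.slug]?.length ?? 0) > 0
}

// Build the schema.org FAQPage JSON-LD from question/answer pairs.
function toFaqPageSchema(faqs: Faq[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: faqs.map(f => ({
      '@type': 'Question',
      name: f.question,
      acceptedAnswer: { '@type': 'Answer', text: f.answer },
    })),
  }
}
```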
The Numbers
- 45 total blog posts
- ~15 already have manual FAQPage schema
- ~30 posts gain FAQ schema automatically
- 90-150 new question/answer pairs indexed by search engines
- Run time: under 2 minutes for all 30 posts
The Bot Strategy: Block Training, Allow Browsing
This stack works because of a deliberate robots.txt strategy that many sites get wrong. We covered the full bot segmentation approach in our GEO Playbook, but here is the summary.
Blocked (training bots):
- CCBot, anthropic-ai, cohere-ai, Omgili
These bots scrape content to train foundation models. Your content gets baked into model weights permanently, with no attribution, no link back, and no control over how it is used. Blocking them protects your intellectual property.
Allowed (browsing bots):
- GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, GoogleOther
These bots read your content at query time to generate answers. When they cite you, they link back to your site. This is the pipe that sends traffic and citations. You can verify which bots have access to your site using our free robots.txt Analyzer for AI. The AI Discoverability Stack makes this pipe dramatically more effective by giving these bots structured data instead of raw HTML.
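In robots.txt terms, the split looks like this. The fragment is abbreviated to two bots per group; the full segmentation is in the GEO Playbook:

```txt
# Training bots: content gets baked into model weights, no attribution
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Browsing bots: query-time retrieval, cited with links back
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /
```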
Implementation Checklist
If you want to build this for your own site, here is the order:
1. Connect Your JSON-LD (30 minutes)
Add @id URIs to existing schemas. Add url properties. Make provider reference your global Organization by @id. Add BreadcrumbList to product and service pages.
2. Create ai-plugin.json (15 minutes)
Static route handler at /.well-known/ai-plugin.json. Declare your llms.txt location, any API endpoints, and your AI policy. This is a one-time setup.
3. Build the Knowledge API (2 hours)
POST endpoint with rate limiting. Keyword-based context retrieval from whatever content system you use. LLM call with strict grounding instructions. Structured JSON response with sources and entities.
4. Automate FAQ Schema (1 hour)
CLI script that reads your posts, identifies gaps, generates Q&A pairs, and saves to a JSON file. Template integration that auto-renders FAQPage schema.
Total effort: approximately half a day. The ongoing cost is near zero: the knowledge API uses GPT-4o-mini (fractions of a cent per query), the FAQ generator runs only when you publish new posts, and everything else is static.
Testing Your Stack
Check ai-plugin.json
```bash
curl https://yoursite.com/.well-known/ai-plugin.json
```
Should return valid JSON with your capabilities declared.
Test the Knowledge API
```bash
curl -X POST https://yoursite.com/api/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "What does your company do?"}'
```
Should return a structured response with answer, sources, and entities.
Verify JSON-LD
View page source on your product pages. Look for @id properties and BreadcrumbList schemas.
Audit Overall Visibility
Use our free AI Crawl Checker to test your bot access, structured data, and llms.txt in one scan. The AI Readiness Score combines all signals into a single 0-100 score.
What This Means for Your Site
The AI Discoverability Stack is not about gaming AI search engines. It is about making your site a better, more structured source of information that AI systems can reliably query and cite.
Every feature serves a clear purpose:
- Connected JSON-LD makes your entity relationships machine-readable
- ai-plugin.json declares your capabilities so AI systems don't have to guess
- The Knowledge API gives AI agents a direct line instead of forcing them to parse HTML
- Auto FAQ schema makes every piece of content a potential citation target
The combined effect: your site becomes a primary source that AI search engines query and cite, rather than one of billions of pages they might scrape and paraphrase. If you are earlier in the journey, start with our SEO vs GEO vs AEO guide to understand the landscape, then work through the GEO Playbook for tactical foundations before building this stack.
Want to see how your site scores for AI discoverability?
- AI Crawl Checker: Test your bot access, structured data, and llms.txt
- AI Readiness Score: Get a unified 0-100 score across all AI visibility signals
- Contact Us: We build AI discoverability stacks for companies that want to own their narrative in AI search
