
You Optimized for SEO. AI Search Engines Still Ignore You.
You have llms.txt. You have structured data. You even allowed GPTBot in your robots.txt. But when someone asks ChatGPT or Perplexity about your product, your site doesn't show up. The answer comes from some blog that paraphrased your content three months ago.
The problem is not visibility. The problem is that AI search engines are scraping your site the same way they scrape everyone else's. You are one of billions of pages in a parse queue. There is no signal that says "this site has a direct knowledge API you can query instead of guessing from HTML."
We built four features on pixelmojo.io that changed this. Together, they form what we call the AI Discoverability Stack: a set of machine-readable declarations and APIs that turn your site from a scraped source into a primary citation target.
This is not theory. We implemented it, tested it, and the responses came back with our URLs as sources. This is Part 6 of The AI Search Playbook, our series on moving from traditional SEO to AI-native discoverability.
The Core Problem: Scraped vs. Primary Source
Most websites have a passive relationship with AI search engines. A bot crawls your page, parses the HTML, extracts what it can, and moves on. If your content is clear enough, it might get paraphrased into an AI-generated answer. If you are lucky, there is a link back to you.
This is being scraped. You have no control over how your content is represented, no guarantee of attribution, and no way to correct misunderstandings.
Being a primary source is fundamentally different. The AI system discovers that you have a knowledge API, queries it with a structured request, and gets back a structured response with your answer, your source URLs, and your entity context. The citation is built into the response format.
Scraped vs. Primary Source

| Scraped | Primary Source |
| --- | --- |
| Parses raw HTML | Structured API response |
| AI paraphrases (or hallucinates) | You control the answer |
| Maybe a link, maybe not | Source URLs in every response |
| Depends on crawl luck | Declared in ai-plugin.json |
| Waits for next crawl | Real-time from knowledge graph |
The difference is not about content quality. It is about infrastructure. A site with a knowledge API, declared capabilities, and connected structured data tells AI systems: "You can query me directly. Here is exactly how." We covered the foundations in our knowledge graph implementation guide. The Discoverability Stack builds on that foundation.
Feature 1: Connected JSON-LD with @id References
Most sites have JSON-LD that works but doesn't connect. Your product page has SoftwareApplication schema. Your about page has Organization schema. Your blog posts have Article schema. But they don't reference each other.
The Problem with Isolated Schemas
When each page's structured data exists in isolation, AI systems treat them as separate data points. They cannot infer that Vector is a product made by Pixelmojo, that Hive is built on Vector, or that your blog posts are written by your organization's founder.
The Fix: @id URIs
Adding @id to your schemas creates anchors that other schemas can reference:
```json
{
  "@type": "SoftwareApplication",
  "@id": "https://www.pixelmojo.io/#vector",
  "url": "https://www.pixelmojo.io/vector",
  "name": "Vector by Pixelmojo",
  "provider": {
    "@type": "Organization",
    "@id": "https://www.pixelmojo.io/#organization"
  }
}
```
Now when your Hive product page references Vector, it uses the @id instead of just a URL:
```json
{
  "isBasedOn": {
    "@type": "SoftwareApplication",
    "@id": "https://www.pixelmojo.io/#vector"
  }
}
```
AI systems can now traverse your entire entity graph. Product references organization. Product B references Product A. Blog posts reference both. Everything connects.
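To make the traversal concrete, here is a minimal sketch of how a consumer could resolve @id references across your pages' JSON-LD blocks into one entity graph. The node shape is illustrative, not a full JSON-LD processor:

```typescript
// Illustrative node shape: any JSON-LD object that may carry an @id anchor.
type JsonLdNode = { '@id'?: string; [key: string]: unknown }

// Collect every node that declares an @id into a lookup table.
function buildEntityGraph(blocks: JsonLdNode[]): Map<string, JsonLdNode> {
  const graph = new Map<string, JsonLdNode>()
  for (const block of blocks) {
    if (block['@id']) graph.set(block['@id'], block)
  }
  return graph
}

// Follow a reference like { "@id": "https://www.pixelmojo.io/#vector" }
// back to the full node, if some page declared it.
function resolve(
  graph: Map<string, JsonLdNode>,
  ref: JsonLdNode
): JsonLdNode | undefined {
  return ref['@id'] ? graph.get(ref['@id']) : undefined
}
```

A crawler that sees `isBasedOn: { "@id": "…#vector" }` on the Hive page can call `resolve` and land on the full SoftwareApplication node, provider and all.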
BreadcrumbList for Hierarchy
We also added BreadcrumbList schema to product pages:
```json
{
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Home",
      "item": "https://www.pixelmojo.io"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "Products",
      "item": "https://www.pixelmojo.io/products"
    },
    {
      "@type": "ListItem",
      "position": 3,
      "name": "Vector",
      "item": "https://www.pixelmojo.io/vector"
    }
  ]
}
```
This tells both Google and AI systems exactly where each page sits in your site architecture. For a deeper dive on how BreadcrumbList and entity linking work together, see our knowledge graph and LLM visibility guide.
Feature 2: ai-plugin.json
AI systems currently have to discover your capabilities by accident. They might find your llms.txt if they know to check for it. They might find your API if they stumble across documentation. There is no standard way to declare "here is what I offer machines."
ai-plugin.json is an emerging convention that solves this. It lives at /.well-known/ai-plugin.json (following the RFC 8615 well-known URI standard) and declares your site's AI capabilities in a single machine-readable file.
What It Contains
```json
{
  "schema_version": "v1",
  "name": "Pixelmojo",
  "description": "AI-native product studio. Build revenue-generating AI products in 90 days.",
  "url": "https://www.pixelmojo.io",
  "logo_url": "https://www.pixelmojo.io/pixelmojo-branding.svg",
  "contact_email": "founders@pixelmojo.io",
  "ai_capabilities": {
    "llms_txt": "https://www.pixelmojo.io/llms.txt",
    "llms_full_txt": "https://www.pixelmojo.io/llms-full.txt",
    "ask_api": "https://www.pixelmojo.io/api/ask",
    "policy": "https://www.pixelmojo.io/.well-known/ai-policy.json"
  },
  "api": {
    "type": "openapi",
    "endpoints": [
      {
        "path": "/api/ask",
        "method": "POST",
        "description": "Ask a question about Pixelmojo's products, services, and expertise",
        "rate_limit": "10 requests per minute"
      }
    ]
  }
}
```
One file. An AI agent reads it and immediately knows: where your llms.txt is, that you have a knowledge API at /api/ask, what your rate limits are, and where your usage policy lives.
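As a sketch of the consumer side, here is how an agent might fetch and read this file. The `AiPlugin` shape mirrors the example above; since the convention is still emerging, treat the field names as an assumption rather than a standard:

```typescript
// Subset of the ai-plugin.json fields an agent cares about (assumed shape).
interface AiPlugin {
  name: string
  ai_capabilities?: { llms_txt?: string; ask_api?: string }
}

// Fetch the manifest from the well-known path; null if the site has none.
async function discoverCapabilities(origin: string): Promise<AiPlugin | null> {
  const res = await fetch(`${origin}/.well-known/ai-plugin.json`)
  if (!res.ok) return null
  return (await res.json()) as AiPlugin
}

// Pull out the knowledge-API endpoint, if one is declared.
function extractAskEndpoint(plugin: AiPlugin): string | null {
  return plugin.ai_capabilities?.ask_api ?? null
}
```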
Implementation
In Next.js, this is a static route handler:
```typescript
// src/app/.well-known/ai-plugin.json/route.ts
import { NextResponse } from 'next/server'

export const dynamic = 'force-static'
export const revalidate = 86400

// The manifest this route serves (abbreviated here; full example above).
const aiPlugin = {
  schema_version: 'v1',
  name: 'Pixelmojo',
  url: 'https://www.pixelmojo.io',
  // ...ai_capabilities and api declarations
}

export function GET() {
  return NextResponse.json(aiPlugin, {
    headers: { 'Cache-Control': 'public, max-age=86400' },
  })
}
```
Static generation, 24-hour cache. Zero runtime cost.
Feature 3: The Knowledge API (/api/ask)
This is the centerpiece of the stack. Instead of making AI systems parse your HTML, you give them a structured endpoint to query directly.
The AI Discovery Chain
1. ai-plugin.json: the AI system discovers your capabilities
2. llms.txt: it reads your structured site overview
3. /api/ask: it queries your knowledge directly
4. Citation: it cites your content with source URLs
How It Works
Input: A POST request with a question.
```bash
curl -X POST https://www.pixelmojo.io/api/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "What is Vector?"}'
```
Context retrieval: The API scores all entities in your knowledge graph and all blog posts by keyword overlap with the question. No vector database, no embeddings. Simple word matching against entity keywords, post titles, descriptions, and tags. Top 3 entities and top 5 posts become the context.
LLM call: The context goes to GPT-4o-mini with strict instructions: answer from context only, never fabricate, return structured JSON.
Output: A structured response with everything an AI system needs to cite you.
/api/ask Response Structure
Every response includes:
- answer: 1-3 sentence answer grounded in your actual content
- sources: Blog post URLs that the answer draws from
- relatedEntities: Products, methodologies, and services connected to the question
- confidence: How well the context matched the question (0 to 1)
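As a TypeScript type, the response shape above looks roughly like this (field names come from the list; the exact types are an assumption):

```typescript
// Assumed shape of a /api/ask response, matching the fields described above.
interface AskResponse {
  answer: string            // 1-3 sentence answer grounded in site content
  sources: string[]         // blog post URLs the answer draws from
  relatedEntities: string[] // connected products, methodologies, services
  confidence: number        // how well the context matched, 0 to 1
}
```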
Rate Limiting
The endpoint uses the same in-memory Map pattern as our other API routes: 10 requests per minute per IP. Enough for legitimate AI agent queries, restrictive enough to prevent abuse.
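A minimal sketch of that pattern, assuming a single-instance deployment (the Map resets on redeploy, which is acceptable here):

```typescript
// In-memory sliding-window rate limiter: Map of IP -> request timestamps.
const WINDOW_MS = 60_000
const MAX_REQUESTS = 10
const hits = new Map<string, number[]>()

function isRateLimited(ip: string, now: number = Date.now()): boolean {
  // Keep only timestamps inside the last minute.
  const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS)
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent)
    return true // over the limit: reject this request
  }
  recent.push(now)
  hits.set(ip, recent)
  return false
}
```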
No Vector Database Required
For sites with under 100 pages of content, keyword matching works well. We tokenize the question into words, match against entity keywords and post metadata, and rank by overlap count. The LLM handles the rest. When your content scales beyond this, you can add embeddings later. Start simple.
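The tokenize-match-rank step can be sketched in a few lines. The `Doc` shape is illustrative; in practice the keywords come from your entity graph and post metadata:

```typescript
// Illustrative document shape: anything with a slug and keyword list.
interface Doc { slug: string; keywords: string[] }

// Split a question into lowercase word tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? [])
}

// Rank documents by how many of their keywords appear in the question,
// dropping zero-overlap documents and keeping the top N.
function rank(question: string, docs: Doc[], topN: number): Doc[] {
  const words = tokenize(question)
  return docs
    .map(doc => ({
      doc,
      score: doc.keywords.filter(k => words.has(k.toLowerCase())).length,
    }))
    .filter(({ score }) => score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ doc }) => doc)
}
```

The top-ranked entities and posts become the LLM context; nothing here needs an embedding model or a vector store.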
Feature 4: Auto-Generated FAQ Schema at Scale
FAQPage structured data is one of the highest-value schema types for AI citations. In our own GEO implementation, FAQ schema was the single highest-ROI change we made. When someone asks Perplexity a question and your page has a FAQPage schema that directly answers it, the citation probability increases significantly.
The problem: manually writing FAQ schema for every blog post takes approximately 15 minutes per post. With 45 posts, that is over 11 hours of work. And every new post needs it too.
The Automation Pipeline
FAQ Schema Automation Pipeline: 45 blog posts parsed → posts with manual FAQPage skipped → 3-5 genuine questions extracted per post → FAQPage schema auto-rendered on every blog page.
We built a CLI script that automates this:
- Read posts from Contentlayer (the build-time CMS that processes our MDX files)
- Skip posts that already have manual FAQPage schema in their frontmatter
- Send content to GPT-4o-mini with instructions to extract 3-5 genuine questions the article answers
- Write results to generated/faqs.json, keyed by post slug
- Blog page template auto-renders FAQPage schema for any post that has generated FAQs but no manual ones
The Script
```typescript
// scripts/generate-faqs.ts
// Usage:
//   npx tsx scripts/generate-faqs.ts           (generate for all missing)
//   npx tsx scripts/generate-faqs.ts --force   (regenerate all)
//   npx tsx scripts/generate-faqs.ts --slug=X  (single post)
```
The script sends a truncated version of each post (first 3,000 characters) to GPT-4o-mini with a strict prompt: "Extract 3-5 genuine questions this article answers. Answers must be 1-2 sentences from the article content. Never fabricate."
Rate limited to 1 request per second to stay within API limits. Results are committed to git alongside the Contentlayer output.
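The core loop looks roughly like this. `generateFaqs` stands in for the GPT-4o-mini call and the post shape for the Contentlayer output; both are assumptions for the sketch:

```typescript
type Faq = { question: string; answer: string }

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// One LLM call per post that lacks a manual FAQ, paced at 1 request/second,
// with each post truncated to its first 3,000 characters.
async function generateAll(
  posts: { slug: string; body: string; hasManualFaq: boolean }[],
  generateFaqs: (excerpt: string) => Promise<Faq[]>
): Promise<Record<string, Faq[]>> {
  const out: Record<string, Faq[]> = {}
  for (const post of posts) {
    if (post.hasManualFaq) continue // manual FAQs always take priority
    out[post.slug] = await generateFaqs(post.body.slice(0, 3000))
    await sleep(1000) // stay within API rate limits
  }
  return out
}
```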
Blog Page Integration
The blog page template imports the generated FAQ JSON and checks two conditions before rendering:
- The post does NOT have manual FAQPage schema in frontmatter
- The post DOES have entries in generated/faqs.json
If both conditions are true, it renders a FAQPage schema from the auto-generated data. Manual FAQs always take priority.
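A sketch of the gate plus the FAQPage render, assuming illustrative post and FAQ shapes (the Question/Answer/acceptedAnswer nesting is standard schema.org FAQPage markup):

```typescript
type Faq = { question: string; answer: string }

// Render generated FAQs only when there is no manual FAQPage schema
// and the generated file actually has entries for this slug.
function shouldRenderGeneratedFaq(
  post: { slug: string; hasManualFaq: boolean },
  generated: Record<string, Faq[]>
): boolean {
  return !post.hasManualFaq && (generated[post.slug]?.length ?? 0) > 0
}

// Build the schema.org FAQPage JSON-LD from question/answer pairs.
function toFaqPageSchema(faqs: Faq[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: faqs.map(f => ({
      '@type': 'Question',
      name: f.question,
      acceptedAnswer: { '@type': 'Answer', text: f.answer },
    })),
  }
}
```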
The Numbers
- 45 total blog posts
- ~15 already have manual FAQPage schema
- ~30 posts gain FAQ schema automatically
- 90-150 new question/answer pairs indexed by search engines
- Run time: under 2 minutes for all 30 posts
The Bot Strategy: Block Training, Allow Browsing
This stack works because of a deliberate robots.txt strategy that many sites get wrong. We covered the full bot segmentation approach in our GEO Playbook, but here is the summary.
Blocked (training bots):
- CCBot, anthropic-ai, cohere-ai, Omgili
These bots scrape content to train foundation models. Your content gets baked into model weights permanently, with no attribution, no link back, and no control over how it is used. Blocking them protects your intellectual property.
Allowed (browsing bots):
- GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, GoogleOther
These bots read your content at query time to generate answers. When they cite you, they link back to your site. This is the pipe that sends traffic and citations. You can verify which bots have access to your site using our free robots.txt Analyzer for AI. The AI Discoverability Stack makes this pipe dramatically more effective by giving these bots structured data instead of raw HTML.
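In robots.txt terms, the split looks like this. The fragment is abbreviated to two bots per group; the full segmentation is in the GEO Playbook:

```txt
# Training bots: content gets baked into model weights, no attribution
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Browsing bots: query-time retrieval, cited with links back
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /
```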
Implementation Checklist
If you want to build this for your own site, here is the order:
1. Connect Your JSON-LD (30 minutes)
Add @id URIs to existing schemas. Add url properties. Make provider reference your global Organization by @id. Add BreadcrumbList to product and service pages.
2. Create ai-plugin.json (15 minutes)
Static route handler at /.well-known/ai-plugin.json. Declare your llms.txt location, any API endpoints, and your AI policy. This is a one-time setup.
3. Build the Knowledge API (2 hours)
POST endpoint with rate limiting. Keyword-based context retrieval from whatever content system you use. LLM call with strict grounding instructions. Structured JSON response with sources and entities.
4. Automate FAQ Schema (1 hour)
CLI script that reads your posts, identifies gaps, generates Q&A pairs, and saves to a JSON file. Template integration that auto-renders FAQPage schema.
Total effort: approximately half a day. The ongoing cost is near zero: the knowledge API uses GPT-4o-mini (fractions of a cent per query), the FAQ generator runs only when you publish new posts, and everything else is static.
Testing Your Stack
Check ai-plugin.json
```bash
curl https://yoursite.com/.well-known/ai-plugin.json
```
Should return valid JSON with your capabilities declared.
Test the Knowledge API
```bash
curl -X POST https://yoursite.com/api/ask \
  -H 'Content-Type: application/json' \
  -d '{"question": "What does your company do?"}'
```
Should return a structured response with answer, sources, and entities.
Verify JSON-LD
View page source on your product pages. Look for @id properties and BreadcrumbList schemas.
Audit Overall Visibility
Use our free AI Crawl Checker to test your bot access, structured data, and llms.txt in one scan. The AI Readiness Score combines all signals into a single 0-100 score.
What This Means for Your Site
The AI Discoverability Stack is not about gaming AI search engines. It is about making your site a better, more structured source of information that AI systems can reliably query and cite.
Every feature serves a clear purpose:
- Connected JSON-LD makes your entity relationships machine-readable
- ai-plugin.json declares your capabilities so AI systems don't have to guess
- The Knowledge API gives AI agents a direct line instead of forcing them to parse HTML
- Auto FAQ schema makes every piece of content a potential citation target
The combined effect: your site becomes a primary source that AI search engines query and cite, rather than one of billions of pages they might scrape and paraphrase. If you are earlier in the journey, start with our SEO vs GEO vs AEO guide to understand the landscape, then work through the GEO Playbook for tactical foundations before building this stack.
Want to see how your site scores for AI discoverability?
- AI Crawl Checker: Test your bot access, structured data, and llms.txt
- AI Readiness Score: Get a unified 0-100 score across all AI visibility signals
- Contact Us: We build AI discoverability stacks for companies that want to own their narrative in AI search
