Here's where the practitioner playbook diverges from the commentary. If you want to show up in AI answers, you need to understand how LLMs acquire and retrieve information. There are two distinct pipelines, and they work differently.
Pipeline 1: Training Data (Common Crawl and Harmonic Centrality)
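Most LLMs are built on Common Crawl, a public web archive that has been crawling the web since 2008. The numbers are stark: a 2024 Mozilla Foundation analysis of 47 LLMs found that at least 64% were trained on Common Crawl data. In GPT-3's training dataset, over 80% of the tokens came from filtered Common Crawl (about 410 billion of 499 billion), although the training mix sampled it at a 60% weight.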
But Common Crawl doesn't treat all pages equally. It prioritizes what to crawl using a metric called Harmonic Centrality (HC) – a measure of how central a domain is in the web's link graph. A domain's HC is the sum of the reciprocals of the link distances from every other domain to it, so domains that can be reached in few hops from many places score highest. Higher HC means more frequent crawling and deeper page coverage, which means more of your content ends up in training data.
This has a practical consequence. A dofollow link from a high-HC domain doesn't just help your traditional search rankings – it shifts your position in the web graph that determines how much of your content LLMs will train on.
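To see why a single link can move the needle, here is a minimal Python sketch of harmonic centrality on a toy link graph. It is an illustration of the metric only, not Common Crawl's actual pipeline, which runs on a web graph of many millions of domains. The `.example` domains, the edge list, and the function name `harmonic_centrality` are all hypothetical.

```python
from collections import deque


def harmonic_centrality(edges):
    """Score each node as the sum of 1/d(u, node) over every other node u
    that can reach it by following links. Unreachable nodes contribute 0."""
    nodes = set()
    incoming = {}  # node -> set of nodes that link to it
    for src, dst in edges:
        nodes.update((src, dst))
        incoming.setdefault(dst, set()).add(src)

    scores = {}
    for target in nodes:
        # Breadth-first search backwards along links gives d(u, target).
        dist = {target: 0}
        queue = deque([target])
        while queue:
            node = queue.popleft()
            for prev in incoming.get(node, ()):
                if prev not in dist:
                    dist[prev] = dist[node] + 1
                    queue.append(prev)
        scores[target] = sum(1.0 / d for d in dist.values() if d > 0)
    return scores


# Hypothetical toy web: (source, destination) means "source links to destination".
BASE_LINKS = [
    ("blog.example", "hub.example"),
    ("news.example", "hub.example"),
    ("blog.example", "news.example"),
    ("yoursite.example", "blog.example"),  # you link out, nobody links to you
]

before = harmonic_centrality(BASE_LINKS)
after = harmonic_centrality(BASE_LINKS + [("hub.example", "yoursite.example")])

print("before:", before["yoursite.example"])
print("after: ", after["yoursite.example"])
```

In this toy graph, one link from the hub makes your site reachable from everything that already reaches the hub, which is the mechanism the paragraph above describes. Real HC ranks depend on the whole graph, so treat the size of the jump here as illustrative only.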
The top of the HC ranking won't surprise you: google.com, youtube.com, wikipedia.org, github.com, wordpress.org. But the operative insight is that these are the domains whose link neighborhoods determine what gets into training data. A dofollow link from wordpress.org (HC #13) doesn't just pass PageRank – it shortens your link distance from a highly central domain, which raises your own HC and pushes Common Crawl's prioritization toward crawling your site more thoroughly.
Pipeline 2: Real-Time Citation (Google-Extended)
Training data is a slow-moving signal – models retrain on cycles, not daily. The real-time pipeline is governed by Google-Extended, which is controlled separately from the traditional Googlebot. Google's crawler documentation describes Google-Extended as a robots.txt product token rather than a crawler with its own user-agent string: Google's existing crawlers do the fetching, and the token decides whether that content may be used for Gemini, including grounding its answers. Treat this pipeline as working from raw HTML (not rendered JavaScript).
This is the pipeline that determines whether you get cited with a link in an AI answer today, not six months from now when the next model is trained.
Key detail: the Google-Extended and Googlebot rules are independent. Being indexed through Googlebot doesn't mean your content is available to Gemini: a robots.txt rule that disallows Google-Extended opts you out of this pipeline without affecting Search. And if the pipeline works from raw HTML, content that only appears after JavaScript rendering may not be visible to it at all.
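Both points can be checked before you worry about anything else. The sketch below uses only Python's standard library and makes two assumptions worth stating: Google's crawler documentation lists Google-Extended as a robots.txt token, so the first check asks whether your rules allow it, and the second check treats "raw HTML" as the text outside script and style tags in the initial response. The sample robots.txt, the HTML snippets, the `example.com` URL, and the function names are hypothetical, and Python's `urllib.robotparser` is not Google's parser, so edge-case rule matching may differ.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def token_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `token` for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """Return True if `phrase` appears in the page text without running any JS."""
    extractor = _VisibleText()
    extractor.feed(html)
    extractor.close()
    return phrase.lower() in " ".join(extractor.chunks).lower()


# Hypothetical inputs for illustration.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""
CLIENT_RENDERED = (
    '<html><body><div id="app"></div>'
    '<script>render("Plans start at $10")</script></body></html>'
)
SERVER_RENDERED = (
    '<html><body><div id="app"><p>Plans start at $10</p></div></body></html>'
)

page = "https://example.com/pricing"
print("Googlebot allowed:      ", token_allowed(ROBOTS_TXT, "Googlebot", page))
print("Google-Extended allowed:", token_allowed(ROBOTS_TXT, "Google-Extended", page))
print("In client-rendered HTML:", phrase_in_raw_html(CLIENT_RENDERED, "plans start at $10"))
print("In server-rendered HTML:", phrase_in_raw_html(SERVER_RENDERED, "plans start at $10"))
```

In practice you would feed these functions your live robots.txt and the response body from a plain HTTP fetch, then test the specific claims you want cited.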
What the Citation Data Shows
Research analyzing over 150,000 LLM citations reveals patterns that should shape strategy:
Reddit accounts for 40.1% of all LLM citations. Wikipedia is second at 26.3%. This is not intuitive – but it makes sense when you consider that LLMs value recently discussed, contextually rich, human-generated content.
Position 1 in Google correlates with roughly a 46-48% probability of LLM citation. Traditional ranking still matters, but it's not the whole picture.
37% of domains cited by AI don't appear in the traditional SERP at all. This is the most underappreciated finding. You can be invisible in classic search and still get cited by Gemini if your content appears on high-authority platforms that LLMs reference.
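You can measure the same gap for your own queries. The following is a rough Python sketch, assuming you have already collected two lists per query: the domains an AI answer cited and the domains on the traditional results page. The domain lists and the function name `serp_overlap` are hypothetical, and the output describes only the sample you feed it, not the 37% figure above.

```python
def serp_overlap(cited_domains, serp_domains):
    """Compare domains cited in an AI answer with domains ranking in the SERP."""
    cited = {d.strip().lower() for d in cited_domains}
    serp = {d.strip().lower() for d in serp_domains}
    outside = cited - serp  # cited by the AI answer, absent from the SERP
    share = len(outside) / len(cited) if cited else 0.0
    return {
        "cited": len(cited),
        "also_in_serp": len(cited & serp),
        "outside_serp": sorted(outside),
        "share_outside_serp": share,
    }


# Hypothetical data for a single query.
cited = ["reddit.com", "wikipedia.org", "vendor.example", "forum.example"]
serp = ["vendor.example", "wikipedia.org", "review.example"]

report = serp_overlap(cited, serp)
print(report)
```

Run across a batch of queries, the `outside_serp` lists show which platforms are earning citations without ranking, which is where the separate citation pathway is coming from.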
The practical takeaway: traditional SERP optimization and AI citation optimization overlap but aren't identical. Ranking well helps. But entity presence on high-HC domains – through genuine mentions, UGC contributions, marketplace listings – creates a separate citation pathway.