Author: Kyle Alba, Data Science Major, 2025 SDSU alum
Top 3 Things You’ll Learn in this Blog:
- How major LLMs like ChatGPT, Claude, Gemini, Perplexity, and Grok differ in how they retrieve, filter, and use information to answer queries.
- The role of human annotation and feedback in shaping how each model thinks.
- The four main types of information retrieval — static, augmented, RAG, and multimodal — and how each affects accuracy, depth, and context in AI responses.
Large Language Models (LLMs) have become central to how we find, filter, and act on information. But the way each model retrieves and processes data differs: some lean heavily on pretraining, others augment with live search, and some specialize in multimodal reasoning. All LLMs generate text, but how they retrieve information shapes whether that text is accurate, current, useful, and suited to your specific workflow.
This blog breaks down how ChatGPT, Grok, Gemini, Perplexity, and Claude approach data sources, human annotation, and information retrieval.
Model Training and Data Sources
Before we discuss information retrieval, it’s important to understand how each model is trained and where it gets its data. Here’s a short summary of each tool’s training approach and data sources.
- ChatGPT (OpenAI): Primarily trained on a massive mixture of licensed data, publicly available text, and human feedback. Retrieval plugins or browsing add web augmentation when enabled, but by default it leans on its pretrained knowledge.
- Claude (Anthropic): Trained with Constitutional AI principles, emphasizing safety and interpretability. Like ChatGPT, Claude uses pretraining corpora, but Anthropic highlights human annotation and iterative fine-tuning as central to its alignment.
- Gemini (Google DeepMind): Distinct for its multimodal capabilities (text, images, code). Its data sources span large-scale web corpora, licensed content, and Google’s structured data ecosystem. Gemini is designed to combine reasoning with retrieval across multiple modalities.
- Perplexity: The most “search-engine-like” of the five. It integrates real-time web queries with LLM reasoning, and every answer is tied to sources, making it strong for tasks where citation and freshness matter.
- Grok (xAI): Built for long-context reasoning and cultural alignment with X (formerly Twitter). Grok draws heavily on public internet data, X-platform discourse, and reasoning-heavy fine-tuning, emphasizing explanation and critical debate.
The Role of Human Annotation
Human annotation is the backbone of how and why all five of these LLMs think the way they do.
Through annotation, we train these models to reason the way we do, so choosing what to reinforce is critical to each model’s performance. Each model emphasizes a different aspect of human judgment, whether that is ethics, creativity, reasoning, or even comedy; an LLM is largely shaped by what we as humans teach it.
- OpenAI (ChatGPT) uses Reinforcement Learning from Human Feedback (RLHF) to align outputs. The model generates candidate responses to prompts, and human annotators rank or correct them, nudging the model toward preferred answers.
- Anthropic (Claude) emphasizes Constitutional AI, where human annotation is paired with a “constitution” of ethical principles. Claude’s training specifically reinforces ethics and values, with humans shaping the model’s responses to fit those parameters.
- Gemini integrates reinforcement learning with multimodal feedback (rating not just text, but also images and code). Like ChatGPT, Gemini is meant to be a general-purpose tool rather than a text-only assistant: it was trained on images, code, and literature, and each kind of response was reviewed and given feedback to improve the model’s reasoning.
- Perplexity uses less direct annotation; instead, humans help validate the responses and sources that its retrieval pipelines surface. Because Perplexity’s core function is retrieval, looking up sources and summarizing them much like the top of a Google results page, human review is believed to focus on making sure the gathered sources are accurate.
- Grok blends annotation with platform context; the feedback loop from engaged users on X helps shape its reasoning style. Grok aims for something closer to natural human speech and texting: it was trained on large volumes of X posts, with humans nudging its responses to conform to its principles. Because it was trained on so much human conversation, it can feel more lifelike than the others.
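To make the RLHF idea above concrete, here is a minimal sketch of the pairwise preference objective commonly used to train a reward model from annotator rankings (a Bradley–Terry style loss). The scores are made-up scalars standing in for a reward model's outputs; this is an illustration of the technique, not any lab's actual training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(r_chosen, r_rejected):
    # The annotator preferred `chosen`; the loss is small when the
    # reward model scores it above `rejected`, large otherwise.
    return -np.log(sigmoid(r_chosen - r_rejected))

# Reward model agrees with the human ranking -> low loss.
print(preference_loss(2.0, 0.5))
# Reward model disagrees with the human ranking -> high loss.
print(preference_loss(0.5, 2.0))
```

Training the reward model to minimize this loss over many annotated comparisons is what lets a later reinforcement-learning step "nudge" the LLM toward responses humans prefer.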
What Is Information Retrieval?
Information retrieval (IR) in the context of Large Language Models refers to how an AI system finds, selects, and uses relevant data to answer a query. It can mean:
- Searching inside the model’s pretrained memory (static knowledge).
- Augmenting with external live data (search engines, databases, APIs).
- Filtering, ranking, and citing results so they’re relevant, trustworthy, and current.
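The filtering-and-ranking step above can be sketched in a few lines. This toy example scores documents against a query using bag-of-words cosine similarity; the documents and query are invented for illustration, and real systems use learned embeddings rather than raw word counts.

```python
import math
from collections import Counter

# Invented mini-corpus standing in for a retrieval index.
DOCS = {
    "doc1": "llms retrieve data from pretrained weights",
    "doc2": "search engines return live web results",
    "doc3": "bananas are rich in potassium",
}

def vectorize(text):
    # Bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # Return document ids sorted by similarity to the query.
    qv = vectorize(query)
    return sorted(docs, key=lambda d: cosine(qv, vectorize(docs[d])),
                  reverse=True)

print(rank("live web search results", DOCS))  # doc2 ranks first
```

The same shape, embed, score, sort, take the top results, underlies everything from Perplexity's web retrieval to a local RAG index.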
Retrieval Levels in LLMs
Different LLMs operate at different “levels” of information retrieval, which affects their performance. Here’s a quick breakdown of the tiers represented by the models reviewed here.
Level 0: Static retrieval (pretrained memory)
The model answers only from weights and the training corpus. No live lookup.
- GPT-3.5 style offline models
- Local LLM inference without a retriever
Level 1: Augmented retrieval (external search)
The model supplements answers using web search engines or APIs. Still shallow context.
- ChatGPT 4o with web
- Perplexity basic queries
Level 2: RAG retrieval (retrieval-augmented generation)
The model selects specific documents or embeddings, brings them into context, and generates from the retrieved text rather than hallucinating filler.
- Claude with projects
- ChatGPT with persistent “Collections”
- Perplexity Pro deep citations
Level 3: Multimodal retrieval
The system retrieves across multiple formats, for example text, images, code, and video frames, and fuses them during reasoning.
- Gemini 2.0
- OpenAI o3 with image + code lookups
- Anthropic multimodal Claude (beta)
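The Level 2 (RAG) loop described above can be sketched as retrieve, stuff into the prompt, then generate from the retrieved text. In this sketch, `fake_search` and `fake_llm` are stand-ins I've invented for illustration; a real pipeline would use an embedding index and an actual LLM API call.

```python
# Invented passages standing in for an indexed document store.
CORPUS = [
    "Gemini is designed for multimodal retrieval across text and images.",
    "Perplexity ties every answer to live web sources.",
    "Claude emphasizes Constitutional AI and human annotation.",
]

def fake_search(query, corpus, k=2):
    # Stand-in retriever: rank passages by word overlap with the query.
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def fake_llm(prompt):
    # Stand-in generator: report how much context it was given.
    n = prompt.count("PASSAGE:")
    return f"(answer grounded in {n} retrieved passages)"

def rag_answer(query, corpus):
    passages = fake_search(query, corpus)
    context = "\n".join(f"PASSAGE: {p}" for p in passages)
    prompt = (f"{context}\nQUESTION: {query}\n"
              "Answer using only the passages above.")
    return fake_llm(prompt)

print(rag_answer("which model uses human annotation?", CORPUS))
```

The key design point is that the generator sees the retrieved text inside its context window and is instructed to answer from it, which is what reduces hallucinated filler compared to Level 0 static retrieval.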
Information Retrieval Types at a Glance
| Model | Retrieval Type | Strengths | Weaknesses |
|---|---|---|---|
| ChatGPT | Pretrained + optional web | Polished writing, summaries | Can be overly agreeable |
| Claude | Pretrained + human-guided | Step-by-step logic, safe | Can be overly strict |
| Gemini | Multimodal hybrid | Science & multimodal depth | Still a growing model |
| Perplexity | Real-time web retrieval | Citations, freshness | Relies on web queries |
| Grok | Context + cultural feedback | Debate, context-rich | Less precise |
Wrapping Up: Choosing the Right AI for Retrieval
Data retrieval hinges on context and depth rather than speed alone. The right model depends on the type of information you are trying to surface and the level of structure you need.
ChatGPT is strongest when the goal is polished writing and clear information framing. Claude excels when you need structured logic, code walkthroughs, or step-by-step explanations. Gemini specializes in multimodal retrieval and is most effective when the task spans text, images, charts, or scientific interpretation. Perplexity is optimized for freshness and verifiable sourcing, which makes it valuable for web research and citation. Grok is tuned for reasoning and interpretation, which makes it useful for analysis where deeper inference matters more than summarization.
By understanding how each model retrieves and processes data, you can build a workflow where each tool complements the others. The future isn’t about choosing one; it’s about mixing and matching to get the best from all.
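A mix-and-match workflow can be as simple as a task router. This toy sketch maps task types to the model recommended above; the task labels are my own shorthand for the table's categories, not any official taxonomy or API.

```python
# Task-to-model routing table mirroring the recommendations in this post.
ROUTER = {
    "polished writing": "ChatGPT",
    "step-by-step logic": "Claude",
    "multimodal analysis": "Gemini",
    "cited web research": "Perplexity",
    "open-ended debate": "Grok",
}

def pick_model(task, default="ChatGPT"):
    # Fall back to a general-purpose model for unlisted task types.
    return ROUTER.get(task, default)

print(pick_model("cited web research"))  # Perplexity
print(pick_model("meeting notes"))       # ChatGPT (default)
```

In practice the routing decision might be made by a classifier or even another LLM, but the principle is the same: match the retrieval style to the task.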
References
- Jha, Ashu. “Where Does ChatGPT Get Its Knowledge? The Untold Story of Data That Built an AI.” Medium. https://medium.com/@ashujha44/where-does-chatgpt-get-its-knowledge-the-untold-story-of-data-that-built-an-ai-73558013097d
- “Claude AI 101: What It Is and How It Works.” Grammarly Blog. https://www.grammarly.com/blog/ai/what-is-claude-ai/
- Google DeepMind. “Introducing Gemini: Our Next-Generation AI Model” (2023). https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0
- Perplexity AI. “How Does Perplexity Work?” (2024). https://www.perplexity.ai/help-center/en/articles/10352895-how-does-perplexity-work
- Ouyang, L., et al. “Training Language Models to Follow Instructions with Human Feedback.” arXiv:2203.02155 (2022). https://arxiv.org/abs/2203.02155
- Bai, Y., et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (2022). https://arxiv.org/abs/2212.08073