The Observatory × Dialogue AI | Lit Review & Repository May 2026
White Paper Repository
AI-Moderated Interviews:
What the Literature Says
A sweep of published and working papers on using large language models to conduct qualitative interviews at scale. What works, what doesn't, and what it means.
At a Glance
Studies Tracked
18
academic + working papers
Prefer AI
55%
of respondents, over humans
Coding Consistency
0.97
LLM correlation across runs
Human Coder Baseline
0.75
inter-coder correlation
Opportunity
Threat
Concern
Benefit
Context
The Story So Far
1 / 7
The Full Corpus
Every paper tracked in this review. Tier 1 = essential reading. Tier 2 = important context. Tier 3 = background reference.
Tier Study Year Focus Key Finding Tags
Tier 1 Geiecke & Jaravel
LSE & CEPR
2026 AI vs. expert interviews; 5 large-scale studies; text & voice AI approaches expert-level quality. 55% of respondents prefer AI. Open-source platform released.
Tier 1 Cuevas et al.
Working Paper
2025 LLM interviewer vs. static follow-up baseline Adaptive AI wins on satisfaction. Does not win on qualitative richness of data.
Tier 1 Guven et al.
Working Paper
2025 AI vs. trained-student interviewers; content & quality AI matches junior interviewers on quality. The realistic commercial comparator.
Tier 1 Zhang et al.
Working Paper
2024 LLM use by respondents on online survey platforms Respondents using ChatGPT to answer surveys is a growing, hard-to-detect problem.
Tier 1 Park et al.
arXiv
2024 Generative agent simulations of 1,000 real people Simulated agents match real people's survey responses. A conceptual alternative to real interviews.
Tier 2 Duraj et al.
Working Paper
2025 Expert human interviews on stock market participation The "ground truth" study that AI interviews later replicated, confirming validity.
Tier 2 Wuttke et al.
Working Paper
2025 Small-scale AI vs. human with university students Earliest comparison, but limited by small convenience sample of interview-interested students.
Tier 2 Stantcheva
Annual Review of Economics
2023 Comprehensive guide to survey design in economics The methodological foundation. AI interviews are the logical next step beyond open-ended fields.
Tier 2 Ferrario & Stantcheva
AEA Papers & Proceedings
2022 Text analysis of open-ended survey questions at scale The "why now" paper: NLP made open-ended questions viable. LLMs make interviews viable.
Tier 2 Ludwig & Mullainathan
Quarterly Journal of Economics
2024 Machine learning for hypothesis generation ML can generate research hypotheses from qualitative data. A premium feature for AI interview platforms.
Tier 2 Manning, Zhu & Horton
NBER Working Paper
2024 End-to-end automated social science AI can automate hypothesis → experiment → analysis. Interview platforms could own the data-collection layer.
Tier 3 Horton
NBER Working Paper
2023 LLMs as simulated economic agents (Homo Silicus) Can AI simulate respondents? Conceptual framing for why real interviews still matter.
Tier 3 Korinek
Journal of Economic Literature
2023 Generative AI for economic research: use cases Broad review of how generative AI is being used across economics research. Establishes academic legitimacy.
Tier 3 Small & McCrory Calarco
UC Press (Book)
2022 Evaluating quality in qualitative research The standard for qualitative rigor. AI platforms will be judged against these frameworks.
Tier 3 Fernando et al.
arXiv
2023 Promptbreeder: automated prompt evolution Self-improving prompts. Future direction for auto-optimizing interview guides.
Tier 3 Dominguez-Olmedo et al.
Working Paper
2023 LLMs answering surveys as simulated respondents What happens when AI answers its own questions. Contamination and validity concerns.
Tier 3 Lagakos, Michalopoulos & Voth
NBER Working Paper
2025 Large-scale qualitative data in economics (historical life histories) Demonstrates appetite for qualitative-at-scale in top research. Market validation.
Tier 3 Tranchero et al.
Working Paper
2024 LLMs for theory building from qualitative data AI can help build theory from interview transcripts. Analysis-layer opportunity.
The Five Studies That Matter Most
Detailed summaries of the papers with the highest strategic relevance.
Friedrich Geiecke & Xavier Jaravel, LSE & CEPR
SSRN Working Paper (February 2026) • 185 pages • Open-source platform released

The most rigorous validation of AI-moderated interviews to date. The authors built a single-agent LLM interviewing platform and tested it across five large-scale studies covering decision-making, political views, subjective well-being, and mental models of public policy. They benchmarked against face-to-face interviews conducted by trained sociologists, not students, not online moderators.

The platform uses a simple system prompt (no multi-agent architecture), supports both text and voice via native audio LLMs, and is released as open-source code. They tested GPT-4, GPT-4o, Llama, and other models.

Bottom line: AI interviews approach expert quality, respondents prefer them (55%), analysis is more consistent than human coding (0.97 vs 0.75), and voice modality works. The authors position this as a complement to traditional methods, not a replacement. They flag sampling contamination (respondents using LLMs), model discontinuation risks, and the irreplaceability of deep ethnographic rapport.
Opportunity Benefit
Cuevas et al.
Working Paper (2025)

Tested whether an LLM that adapts follow-up questions based on participant responses outperforms a "naive baseline" where follow-ups are always the same. This isolates the specific value of adaptive probing, the thing that distinguishes AI-moderated interviews from structured surveys.

Bottom line: Respondents were more satisfied with the adaptive AI interviewer. But the data it collected did not score higher on "qualitative richness" metrics. This is an important distinction: people enjoy the conversation more, but enjoyment doesn't automatically produce better research data. Platforms must measure both and not conflate them.
Concern Benefit
Park, Zou, Shaw, Hill, Cai, Morris, Willer, Liang & Bernstein
arXiv: 2411.10109

Built generative agents that simulate 1,000 real individuals. Each agent is initialized with two hours of interview data and then matches the real person's survey responses, personality measures, and experimental behavior with meaningful fidelity.

Bottom line: The most developed version of the "why interview real people if you can simulate them?" argument. The paper's own limitations are revealing: simulations match stated preferences but not deep motivations, contradictions, or surprises. AI-moderated interviews surface the things simulations cannot: the unexpected, the contradictory, the genuinely human.
Threat
Zhang et al.
Working Paper (2024)

Investigated a growing problem: respondents on platforms like Prolific and MTurk are using ChatGPT to generate their answers to open-ended survey questions. As LLMs improve, detecting this becomes harder.

Bottom line: This is both the biggest data-quality threat to AI-moderated research and the best argument for it. Static text fields can't probe for consistency. An AI interviewer that asks follow-up questions, requests specifics, and checks for contradictions is inherently better at detecting LLM-generated responses. The threat creates the demand for the solution.
Threat Concern
Guven, Gardhus, Bjerre-Nielsen & Carlsen
Working Paper (2025)

Compared AI-led interviews against online interviews conducted by trained students, not expert sociologists. Focused on both content quality and respondent experience. This is the most commercially relevant comparator because most real-world research is conducted by relatively junior teams.

Bottom line: The question isn't whether AI beats the world's best human interviewers. It's whether AI beats the average research team. This paper addresses that more realistic benchmark.
Opportunity Concern
What This Means
What the literature, taken together, means for this market.

Opportunities

  • Academic market is ready. Open-source tools exist but lack polish, support, and features researchers actually need.
  • Voice-native modality. Most tools are text-first. Native voice AI is a clear differentiator and respondent preference.
  • Auto hypothesis generation. Transcripts can be fed to LLMs to surface novel research hypotheses, making this a premium-tier feature.
  • Enterprise qual at scale. Literature focuses on social science. UX research, brand, CX, and customer discovery are untouched.
  • LLM-response detection. Adaptive questioning is inherently better at detecting fake responses than static surveys.

Threats

  • Open-source competition. The seed paper released a full platform. Technical teams can replicate core functionality for free.
  • Simulated respondents. If generative agent simulations improve, some buyers may skip real interviews entirely.
  • Commoditization. A single-prompt architecture means low barriers to entry. Moats must be in UX, analysis, and trust.
  • Model dependency. Building on third-party LLMs creates pricing, capability, and discontinuation exposure.

Evidence-Backed Claims

  • AI interviews approach expert-level quality across multiple validated metrics
  • 55% of respondents prefer AI interviewers for future studies
  • Automated transcript analysis is 29% more consistent than human coding
  • AI interviews replicate findings from expert human interviews
  • Voice AI interviews are viable and preferred by a majority of participants
  • Adaptive questioning improves respondent experience over static follow-ups

Open Concerns

  • Satisfaction ≠ richness. People liking the AI doesn't automatically mean richer data.
  • Thin evidence base. One landmark paper, several small working papers. The field needs replication.
  • Non-verbal cues are lost. Even voice AI can't read body language or build trust through presence.
  • Bias in probing. LLM biases could systematically shape follow-up questions across demographic groups.
  • AI-on-AI risk. If respondents and interviewers are both AI, the research loop closes on itself.
What Nobody Has Studied Yet
Gaps in the literature that represent both open questions and research partnership opportunities.
1
Non-WEIRD populations. Every published study uses Western, Educated, Industrialized, Rich, Democratic samples recruited from Prolific or similar platforms. AI moderation has not been validated with diverse global populations, lower-literacy groups, or elderly respondents.
2
Sensitive topics. No study has tested AI moderation on trauma, addiction, abuse, health, or illegal behavior, all topics where interviewer trust and rapport are most critical.
3
Commercial use cases. The entire literature is academic social science. No published validation exists for UX research, brand tracking, customer discovery, or market research applications.
4
Longitudinal comparison. No study has compared AI vs. human interviews across the same respondents over time (panel studies) to measure consistency and depth evolution.
5
Interviewer "personality" effects. Human interviewer demographics and warmth affect responses. Whether the AI's apparent personality, tone, or perceived identity similarly affects outcomes is untested.
6
Cost-effectiveness analysis. Everyone claims AI is cheaper. No one has published a rigorous comparison accounting for platform costs, sampling, analysis time, and quality-adjusted outcomes.