AI-Moderated Interviews: Literature Review

Part 1

At a Glance

Studies Tracked

18

academic + working papers

Prefer AI

55%

of respondents, over humans

Coding Consistency

0.97

LLM correlation across runs

Human Coder Baseline

0.75

inter-coder correlation

Opportunity

Threat

Concern

Benefit

Context

Part 2

The Story So Far

1 / 7

Part 3

The Full Corpus

Every paper tracked in this review. Tier 1 = essential reading. Tier 2 = important context. Tier 3 = background reference.

Tier	Study	Year	Focus	Key Finding
Tier 1	Geiecke & Jaravel LSE & CEPR	2026	AI vs. expert interviews; 5 large-scale studies; text & voice	AI approaches expert-level quality. 55% of respondents prefer AI. Open-source platform released.
Tier 1	Cuevas et al. Working Paper	2025	LLM interviewer vs. static follow-up baseline	Adaptive AI wins on satisfaction. Does not win on qualitative richness of data.
Tier 1	Guven et al. Working Paper	2025	AI vs. trained-student interviewers; content & quality	AI matches junior interviewers on quality. The realistic commercial comparator.
Tier 1	Zhang et al. Working Paper	2024	LLM use by respondents on online survey platforms	Respondents using ChatGPT to answer surveys is a growing, hard-to-detect problem.
Tier 1	Park et al. arXiv	2024	Generative agent simulations of 1,000 real people	Simulated agents match real people's survey responses. A conceptual alternative to real interviews.
Tier 2	Duraj et al. Working Paper	2025	Expert human interviews on stock market participation	The "ground truth" study that AI interviews later replicated, confirming validity.
Tier 2	Wuttke et al. Working Paper	2025	Small-scale AI vs. human with university students	Earliest comparison, but limited by small convenience sample of interview-interested students.
Tier 2	Stantcheva Annual Review of Economics	2023	Comprehensive guide to survey design in economics	The methodological foundation. AI interviews are the logical next step beyond open-ended fields.
Tier 2	Ferrario & Stantcheva AEA Papers & Proceedings	2022	Text analysis of open-ended survey questions at scale	The "why now" paper: NLP made open-ended questions viable. LLMs make interviews viable.
Tier 2	Ludwig & Mullainathan Quarterly Journal of Economics	2024	Machine learning for hypothesis generation	ML can generate research hypotheses from qualitative data. A premium feature for AI interview platforms.
Tier 2	Manning, Zhu & Horton NBER Working Paper	2024	End-to-end automated social science	AI can automate hypothesis → experiment → analysis. Interview platforms could own the data-collection layer.
Tier 3	Horton NBER Working Paper	2023	LLMs as simulated economic agents (Homo Silicus)	Can AI simulate respondents? Conceptual framing for why real interviews still matter.
Tier 3	Korinek Journal of Economic Literature	2023	Generative AI for economic research: use cases	Broad review of how generative AI is being used across economics research. Establishes academic legitimacy.
Tier 3	Small & McCrory Calarco UC Press (Book)	2022	Evaluating quality in qualitative research	The standard for qualitative rigor. AI platforms will be judged against these frameworks.
Tier 3	Fernando et al. arXiv	2023	Promptbreeder: automated prompt evolution	Self-improving prompts. Future direction for auto-optimizing interview guides.
Tier 3	Dominguez-Olmedo et al. Working Paper	2023	LLMs answering surveys as simulated respondents	What happens when AI answers its own questions. Contamination and validity concerns.
Tier 3	Lagakos, Michalopoulos & Voth NBER Working Paper	2025	Large-scale qualitative data in economics (historical life histories)	Demonstrates appetite for qualitative-at-scale in top research. Market validation.
Tier 3	Tranchero et al. Working Paper	2024	LLMs for theory building from qualitative data	AI can help build theory from interview transcripts. Analysis-layer opportunity.

Part 4

The Five Studies That Matter Most

Detailed summaries of the papers with the highest strategic relevance.

Conversations at Scale: Robust AI-Led Interviews

2026

Friedrich Geiecke & Xavier Jaravel, LSE & CEPR

SSRN Working Paper (February 2026) • 185 pages • Open-source platform released

The most rigorous validation of AI-moderated interviews to date. The authors built a single-agent LLM interviewing platform and tested it across five large-scale studies covering decision-making, political views, subjective well-being, and mental models of public policy. They benchmarked against face-to-face interviews conducted by trained sociologists, not students, not online moderators.

The platform uses a simple system prompt (no multi-agent architecture), supports both text and voice via native audio LLMs, and is released as open-source code. They tested GPT-4, GPT-4o, Llama, and other models.

Bottom line: AI interviews approach expert quality, respondents prefer them (55%), analysis is more consistent than human coding (0.97 vs 0.75), and voice modality works. The authors position this as a complement to traditional methods, not a replacement. They flag sampling contamination (respondents using LLMs), model discontinuation risks, and the irreplaceability of deep ethnographic rapport.

Opportunity Benefit

LLM Interviewer vs. Naive Baseline

2025

Cuevas et al.

Working Paper (2025)

Tested whether an LLM that adapts follow-up questions based on participant responses outperforms a "naive baseline" where follow-ups are always the same. This isolates the specific value of adaptive probing, the thing that distinguishes AI-moderated interviews from structured surveys.

Bottom line: Respondents were more satisfied with the adaptive AI interviewer. But the data it collected did not score higher on "qualitative richness" metrics. This is an important distinction: people enjoy the conversation more, but enjoyment doesn't automatically produce better research data. Platforms must measure both and not conflate them.

Concern Benefit

Generative Agent Simulations of 1,000 People

2024

Park, Zou, Shaw, Hill, Cai, Morris, Willer, Liang & Bernstein

arXiv: 2411.10109

Built generative agents that simulate 1,000 real individuals. Each agent is initialized with two hours of interview data and then matches the real person's survey responses, personality measures, and experimental behavior with meaningful fidelity.

Bottom line: The most developed version of the "why interview real people if you can simulate them?" argument. The paper's own limitations are revealing: simulations match stated preferences but not deep motivations, contradictions, or surprises. AI-moderated interviews surface the things simulations cannot: the unexpected, the contradictory, the genuinely human.

Threat

LLM Use by Respondents on Survey Platforms

2024

Zhang et al.

Working Paper (2024)

Investigated a growing problem: respondents on platforms like Prolific and MTurk are using ChatGPT to generate their answers to open-ended survey questions. As LLMs improve, detecting this becomes harder.

Bottom line: This is both the biggest data-quality threat to AI-moderated research and the best argument for it. Static text fields can't probe for consistency. An AI interviewer that asks follow-up questions, requests specifics, and checks for contradictions is inherently better at detecting LLM-generated responses. The threat creates the demand for the solution.

Threat Concern

AI-Led vs. Human-Led Chat-Based Interviews

2025

Guven, Gardhus, Bjerre-Nielsen & Carlsen

Working Paper (2025)

Compared AI-led interviews against online interviews conducted by trained students, not expert sociologists. Focused on both content quality and respondent experience. This is the most commercially relevant comparator because most real-world research is conducted by relatively junior teams.

Bottom line: The question isn't whether AI beats the world's best human interviewers. It's whether AI beats the average research team. This paper addresses that more realistic benchmark.

Opportunity Concern

Part 5

What This Means

What the literature, taken together, means for this market.

Opportunities

Academic market is ready. Open-source tools exist but lack polish, support, and features researchers actually need.
Voice-native modality. Most tools are text-first. Native voice AI is a clear differentiator and respondent preference.
Auto hypothesis generation. Transcripts can be fed to LLMs to surface novel research hypotheses, making this a premium-tier feature.
Enterprise qual at scale. Literature focuses on social science. UX research, brand, CX, and customer discovery are untouched.
LLM-response detection. Adaptive questioning is inherently better at detecting fake responses than static surveys.

Threats

Open-source competition. The seed paper released a full platform. Technical teams can replicate core functionality for free.
Simulated respondents. If generative agent simulations improve, some buyers may skip real interviews entirely.
Commoditization. A single-prompt architecture means low barriers to entry. Moats must be in UX, analysis, and trust.
Model dependency. Building on third-party LLMs creates pricing, capability, and discontinuation exposure.

Evidence-Backed Claims

AI interviews approach expert-level quality across multiple validated metrics
55% of respondents prefer AI interviewers for future studies
Automated transcript analysis is 29% more consistent than human coding
AI interviews replicate findings from expert human interviews
Voice AI interviews are viable and preferred by a majority of participants
Adaptive questioning improves respondent experience over static follow-ups

Open Concerns

Satisfaction ≠ richness. People liking the AI doesn't automatically mean richer data.
Thin evidence base. One landmark paper, several small working papers. The field needs replication.
Non-verbal cues are lost. Even voice AI can't read body language or build trust through presence.
Bias in probing. LLM biases could systematically shape follow-up questions across demographic groups.
AI-on-AI risk. If respondents and interviewers are both AI, the research loop closes on itself.

Part 6

What Nobody Has Studied Yet

Gaps in the literature that represent both open questions and research partnership opportunities.

1

Non-WEIRD populations. Every published study uses Western, Educated, Industrialized, Rich, Democratic samples recruited from Prolific or similar platforms. AI moderation has not been validated with diverse global populations, lower-literacy groups, or elderly respondents.

2

Sensitive topics. No study has tested AI moderation on trauma, addiction, abuse, health, or illegal behavior, all topics where interviewer trust and rapport are most critical.

3

Commercial use cases. The entire literature is academic social science. No published validation exists for UX research, brand tracking, customer discovery, or market research applications.

4

Longitudinal comparison. No study has compared AI vs. human interviews across the same respondents over time (panel studies) to measure consistency and depth evolution.

5

Interviewer "personality" effects. Human interviewer demographics and warmth affect responses. Whether the AI's apparent personality, tone, or perceived identity similarly affects outcomes is untested.

6

Cost-effectiveness analysis. Everyone claims AI is cheaper. No one has published a rigorous comparison accounting for platform costs, sampling, analysis time, and quality-adjusted outcomes.