| Tier | Study | Year | Focus | Key Finding | Tags |
|---|---|---|---|---|---|
| Tier 1 | Geiecke & Jaravel LSE & CEPR |
2026 | AI vs. expert interviews; 5 large-scale studies; text & voice | AI approaches expert-level quality. 55% of respondents prefer AI. Open-source platform released. | |
| Tier 1 | Cuevas et al. Working Paper |
2025 | LLM interviewer vs. static follow-up baseline | Adaptive AI wins on satisfaction. Does not win on qualitative richness of data. | |
| Tier 1 | Guven et al. Working Paper |
2025 | AI vs. trained-student interviewers; content & quality | AI matches junior interviewers on quality. The realistic commercial comparator. | |
| Tier 1 | Zhang et al. Working Paper |
2024 | LLM use by respondents on online survey platforms | Respondents using ChatGPT to answer surveys is a growing, hard-to-detect problem. | |
| Tier 1 | Park et al. arXiv |
2024 | Generative agent simulations of 1,000 real people | Simulated agents match real people's survey responses. A conceptual alternative to real interviews. | |
| Tier 2 | Duraj et al. Working Paper |
2025 | Expert human interviews on stock market participation | The "ground truth" study that AI interviews later replicated, confirming validity. | |
| Tier 2 | Wuttke et al. Working Paper |
2025 | Small-scale AI vs. human with university students | Earliest comparison, but limited by small convenience sample of interview-interested students. | |
| Tier 2 | Stantcheva Annual Review of Economics |
2023 | Comprehensive guide to survey design in economics | The methodological foundation. AI interviews are the logical next step beyond open-ended fields. | |
| Tier 2 | Ferrario & Stantcheva AEA Papers & Proceedings |
2022 | Text analysis of open-ended survey questions at scale | The "why now" paper: NLP made open-ended questions viable. LLMs make interviews viable. | |
| Tier 2 | Ludwig & Mullainathan Quarterly Journal of Economics |
2024 | Machine learning for hypothesis generation | ML can generate research hypotheses from qualitative data. A premium feature for AI interview platforms. | |
| Tier 2 | Manning, Zhu & Horton NBER Working Paper |
2024 | End-to-end automated social science | AI can automate hypothesis → experiment → analysis. Interview platforms could own the data-collection layer. | |
| Tier 3 | Horton NBER Working Paper |
2023 | LLMs as simulated economic agents (Homo Silicus) | Can AI simulate respondents? Conceptual framing for why real interviews still matter. | |
| Tier 3 | Korinek Journal of Economic Literature |
2023 | Generative AI for economic research: use cases | Broad review of how generative AI is being used across economics research. Establishes academic legitimacy. | |
| Tier 3 | Small & McCrory Calarco UC Press (Book) |
2022 | Evaluating quality in qualitative research | The standard for qualitative rigor. AI platforms will be judged against these frameworks. | |
| Tier 3 | Fernando et al. arXiv |
2023 | Promptbreeder: automated prompt evolution | Self-improving prompts. Future direction for auto-optimizing interview guides. | |
| Tier 3 | Dominguez-Olmedo et al. Working Paper |
2023 | LLMs answering surveys as simulated respondents | What happens when AI answers its own questions. Contamination and validity concerns. | |
| Tier 3 | Lagakos, Michalopoulos & Voth NBER Working Paper |
2025 | Large-scale qualitative data in economics (historical life histories) | Demonstrates appetite for qualitative-at-scale in top research. Market validation. | |
| Tier 3 | Tranchero et al. Working Paper |
2024 | LLMs for theory building from qualitative data | AI can help build theory from interview transcripts. Analysis-layer opportunity. |
The most rigorous validation of AI-moderated interviews to date. The authors built a single-agent LLM interviewing platform and tested it across five large-scale studies covering decision-making, political views, subjective well-being, and mental models of public policy. They benchmarked against face-to-face interviews conducted by trained sociologists, not students, not online moderators.
The platform uses a simple system prompt (no multi-agent architecture), supports both text and voice via native audio LLMs, and is released as open-source code. They tested GPT-4, GPT-4o, Llama, and other models.
Tested whether an LLM that adapts follow-up questions based on participant responses outperforms a "naive baseline" where follow-ups are always the same. This isolates the specific value of adaptive probing, the thing that distinguishes AI-moderated interviews from structured surveys.
Built generative agents that simulate 1,000 real individuals. Each agent is initialized with two hours of interview data and then matches the real person's survey responses, personality measures, and experimental behavior with meaningful fidelity.
Investigated a growing problem: respondents on platforms like Prolific and MTurk are using ChatGPT to generate their answers to open-ended survey questions. As LLMs improve, detecting this becomes harder.
Compared AI-led interviews against online interviews conducted by trained students, not expert sociologists. Focused on both content quality and respondent experience. This is the most commercially relevant comparator because most real-world research is conducted by relatively junior teams.