Large-scale human factors research for the AirPods product line — designing the study, analyzing the data, and translating findings into design decisions that measurably improved the next generation.
Human Factors Researcher
Apple (via Sasken Inc.)
Austin, TX
Jan 2025 - Present
This case study covers my research process, methodology, and outcomes only. Product-specific details, internal materials, and findings are covered by NDA and are not included here.
I lead large-scale human factors studies for the AirPods product team, evaluating device fit, comfort, and motion stability under real-world usage conditions. With 400+ participants, each completing a 3-hour session, the work sits at the intersection of physical ergonomics and behavioral research: translating large, complex datasets into actionable direction for hardware and design teams.
The goal was to move from anecdotal complaints about the prior generation to rigorous, quantified evidence that could directly drive spec changes. Findings were synthesized into executive narratives and design recommendations that influenced the product roadmap.
400+ participants
Real-world conditions
3 hrs
Per participant
120+ questions
Per instrument
3 failure modes
Fit, motion, comfort
Designing a rigorous survey at this scale means making a lot of decisions that most people never see. I built the 120+ question instrument around four constraints: question logic that skipped irrelevant items based on prior responses, calibrated scaling standards to prevent common response biases, explicit bias controls including reverse-coded items and balanced framing, and pilot testing before deployment to catch clarity and load issues.
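A minimal sketch of how skip logic and reverse-coded items of this kind might be represented, written in Python for illustration. The question ids, the wording, and the 5-point scale are hypothetical and are not taken from the actual instrument.

```python
from dataclasses import dataclass
from typing import Callable, Optional

LIKERT_MIN, LIKERT_MAX = 1, 5  # assumption: 5-point agreement scale


@dataclass
class Question:
    qid: str
    text: str
    reverse_coded: bool = False
    # Predicate over earlier answers; None means the item is always shown.
    show_if: Optional[Callable[[dict], bool]] = None


# Hypothetical items, invented for this example only.
QUESTIONS = [
    Question("q1", "The device felt secure while I moved."),
    Question("q2", "Did you adjust the device during the session? (yes/no)"),
    Question("q3", "Adjusting the device interrupted what I was doing.",
             show_if=lambda answers: answers.get("q2") == "yes"),
    Question("q4", "The device felt loose while I moved.", reverse_coded=True),
]


def visible_questions(questions, answers):
    """Return the ids of items whose skip condition passes for these answers."""
    return [q.qid for q in questions if q.show_if is None or q.show_if(answers)]


def score(question, raw):
    """Flip reverse-coded items so a higher score always points the same way."""
    if not LIKERT_MIN <= raw <= LIKERT_MAX:
        raise ValueError(f"{question.qid}: response {raw} is outside the scale")
    if question.reverse_coded:
        return LIKERT_MIN + LIKERT_MAX - raw
    return raw
```

Here a participant who answers "no" to q2 never sees q3, and a raw 2 on the reverse-coded q4 is scored as 4, so it can be compared directly with q1.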
Each 3-hour session was structured to capture both objective and subjective signals — ensuring qualitative feedback was always grounded in measurable usage conditions, and that the sequence of methods didn't prime participants toward particular responses.
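One common way to implement this kind of sequencing is to keep the objective block fixed at the start and rotate the order of the subjective scales across participants. The sketch below assumes that approach; the task and scale names are hypothetical, and the rotation scheme is an illustration rather than the protocol actually used.

```python
# Hypothetical block names, invented for this example only.
OBJECTIVE_BLOCK = ["fit_check", "motion_task"]
SUBJECTIVE_SCALES = ["comfort", "stability", "fit"]


def session_order(participant_index):
    """Objective measures first, then subjective scales in a rotated order.

    Rotating by participant index means each scale appears in each position
    equally often across every full cycle of participants.
    """
    shift = participant_index % len(SUBJECTIVE_SCALES)
    rotated = SUBJECTIVE_SCALES[shift:] + SUBJECTIVE_SCALES[:shift]
    return OBJECTIVE_BLOCK + rotated
```

The objective tasks always come first, so ratings are collected only after the measured usage they refer to, and no single subjective scale is always asked first.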
Survey instrument
Capture preference, attribution, and behavioral intent post-session
Question logic, scaling standards, bias controls, pilot iteration
Session protocol
Structure 3-hour participant sessions consistently across 400+ people
Objective measures before subjective scales; order controls
Synthesis framework
Translate mixed-methods data into executive narratives
Qualitative + quantitative integration; R for survey analysis
Running studies at this scale generates more data than spreadsheets can handle cleanly. I taught myself R on the job to analyze the survey responses properly — calculating failure rates across dimensions, summarizing Likert-scale distributions, and identifying patterns across participant groups. Writing the scripts myself meant I understood every decision in the analysis, rather than handing the data off and interpreting someone else's output.
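The calculations named above are simple to state. The analysis itself was done in R and those scripts are not shown here; the sketch below uses Python and made-up records purely to illustrate what a failure rate, a Likert distribution, and a cross-group comparison look like. All field names and values are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up records for illustration only; this is not study data.
RESPONSES = [
    {"group": "A", "fit_failed": True, "comfort": 2},
    {"group": "A", "fit_failed": False, "comfort": 4},
    {"group": "B", "fit_failed": False, "comfort": 5},
    {"group": "B", "fit_failed": False, "comfort": 4},
]


def failure_rate(rows, flag):
    """Share of rows in which the given failure flag is set."""
    if not rows:
        raise ValueError("no rows to summarize")
    return sum(1 for r in rows if r[flag]) / len(rows)


def likert_distribution(rows, item, scale=range(1, 6)):
    """Proportion of responses at each scale point, including empty points."""
    if not rows:
        raise ValueError("no rows to summarize")
    counts = Counter(r[item] for r in rows)
    return {point: counts.get(point, 0) / len(rows) for point in scale}


def by_group(rows, key):
    """Split rows into lists keyed by the value of one field."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)


group_rates = {
    name: failure_rate(rows, "fit_failed")
    for name, rows in by_group(RESPONSES, "group").items()
}
```

Reporting empty scale points as zero keeps the distributions comparable across groups even when no one in a group picked a given point.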
Alongside the quantitative analysis, I built lightweight LLM-assisted workflows to automate parts of the process that were previously manual — including qualitative response coding, survey summarization, participant image anonymization, and report generation. This freed time for the higher-order work: interpreting what the data meant and translating it into recommendations teams could actually use.
R — survey analysis
Failure rate calculation, Likert distribution summaries, and cross-group pattern analysis on 120+ question survey data across 400+ participants.
LLM — qualitative coding
LLM-assisted coding of open-ended qualitative responses, a step that was previously manual.
Python — image anonymization
Automated anonymization of participant images, a step that was previously manual.
LLM — report & deck generation
LLM-assisted survey summarization and generation of reports and decks, freeing time for interpretation.
The R scripts and automation workflows are documented in more detail as standalone mini projects — links to follow when published.
Research findings were synthesized into clear executive narratives and actionable design recommendations that directly influenced product roadmap decisions. Validation against the prior generation showed measurable improvements across all three primary failure modes.
Fit failures

10% → 6% (−40% relative)
Motion stability failures

5% → 1% (−80% relative)
Comfort failures

7% → 3% (−57% relative)
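The percentage changes above are relative reductions in the failure rate, not percentage-point differences. A short Python check of the arithmetic, using only the rates listed above (the function name is illustrative):

```python
def relative_change(before, after):
    """Percent change from `before` to `after`, relative to `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) * 100 / before


# Failure rates in percent, as listed above: (prior generation, next generation).
RATES = {"fit": (10, 6), "motion stability": (5, 1), "comfort": (7, 3)}

for name, (before, after) in RATES.items():
    change = relative_change(before, after)
    print(f"{name}: {before}% -> {after}% "
          f"({change:.0f}% relative, {after - before} points)")
```

Each failure mode dropped by 4 percentage points, and the relative reductions differ only because the starting rates differ.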
Findings don't move products on their own — they need to be translated into the language of the teams who make decisions. I partnered closely with product and engineering throughout, not just at the readout stage, to make sure user needs were reflected at every stage of the development cycle.
Product Management
Executive narratives and roadmap recommendations
Engineering
Quantified failure data translated into actionable design direction
Design Team
Qualitative feedback and comfort findings informing form decisions
Building the analysis tools myself
Writing my own R scripts meant I understood every decision in the data. The analysis wasn't a black box — I could defend every number in a readout because I'd built it.
Start automating earlier
The LLM workflows came mid-study. Building them before the first wave of data would have made the early synthesis much faster and let me iterate on the codebook sooner.
How much translation the job requires
The hardest part wasn't the research — it was turning dense mixed-methods data into something an engineer or PM could act on without losing the nuance. That skill took deliberate practice.
Applying the same framework to over-ear
The same study design was applied to AirPods Max — shifting the ergonomic focus to headband pressure and clamping force across head widths. Documented in the addendum below.
This addendum is an overview only. All product-specific details, findings, and materials are covered by NDA.
Following the in-ear AirPods study, I applied the same research framework to AirPods Max — shifting the ergonomic focus from canal morphology to head geometry, headband pressure distribution, and long-wear comfort. The core research questions remained structurally the same: where does fit fail, what motion dislodges it, and when does comfort degrade?
What changed. Over-ear introduces a different comfort curve and different failure modes. The session protocol and survey instrument were adapted to reflect the physical differences of the form factor — different motion classes, different pressure points, different wear-time thresholds.
What carried over. The survey architecture, bias controls, objective-before-subjective sequencing, R-based analysis approach, and cross-functional brief format all carried over directly from the in-ear program, which supports the generalizability of the methodology across form factors.