Case Study · Human Factors Research · 2025

Fitting an ear,
at scale.

Large-scale human factors research for the AirPods product line — designing the study, analyzing the data, and translating findings into design decisions that measurably improved the next generation.

Role: Human Factors Researcher

Company: Apple (via Sasken Inc.)

Location: Austin, TX

Timeline: Jan 2025 - Present

This case study covers my research process, methodology, and outcomes only. Product-specific details, internal materials, and findings are covered by NDA and are not included here.

01 / Overview

A research program built to turn complaints into specs.

I lead large-scale human factors studies for the AirPods product team, evaluating device fit, comfort, and motion stability under real-world usage conditions. With 400+ participants, each in a 3-hour session, the work sits at the intersection of physical ergonomics and behavioral research: translating large, complex datasets into actionable direction for hardware and design teams.

The goal was to move from anecdotal complaints about the prior generation to rigorous, quantified evidence that could directly drive spec changes. Findings were synthesized into executive narratives and design recommendations that influenced the product roadmap.

Participants: 400+ (real-world conditions)

Session length: 3 hrs (per participant)

Survey items: 120+ (per instrument)

Failure modes: Fit, motion, comfort

02 / Study design

A 120-question instrument designed to not lie.

Designing a rigorous survey at this scale means making a lot of decisions that most people never see. I built the 120+ question instrument around four constraints: question logic that skipped irrelevant items based on prior responses, calibrated scaling standards to prevent common response biases, explicit bias controls including reverse-coded items and balanced framing, and pilot testing before deployment to catch clarity and load issues.
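Two of the constraints above, skip logic and reverse-coded items, can be sketched in a few lines. This is a minimal illustration in Python, not the actual instrument: the 5-point scale, the `Question` structure, the `show_if` rule, and the three sample items are all assumptions invented for the example, since the real question wording is under NDA.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point Likert scale


def reverse_code(score: int, lo: int = SCALE_MIN, hi: int = SCALE_MAX) -> int:
    """Flip a reverse-worded item so a higher score always means the same thing."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside {lo}-{hi}")
    return lo + hi - score


@dataclass
class Question:
    qid: str
    text: str
    reverse_coded: bool = False
    # Skip rule: returns True if the item should be shown given the
    # answers so far. None means the item is always shown.
    show_if: Optional[Callable[[dict], bool]] = None


# Hypothetical items, written only for this sketch.
QUESTIONS = [
    Question("q1", "The device stayed in place while I moved."),
    Question("q2", "I had to readjust the device often.", reverse_coded=True),
    Question("q3", "Readjusting interrupted what I was doing.",
             reverse_coded=True,
             show_if=lambda a: a.get("q2", 0) >= 4),
]


def next_questions(questions, answers):
    """Unanswered questions that the skip logic still allows."""
    return [q for q in questions
            if q.qid not in answers
            and (q.show_if is None or q.show_if(answers))]


def normalize(questions, answers):
    """Reverse-code flagged items so all scores point in the same direction."""
    by_id = {q.qid: q for q in questions}
    return {qid: reverse_code(v) if by_id[qid].reverse_coded else v
            for qid, v in answers.items()}
```

In this sketch a participant who disagrees with q2 never sees q3, and an answer of 5 on the negatively worded q2 is stored as 1 after normalization, so it can be averaged alongside positively worded items.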

Each 3-hour session was structured to capture both objective and subjective signals — ensuring qualitative feedback was always grounded in measurable usage conditions, and that the sequence of methods didn't prime participants toward particular responses.

What I designed: Survey instrument
Purpose: Capture preference, attribution, and behavioral intent post-session
Key methodological choices: Question logic, scaling standards, bias controls, pilot iteration

What I designed: Session protocol
Purpose: Structure 3-hour participant sessions consistently across 400+ people
Key methodological choices: Objective measures before subjective scales; order controls

What I designed: Synthesis framework
Purpose: Translate mixed-methods data into executive narratives
Key methodological choices: Qualitative + quantitative integration; R for survey analysis

03 / Analysis

Learning R to listen to the data better.

Running studies at this scale generates more data than spreadsheets can handle cleanly. I taught myself R on the job to analyze the survey responses properly — calculating failure rates across dimensions, summarizing Likert-scale distributions, and identifying patterns across participant groups. Writing the scripts myself meant I understood every decision in the analysis, rather than handing the data off and interpreting someone else's output.
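The three calculations named above (failure rates, Likert distributions, and cross-group comparison) are simple to express in code. The sketch below uses Python rather than the R used in the study, and everything in it is assumed for illustration: the field names (`group`, `fit_fail`, `motion_fail`, `comfort`), the 5-point scale, and the four sample records, which are invented and are not study data.

```python
from collections import Counter, defaultdict

# Invented records for illustration only; not study data.
RESPONSES = [
    {"group": "A", "fit_fail": True,  "motion_fail": False, "comfort": 2},
    {"group": "A", "fit_fail": False, "motion_fail": False, "comfort": 4},
    {"group": "B", "fit_fail": False, "motion_fail": True,  "comfort": 5},
    {"group": "B", "fit_fail": False, "motion_fail": False, "comfort": 4},
]


def failure_rate(rows, flag):
    """Share of rows in which the given failure flag is set."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[flag]) / len(rows)


def likert_distribution(rows, item, scale=range(1, 6)):
    """Proportion of responses at each scale point, including unused points."""
    counts = Counter(r[item] for r in rows)
    n = len(rows)
    return {point: (counts.get(point, 0) / n if n else 0.0) for point in scale}


def by_group(rows, key="group"):
    """Split rows into lists keyed by a grouping field."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return dict(groups)


# Cross-group view: the same summaries, computed per participant group.
for name, rows in by_group(RESPONSES).items():
    print(name, failure_rate(rows, "fit_fail"), likert_distribution(rows, "comfort"))
```

Reporting unused scale points as explicit zeros keeps the distributions comparable across groups, which matters once groups are plotted side by side.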

Alongside the quantitative analysis, I built lightweight LLM-assisted workflows to automate parts of the process that were previously manual — including qualitative response coding, survey summarization, participant image anonymization, and report generation. This freed time for the higher-order work: interpreting what the data meant and translating it into recommendations teams could actually use.

R — survey analysis

Failure rate calculation, Likert distribution summaries, and cross-group pattern analysis on 120+ question survey data across 400+ participants.

LLM — qualitative coding

Coding of open-ended qualitative responses and survey summarization, both previously done by hand.

Python — image anonymization

Automated anonymization of participant images, a step that was previously manual.

LLM — report & deck generation

Automated generation of reports and decks from the analyzed data, a step that was previously manual.

The R scripts and automation workflows are documented in more detail as standalone mini projects — links to follow when published.

04 / Impact

Numbers that moved the spec.

Research findings were synthesized into clear executive narratives and actionable design recommendations that directly influenced product roadmap decisions. Validation against the prior generation showed measurable improvements across all three primary failure modes.

Failure incidence — prior generation vs. next-gen

Fit failures: 10% → 6% (−40% relative)

Motion stability failures: 5% → 1% (−80% relative)

Comfort failures: 7% → 3% (−57% relative)
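The percentage changes listed above are relative reductions (the drop as a share of the prior-generation rate), not percentage-point differences. A short Python check of that arithmetic, using only the incidence figures reported here; the function and variable names are illustrative:

```python
def relative_change(before: float, after: float) -> float:
    """Change from before to after, as a percentage of the baseline."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) * 100 / before


# Incidence figures as reported above: (prior generation, next-gen), in percent.
REPORTED = {"fit": (10, 6), "motion stability": (5, 1), "comfort": (7, 3)}

for mode, (before, after) in REPORTED.items():
    points = after - before
    rel = relative_change(before, after)
    print(f"{mode}: {before}% -> {after}% ({points:+d} points, {rel:+.0f}% relative)")

# fit: 10% -> 6% (-4 points, -40% relative)
# motion stability: 5% -> 1% (-4 points, -80% relative)
# comfort: 7% -> 3% (-4 points, -57% relative)
```

All three modes fell by the same four percentage points; the relative figures differ because the baselines differ.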

05 / Cross-functional work

Research only lands if the right people act on it.

Findings don't move products on their own — they need to be translated into the language of the teams who make decisions. I partnered closely with product and engineering throughout, not just at the readout stage, to make sure user needs were reflected at every stage of the development cycle.

How research connected to each function's work:

Product Management: Executive narratives and roadmap recommendations

Engineering: Quantified failure data translated into actionable design direction

Design Team: Qualitative feedback and comfort findings informing form decisions

06 / Reflection

What I would do differently.

What worked

Building the analysis tools myself

Writing my own R scripts meant I understood every decision in the data. The analysis wasn't a black box — I could defend every number in a readout because I'd built it.

What I'd change

Start automating earlier

The LLM workflows came mid-study. Building them before the first wave of data would have made the early synthesis much faster and let me iterate on the codebook sooner.

What surprised me

How much translation the job requires

The hardest part wasn't the research — it was turning dense mixed-methods data into something an engineer or PM could act on without losing the nuance. That skill took deliberate practice.

What's next

Applying the same framework to over-ear

The same study design has since been applied to AirPods Max, shifting the ergonomic focus to headband pressure and clamping force across head widths. Documented in the addendum below.

Addendum · Follow-on study

The same questions, a different form factor.

AirPods Max · Over-ear ergonomics · 2025

This addendum is an overview only. All product-specific details, findings, and materials are covered by NDA.

Following the in-ear AirPods study, I applied the same research framework to AirPods Max — shifting the ergonomic focus from canal morphology to head geometry, headband pressure distribution, and long-wear comfort. The core research questions remained structurally the same: where does fit fail, what motion dislodges it, and when does comfort degrade?

What changed. Over-ear introduces a different comfort curve and different failure modes. The session protocol and survey instrument were adapted to reflect the physical differences of the form factor: different motion classes, different pressure points, different wear-time thresholds.

What carried over. The survey architecture, bias controls, objective-before-subjective sequencing, R-based analysis approach, and cross-functional brief format were all carried directly from the in-ear program, which supports the generalizability of the methodology across form factors.
