Simon Rosen /ˈsaɪ.mən ˈɹoʊ.zən/

Senior ML Data Linguist building the evaluation architecture behind conversational AI — rubrics, LLM-as-judge pipelines, and the science of how a synthetic voice should sound.

I work at the boundary between language and machine learning — designing the rubric frameworks, judge pipelines, and ground-truth benchmarks that determine whether voice and text AI agents actually work.

As a linguist, I decompose fuzzy notions like “naturalness” and “helpfulness” into scorable dimensions, then build the human-calibration and measurement pipelines that hold models to them.

Senior ML Data Linguist
ServiceNow · Santa Clara, CA
Feb 2026 — Present
  • Leading the design and execution of a ground-truth evaluation pipeline for agentic conversations: constructing a human-labeled benchmark dataset and validating it against LLM-as-judge and automatic metrics to assess model performance on each quality dimension.
  • Defining evaluation strategy for multi-turn agentic workflows — incident creation, HR case management, tool-calling flows — across voice and text modalities in a cascade framework.
  • Owning rubric standards and structured scoring criteria for response quality, instruction adherence, grounding fidelity, and conversational coherence.
  • Directing annotator calibration through rubric walkthroughs, adjudication workflows, and targeted re-evaluation cycles to maintain inter-rater reliability on high-ambiguity tasks.
ML Data Linguist
ServiceNow · Santa Clara, CA
Jan 2023 — Feb 2026
  • Authored LLM-as-judge prompts with structured output schemas integrated into internal evaluation tooling; wrote and published a prompt design principles guide adopted as a team-wide reference.
  • Designed and maintained five rubric and evaluation schema standards used across annotation teams for multi-turn conversational AI assessment.
  • Oversaw synthetic data generation pipelines to augment evaluation coverage and stress-test model behavior on underrepresented scenarios.
  • Owned annotation guidelines and ran recurring calibration sessions with reviewers, establishing consistent scoring on edge cases across enterprise workflow domains.
  • Cataloged systematic failure patterns — hallucination, under-specification, intent misclassification — and partnered with ML engineers to drive targeted retraining, prompt refinement, and production model selection.
Under review at ACL 2026: “Beyond Naturalness: Probing Automated Text-to-Speech Evaluators on Linguistically Grounded Dimensions.” Designed the 10-dimension linguistically grounded schema and led the annotation of the 640-utterance dataset behind the first dimension-level meta-evaluation benchmark for TTS, used to audit MOS predictors and audio-LLM judges against human perception.

Evaluation & Metrics

  • LLM-as-judge prompt design
  • rubric architecture
  • multi-turn conversational evaluation
  • ground-truth dataset development
  • conversational failure taxonomy design
  • alignment assessment

Agent Systems

  • tool-calling evaluation
  • agentic workflow assessment
  • instruction adherence scoring
  • task completion analysis
  • multi-step agent audit

Speech & Language

  • prosody analysis
  • segmental & suprasegmental evaluation
  • TTS / ASR output assessment
  • pragmatics
  • discourse analysis

Operations & Collaboration

  • annotation guideline authoring
  • reviewer calibration protocols
  • inter-annotator agreement (Krippendorff’s α)
  • edge case adjudication
  • failure impact prioritization
  • cross-functional ML partnership

Technical

  • Python
  • structured prompt engineering
  • output schema design
  • YAML
  • Git
  • JSON schema design
2022
B.A. in Linguistics
Princeton University · Focus: Theoretical & Computational Linguistics
Thesis: Tone Sandhi and Constituency in Classifier Phrases Across Sinitic Languages
phonetics & phonology · syntax · semantics · pragmatics · Java programming · data structures & algorithms

Working on conversational AI evaluation, speech quality, or agent benchmarking? Let’s talk.