Job Description
Walmart, a global retail and technology leader, is seeking a Principal Data Scientist to join its Next Gen Commerce team in Sunnyvale, California. This high-impact leadership role focuses on building the future of conversational shopping through intelligent AI agents that reason, recommend, and proactively assist customers. You will serve as the technical authority for defining, measuring, and improving AI quality using advanced evaluation frameworks, LLM-as-a-judge systems, and automated pipelines.
The ideal candidate will bring deep expertise in Generative AI, large language models, and evaluation methodologies, along with strong hands-on technical leadership. You will collaborate closely with engineering and product teams to translate subjective quality objectives into measurable metrics that drive continuous model improvement and safe deployment at scale.
Key Responsibilities
Design and implement advanced evaluation architectures for conversational AI systems using hybrid scoring and LLM-as-a-judge frameworks
Develop high-precision prompts for evaluator models and calibrate them against human benchmarks to ensure reliability
Lead model distillation and optimization efforts to create scalable and cost-efficient evaluation models
Curate large-scale datasets and “Golden Set” benchmarks from conversational logs to standardize evaluation processes
Integrate quality metrics into CI/CD pipelines for automated regression testing and production monitoring
Conduct deep failure analysis on AI agents, including hallucinations, safety risks, and tool misuse
Leverage evaluation insights to influence modeling teams and prioritize system improvements
Mentor senior data scientists and establish best practices for AI evaluation across the organization
Contribute thought leadership through research publications, patents, or conference presentations
Required Qualifications
Advanced degree (Master’s or PhD) in Computer Science, Statistics, Mathematics, Computational Linguistics, or a related field
7+ years of experience in data science or machine learning with a focus on NLP, deep learning, or AI evaluation
Strong expertise in Large Language Models, prompt engineering, and instruction tuning
Proficiency in Python and core ML libraries such as PyTorch, NumPy, Pandas, and Scikit-learn
Experience designing evaluation metrics for non-deterministic AI outputs such as summarization or conversational responses
Knowledge of scalable data pipelines and distributed ML systems
Preferred Qualifications
PhD in Machine Learning, NLP, or a related quantitative discipline
Experience with conversational AI, retrieval-augmented generation (RAG), or recommendation systems in e-commerce environments
Knowledge of model distillation, LoRA, parameter-efficient tuning, or instruction optimization techniques
Publications, patents, or open-source contributions in AI or LLM evaluation
Familiarity with subjective evaluation frameworks for open-ended AI outputs
Compensation & Benefits
The position offers a competitive annual salary ranging from $143,000 to $286,000, along with performance bonuses, stock opportunities, and a comprehensive benefits package. Benefits include medical, dental, and vision coverage, retirement plans, paid time off, parental leave, disability coverage, employee discounts, and education assistance programs.