Job Description
Location: Sunnyvale, California, USA
Team: Next Gen Commerce
Overview
Walmart’s Next Gen Commerce team is building the future of conversational shopping through intelligent AI agents that can reason, recommend, and proactively assist customers.
As a Principal Data Scientist for Quality & LLM Judging Systems, you will serve as the technical leader responsible for defining and measuring the success of conversational AI systems. You will design advanced evaluation architectures that combine LLM-as-a-judge frameworks, human benchmarks, and automated pipelines to drive model improvement and safe deployment.
This is a high-impact, hands-on leadership role partnering with engineering and product teams to convert subjective quality goals into measurable, actionable metrics.
Key Responsibilities
Evaluation Architecture
- Design and implement state-of-the-art evaluation pipelines for conversational AI agents.
- Develop hybrid scoring frameworks combining automated metrics, LLM judges, and human evaluation.
Prompt Engineering & Calibration
- Create high-precision prompts for evaluator models.
- Validate outputs against human judgment to ensure reliability and consistency.
Model Optimization & Distillation
- Fine-tune smaller, cost-efficient models to act as scalable judge systems.
- Optimize trade-offs between accuracy, latency, and cost.
Dataset Development
- Curate “Golden Set” datasets from large-scale conversation logs.
- Define annotation guidelines and ground truth standards for subjective evaluation tasks.
Engineering Integration
- Integrate quality metrics into CI/CD pipelines for automated testing and monitoring.
- Collaborate with engineering teams on production deployment and scalability.
Failure Analysis & Continuous Improvement
- Analyze failure modes such as hallucinations, tool misuse, and safety violations.
- Build feedback loops that improve model performance across teams.
Strategic Leadership
- Mentor senior data scientists and standardize evaluation best practices.
- Influence modeling priorities across cross-functional teams through data-driven insights.
- Contribute to patents, publications, or conference presentations.
Minimum Qualifications
- Master’s or PhD in Computer Science, Statistics, Mathematics, Computational Linguistics, or a related field.
- 7+ years of experience in data science or machine learning with a focus on NLP or deep learning.
- Deep expertise in large language models (LLMs), prompt engineering, and instruction tuning.
- Strong Python skills with libraries such as NumPy, Pandas, PyTorch, and Scikit-learn.
- Experience designing evaluation metrics for non-deterministic AI outputs.
- Knowledge of scalable data pipelines and distributed ML systems.
Preferred Qualifications
- PhD in Machine Learning, NLP, or a related field.
- Experience with conversational AI, RAG systems, chatbots, or recommendation evaluation.
- Familiarity with LoRA, parameter-efficient fine-tuning, or model distillation techniques.
- Experience evaluating subjective or open-ended outputs.
- Publications, patents, or open-source contributions in AI/LLM evaluation.
Compensation & Benefits
- Salary Range: $143,000 – $286,000 annually
- Performance bonuses and stock opportunities.
- Comprehensive medical, dental, and vision insurance.
- 401(k) with company match.
- Paid time off, parental leave, and disability coverage.
- Education benefits through Walmart’s Live Better U program.
- Employee discounts and additional financial wellness benefits.
Why This Role Is Unique
This position sits at the intersection of:
- Generative AI research
- Production ML systems
- Conversational commerce innovation
- AI safety and evaluation science
You will directly shape how millions of customers interact with AI-powered shopping experiences.