(USA) Principal, Data Scientist | Conversational AI

Filled
February 24, 2026

Job Description

Location: Sunnyvale, California, USA
Team: Next Gen Commerce

Overview

Walmart’s Next Gen Commerce team is building the future of conversational shopping through intelligent AI agents that can reason, recommend, and proactively assist customers.

As a Principal Data Scientist for Quality & LLM Judging Systems, you will serve as the technical leader responsible for defining and measuring the success of conversational AI systems. You will design advanced evaluation architectures that combine LLM-as-a-judge frameworks, human benchmarks, and automated pipelines to drive model improvement and safe deployment.

This is a high-impact, hands-on leadership role partnering with engineering and product teams to convert subjective quality goals into measurable, actionable metrics.

Key Responsibilities

Evaluation Architecture

  • Design and implement state-of-the-art evaluation pipelines for conversational AI agents.
  • Develop hybrid scoring frameworks combining automated metrics, LLM judges, and human evaluation.

Prompt Engineering & Calibration

  • Create high-precision prompts for evaluator models.
  • Validate outputs against human judgment to ensure reliability and consistency.

Model Optimization & Distillation

  • Fine-tune smaller, cost-efficient models to act as scalable judge systems.
  • Optimize trade-offs between accuracy, latency, and cost.

Dataset Development

  • Curate “Golden Set” datasets from large-scale conversation logs.
  • Define annotation guidelines and ground truth standards for subjective evaluation tasks.

Engineering Integration

  • Integrate quality metrics into CI/CD pipelines for automated testing and monitoring.
  • Collaborate with engineering teams on production deployment and scalability.

Failure Analysis & Continuous Improvement

  • Analyze failure modes such as hallucinations, tool misuse, and safety violations.
  • Build feedback loops to improve model performance across teams.

Strategic Leadership

  • Mentor senior data scientists and standardize evaluation best practices.
  • Influence modeling priorities across cross-functional teams through data-driven insights.
  • Contribute to patents, publications, or conference presentations.

Minimum Qualifications

  • Master’s or PhD in Computer Science, Statistics, Mathematics, Computational Linguistics, or a related field.
  • 7+ years of experience in data science or machine learning with a focus on NLP or deep learning.
  • Deep expertise in large language models (LLMs), prompt engineering, and instruction tuning.
  • Strong Python skills with libraries such as NumPy, Pandas, PyTorch, and Scikit-learn.
  • Experience designing evaluation metrics for non-deterministic AI outputs.
  • Knowledge of scalable data pipelines and distributed ML systems.

Preferred Qualifications

  • PhD in Machine Learning, NLP, or a related field.
  • Experience with conversational AI, RAG systems, chatbots, or recommendation evaluation.
  • Familiarity with LoRA, parameter-efficient fine-tuning, or model distillation techniques.
  • Experience evaluating subjective or open-ended outputs.
  • Publications, patents, or open-source contributions in AI/LLM evaluation.

Compensation & Benefits

  • Salary Range: $143,000 – $286,000 annually
  • Performance bonuses and stock opportunities.
  • Comprehensive medical, dental, and vision insurance.
  • 401(k) with company match.
  • Paid time off, parental leave, and disability coverage.
  • Education benefits through Walmart’s Live Better U program.
  • Employee discounts and additional financial wellness benefits.

Why This Role Is Unique

This position sits at the intersection of:

  • Generative AI research
  • Production ML systems
  • Conversational commerce innovation
  • AI safety and evaluation science

You will directly shape how millions of customers interact with AI-powered shopping experiences.