Job Description
Company: Deepgram
Job Type: Full-time, Remote
Location: Flexible / Remote
Company Overview
Deepgram is a pioneering leader in the Voice AI ecosystem, providing APIs for speech-to-text (STT), text-to-speech (TTS), and production-grade voice agents. Trusted by over 1,300 organizations including Twilio, Cloudflare, and Jack in the Box, Deepgram’s technology has processed over 50,000 years of audio and transcribed more than 1 trillion words.
At Deepgram, an AI-first mindset is essential. Team members actively integrate and experiment with AI tools in their workflows, continuously pushing the boundaries of what voice technology can achieve.
The Opportunity
Voice is the most natural way humans interact with machines, yet current AI models are held back by:
- Scarcity of diverse real-world audio
- High dimensionality of audio data
- Computational and storage limitations at scale
Deepgram is creating Latent Space Models (LSMs) to address these challenges, enabling:
- Next-generation neural audio codecs for extreme compression with high fidelity (a brief quantization sketch follows this list).
- Steerable generative models for human-like, expressive speech synthesis.
- Embedding systems to disentangle speaker, content, style, and environment.
- Synthetic audio data generation at massive scale for training multimodal speech systems.
- Real-time inference on hardware at scale with efficiency and robustness.
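To give a concrete flavor of the codec and quantization work above (and of the VQ-VAE and Finite Scalar Quantization papers listed under "Foundational Papers Informing This Role" below), here is a minimal, illustrative sketch of finite scalar quantization in PyTorch. The function name, per-dimension level count, and tensor shapes are assumptions for illustration, not Deepgram's implementation.

```python
# Minimal sketch of finite scalar quantization (FSQ), one building block of
# discrete neural audio codecs. Illustrative only; all names and defaults
# here are assumptions, not Deepgram's code.
import torch


def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Quantize each latent dimension to `levels` discrete values.

    z: (..., d) continuous latents from an encoder.
    Returns a tensor of the same shape whose entries lie on an integer grid,
    with a straight-through estimator so gradients still reach the encoder.
    """
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half       # squash each dimension into (-half, half)
    quantized = torch.round(bounded)     # snap to one of `levels` integer values
    # Straight-through: the forward pass uses the quantized values,
    # the backward pass differentiates through `bounded`.
    return bounded + (quantized - bounded).detach()


# Example: 10 frames of 8-dimensional latents -> 5**8 possible codes per frame.
codes = fsq(torch.randn(10, 8))
```

The appeal of this family of quantizers is that the resulting discrete codes double as compact tokens for downstream generative and embedding models.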
Role & Responsibilities
As a Research Staff member, you will:
- Pioneer development of Latent Space Models (LSMs) for robust Voice AI.
- Design and implement neural audio codecs, generative models, and embedding systems.
- Generate large-scale synthetic datasets through “latent recombination.”
- Develop model architectures, training schemes, and inference algorithms for real-world deployment.
- Conduct controlled experiments to validate theoretical insights and prototypes.
- Collaborate with a team of researchers to solve complex audio and AI problems.
Ideal Candidate Profile
We are looking for researchers who:
- Treat “unsolved” problems as opportunities for innovation.
- Identify the critical experiments and iterate quickly.
- Scale proofs of concept 100x with vision and creativity.
- Obsess over leveraging AI to automate and amplify their impact.
Technical Qualifications:
- Strong foundation in statistical, self-supervised, and multimodal learning.
- Expertise in foundation model architectures and large-scale model training.
- Proven ability to bridge theory and practice in AI research.
- Experience building large-scale data pipelines and curating diverse datasets.
- Familiarity with hardware-aware optimizations for deployment.
- Track record of open-source contributions or published research in speech/language AI.
Key Skills & Competencies
- Statistical & mathematical foundations for model design.
- Algorithmic innovation and prototype implementation.
- Large-scale, data-driven system development.
- Hardware-aware optimization and real-time inference.
- Rigorous experimental design with robust evaluation metrics.
Foundational Papers Informing This Role
- Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)
- Moshi: Speech-Text Foundation Model for Real-Time Dialogue
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Scaling Laws for Neural Language Models
- BASE TTS: Lessons from a Billion-Parameter Text-to-Speech Model
- Neural Discrete Representation Learning (VQ-VAE)
- SoundStream: End-to-End Neural Audio Codec
- Finite Scalar Quantization: Simplifying VQ-VAE
- Phi-3 Technical Report: Local LLM on Your Phone
- Transformers are SSMs: Efficient Structured State-Space Models
Benefits & Perks
Health & Wellness: Medical, dental, vision, wellness stipend, mental health support
Work/Life Balance: Unlimited PTO, flexible schedule, home office stipend, 12 paid US holidays
Learning & Development: Conference participation, education stipends, AI enablement workshops
Financial Benefits: Life, short-term disability (STD), and long-term disability (LTD) insurance; 401(k) with company match