Job Description
Location: Remote (USA Time Zones)
Company: Grafana Labs – Open Source Observability & Metrics Platform
Grafana Labs is a remote-first, open-source powerhouse with over 20M users globally. Our dashboards are used everywhere—from NASA launches to Wimbledon—and our managed Grafana Cloud services help 3,000+ companies scale observability with Mimir (metrics), Loki (logs), and Tempo (traces).
We’re looking for a Senior Backend Engineer to join our Managed Services team in the Databases department. You’ll own and operate production-critical infrastructure powering Grafana Cloud’s next-generation database products, including 100+ WarpStream clusters across multiple cloud providers.
🔹 What You’ll Be Doing
- Operate and evolve multi-cloud streaming clusters and database infrastructure.
- Diagnose and resolve cross-layer failure modes (object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions).
- Design safe upgrade and rollout strategies at scale.
- Improve observability, automation, and operational ergonomics.
- Collaborate with database and platform teams on scaling, partitioning, fan-out, and query performance.
- Work directly with distributed systems, Kubernetes scheduling, storage engines, compression trade-offs, etc.
- Serve as primary escalation and on-call for incidents.
- Manage vendor relationships for critical systems.
- Influence team approaches to reliability, scaling, and operational excellence.
- Leverage modern AI coding assistants (GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro) for prototyping, refactoring, testing, and documentation.
🔹 What Makes You a Great Fit
- 6+ years engineering experience in SRE, platform, production, or distributed systems roles.
- Experience operating distributed systems in production (e.g., Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, Cassandra).
- Strong Kubernetes experience in AWS, GCP, or Azure; familiarity with Helm, Terraform, Jsonnet, or similar IaC tools.
- Solid understanding of distributed systems design and large-scale system trade-offs.
- Proficiency in at least one programming language (Go preferred).
- Working knowledge of Linux internals, networking, cloud storage, and scaling/performance.
- Experience with blameless incident response and high-quality post-incident reviews.
- Clear communicator who can work independently and across teams.
- Curious, pragmatic, action-oriented, and kind.
🔹 Compensation & Rewards
- Base Salary: USD 154,445 – 185,334 (USA)
- Additional Rewards: RSUs, bonus (if applicable), and full benefits.
🔹 Why You’ll Thrive at Grafana Labs
- 100% Remote, Global Culture – Collaborate with talented people worldwide.
- High-Impact Work – Influence Grafana Cloud’s critical infrastructure.
- Transparency & Autonomy – Open decision-making, supportive environment.
- Innovation & AI Productivity Tools – Modern AI tools included in workflow.
- Career Growth – Defined pathways and mentoring opportunities.
- Work-Life Balance – 30 days annual leave plus 3 Grafana Shutdown Days.
- Open Source Roots & Inclusive Culture – Built on community-driven values and diversity.