About The Position

Grafana Labs is a remote-first, open-source company with a global, collaborative culture. The Application Core Services (AppCore) team partners with Cloud, Enterprise, and Grafana teams to deliver reliable internal and customer-facing systems that power critical parts of the Grafana business. The team builds on the grafana.com platform to create custom solutions and integrations across the many systems that support a modern software company. It owns domain areas that keep customer workflows and internal business processes running smoothly, including maintaining the billing engine, automating provisioning, integrating with cloud marketplaces, and building the user portal.

This role sits at the intersection of product, platform, and business operations, building systems critical to Grafana's scale. Engineers are encouraged to solve complex workflow and systems problems, improve reliability and developer experience, and build software that supports customers and internal stakeholders. Grafana Labs embraces AI-assisted development: engineers are encouraged to use AI tools for faster prototyping, test generation, refactoring, documentation, and incident follow-ups, while maintaining strong code review and quality standards.

This is a remote opportunity for applicants located in Canadian time zones (EST and CST only).

Requirements

  • At least 1 year of fully remote work experience.
  • Experience working on a large SaaS platform and dealing with common distributed systems problems (e.g., scalability, multi-tenancy, data isolation, high availability).
  • Professional experience with Golang and willingness to work across both backend service and application code.
  • Deep care for developer and user experience and for the quality of the products you work on.
  • Some experience delivering projects end to end in a self-driven way, from gathering requirements and brainstorming ideas to shipping a product into customers’ hands.
  • Ability to write clean, robust, well-tested software that other engineers can understand, operate, and maintain.
  • Experience with mentoring junior engineers in a collaborative but asynchronous environment.
  • Ability to take on complex challenges and break them down into tight learning loops: analyze, design, and build modular solutions, deliver MVPs, gather data and feedback, and then iterate.
  • Willingness to work across teams, aligning plans with the needs of other squads and external stakeholders, making plans transparent, bringing stakeholders on board, and being open to feedback and suggestions.
  • Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Experience participating in blameless incident response and writing high-quality post-incident reviews.

Nice To Haves

  • Experience with TypeScript/Node.js.
  • Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems.
  • Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling.
  • Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems.
  • Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch.

Responsibilities

  • Design, build, and operate reconciliation systems, including the SSS backend, that track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
  • Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration.
  • Support new region and cluster rollouts, including the operational paths required to bring stacks online safely in new Grafana Cloud regions.
  • Improve incident response and recovery paths for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures.
  • Partner with Product, Hosted Grafana, Infrastructure, Support, and adjacent AppCore squads on customer-impacting stack lifecycle work.
  • Contribute to roadmap planning, technical design, OnCall improvements, and long-term simplification of stack operations.
  • Own the production behavior of the systems built, including improving runbooks, dashboards, alerts, reconciliation safety, rollout controls, and recovery procedures.
  • Debug across service boundaries and make careful changes in systems affecting customer stacks.
  • Participate in a follow-the-sun OnCall rotation, working closely with counterparts in other regions for balanced coverage and shared ownership.
  • Write efficient, readable, and easy-to-maintain code.
  • Design new microservices or systems.
  • Collaborate with teammates and other departments to reach consensus on proposed solutions.
  • Coordinate with product and UX when needed.
  • Respond to customer requests and feedback.
  • Participate in team decisions, such as roadmap planning and prioritization.

Benefits

  • Equity
  • Bonus (if applicable)
  • Restricted Stock Units (RSUs)
  • Global annual leave policy of 30 days per annum
  • 3 days of annual leave entitlement reserved for Grafana Shutdown Days
  • Access to modern AI coding assistants with a company-funded usage budget
  • Access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro)

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No Education Listed
  • Number of Employees: 501-1,000
