Senior Software Engineer II

Samsara

61d•Remote

About The Position

Samsara is hiring a Senior Software Engineer II to join our Operational Excellence (OPX) team within the Developer Experience (DevEx) organization. DevEx is responsible for the engineering environment that a globally distributed engineering org relies on every day, from build and deploy systems to development tooling and AI-assisted workflows that help teams move quickly and confidently.

Within DevEx, the OPX team keeps production healthy at scale. We provide engineering teams with the platform capabilities, observability tooling, automated safeguards, incident management tooling, and safe feature release systems they need to deliver highly available systems, ship features with confidence, and investigate and mitigate incidents faster. OPX is focused on raising the bar for system stability, resilience, and reliability across Samsara by building automated safeguards that protect production, ensuring engineers get the right signals at the right time, and giving teams the visibility they need to stay in control of their services. We're also investing in AI-driven operational tooling and partnering directly with product engineering teams to strengthen their operational posture.

This is a remote position open to candidates residing in the Eastern Time (ET) zone in the US. Relocation assistance will not be provided for this role.

Requirements

8+ years of experience designing and building products in a software engineering team.
Bachelor's Degree in Computer Science/Engineering or equivalent practical experience.
3+ years of experience in infrastructure and/or platform engineering-focused teams.
Expertise in observability, reliability, operational metrics, and data analysis.
Proven track record in architecting monitoring frameworks, SLO platforms, and automated response workflows using Datadog (or equivalent observability tooling such as New Relic or Grafana).
Proven experience working on large-scale enterprise software applications.
Experience in Developer Experience (DevEx) and internal portals: designing and implementing solutions and tools that centralize and simplify engineering operations.
Familiarity with cloud platforms (AWS, GCP, or the like).
Experience implementing AI-driven automation across the software development lifecycle (SDLC) to reduce developer friction, automate repetitive technical tasks, and accelerate time-to-delivery, along with routine use of AI tools across your own workflow.
Experience writing high-quality code (Go, Python, or equivalent) focused on infrastructure, deployment, and operations challenges.
Experience mentoring and supporting engineers and role modeling engineering practices in a technical lead capacity.
Proactive growth mindset, always looking at ways to improve the status quo.

Nice To Haves

Strong communication skills and a desire to collaborate across teams.
Experience with incident management tooling (Incident.io, PagerDuty, or equivalent).
Experience with Infrastructure as Code (IaC) using Terraform.

Responsibilities

Design and build automated reliability and self-healing systems that protect production at scale, including automated rollbacks, deploy safeguards, and fault mitigation, and deliver them as platform tooling that engineering teams across the company adopt for their own services.
Own and improve incident management tooling and on-call health. Reduce alert noise, surface actionable signals, and empower engineering teams to operate their services confidently with minimal operational burden.
Develop and evolve our observability infrastructure, including monitoring, alerting, SLOs, and performance regression detection, to give teams real-time, actionable visibility into system health and latency.
Contribute to AI-driven operational tooling that goes beyond triage, building toward autonomous remediation where AI detects issues, takes corrective action, and self-recovers with minimal human involvement.
Drive incident prevention by identifying systemic patterns and ruthlessly eliminating operational toil. You have deep empathy for on-call engineers and a bias toward making their lives better.
Partner directly with product engineering teams to diagnose reliability gaps, reduce their operational burden, and help them adopt best practices for running their services.
Define and champion operational excellence best practices across engineering through guardrails, scorecards, and standards that help teams run their services reliably by default.
Champion, role model, and embed Samsara’s cultural principles (Focus on Customer Success, Build for the Long Term, Adopt a Growth Mindset, Be Inclusive, Win as a Team) as we scale globally and across new offices.