Senior Site Reliability Engineer

Fieldguide • San Francisco, CA

Remote

About The Position

As a Senior Site Reliability Engineer (SRE) at Fieldguide, you will be responsible for ensuring the reliability, scalability, and observability of our production systems. You will apply software engineering principles to infrastructure and operations, designing systems that are resilient, highly available, and capable of scaling with rapid growth. You’ll work closely with product and platform engineering teams to define and implement reliability standards, improve system performance, and build robust observability practices. This role is central to maintaining trust in our systems: you will proactively identify risks, reduce toil through automation, and drive operational excellence.

Requirements

5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline.
Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred.
Hands-on experience building and managing observability platforms (e.g., Datadog, Prometheus, Grafana, CloudWatch).
Experience defining SLOs/SLIs and using them to drive engineering priorities.
Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
Deep understanding of system performance, reliability patterns, and distributed system failure modes.
Experience supporting production systems through on-call rotations and incident response.
Proficiency in at least one programming or scripting language used for automation and tooling.
Strong communication and collaboration skills, with the ability to work effectively across engineering and product teams.

Nice To Haves

Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks.
Experience with capacity planning and performance benchmarking at scale.
Familiarity with database performance tuning and observability across high-traffic systems.
Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks).
Experience applying chaos engineering practices to proactively test and strengthen system resilience.

Responsibilities

Design and operate highly scalable, fault-tolerant systems that support production workloads across a distributed cloud environment.
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to guide reliability decisions.
Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance.
Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning.
Automate operational processes to reduce manual toil and improve system consistency and resilience.
Partner with engineering teams to design systems with reliability and scalability built in from the start.
Participate in and improve incident response, on-call practices, and post-incident reviews, focusing on root-cause analysis and systemic improvements.
Drive continuous improvement of system resilience, including disaster recovery and chaos testing.
Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues.
Advocate for a reliability-focused engineering culture, including blameless postmortems and operational excellence.