Senior Site Reliability Engineer

Nemetschek

$129,400 - $177,850

About The Position

The Senior Site Reliability Engineer (SRE) ensures the reliability, performance and scalability of shared services that support multiple product teams. This role partners closely with engineering teams to design resilient cloud architectures, enhance observability and drive automation that improves system operations and incident response.

Requirements

7–10 years in SRE/Platform/Production Engineering or closely related roles at scale.
5+ years operating production workloads on AWS.
5+ years with Go and 3+ years with .NET in production (at least one as primary).
5+ years implementing/operating observability at scale (Dynatrace preferred) with PagerDuty alerting.
5+ years with Kubernetes (EKS) and Terraform in production.
5+ years designing/maintaining CI/CD (GitHub/GitLab) and progressive delivery.
Previous experience in application and software development.
Experience developing customer-facing interfaces.
Advanced AWS production operations across compute, networking, storage, databases, and the managed services listed under Responsibilities.
Expert observability with Dynatrace and Grafana: metrics/traces/logs, distributed tracing, dashboards, anomaly detection, alert tuning, SLI/SLO reporting; PagerDuty integration.
Strong Go and solid .NET experience for platform automation, services, and reliability tooling.
Kubernetes (EKS), serverless (Lambda), and legacy EC2 operations; traffic management and progressive delivery.
Terraform (medium–high maturity), multi-account patterns, and policy-as-code.
CI/CD with GitLab and GitHub; secure release practices and progressive deployment strategies.
Deep understanding of distributed systems, networking (HTTP, gRPC, DNS, TLS, load balancing, caching), and data pipelines.
Incident management excellence focused on MTTR reduction and preventive engineering.
Security-aware operations (IAM, WAF, Shield) and compliance-minded delivery (SOC2; early FedRAMP).

Responsibilities

Design and evolve reliable, secure, cost-efficient AWS architectures (EKS, Lambda, EC2, ALB/NLB, RDS/Aurora, DynamoDB, S3, CloudFront, MSK/Kinesis, OpenSearch) for shared services.
Mature observability: define/maintain SLIs/SLOs, standardize traces/metrics/logs, build dashboards, and implement proactive alerting in Dynatrace with PagerDuty.
Build reliability tooling and automation in Go and .NET; contribute performance improvements and operational runbooks with dev teams.
Implement and maintain Terraform for multi-account AWS, reusable modules, policy-as-code, and drift detection.
Improve GitHub/GitLab CI/CD pipelines, enabling blue/green and canary with automated quality gates and safe rollback.
Lead incident response: on-call (follow-the-sun; Central/Pacific focus), triage, RCA, blameless postmortems, and preventive fixes.
Drive capacity planning, performance testing, chaos/failure-mode reviews, and resilience patterns (timeouts, retries, circuit breaking).
Partner with security to apply IAM least privilege and WAF/Shield; support SOC2 controls and early-stage FedRAMP readiness.
Establish high-quality runbooks and operational standards in Confluence; mentor engineers on SRE practices and error-budget thinking.
Promote FinOps: right-sizing, autoscaling, lifecycle policies, and cost visibility.

Benefits

People-focused, entrepreneurial culture with the backing of a stable, global, corporate entity – Nemetschek.
Competitive compensation and benefits package.
100% paid medical premiums for employees, 80% paid for dependents.
Fully vested 401(k) from the day you start.
Generous PTO, including sick/mental health & volunteer days.
Free & unlimited access to BetterUp Care, a well-being platform.
Work-life balance fostered through a culture of diversity, inclusion, and appreciation of individual lifestyle needs.
Opportunity for continuous professional development.
Free & unlimited access to LinkedIn Learning.
Up to $5K annual education reimbursement (after one year of tenure).

What This Job Offers

Job Type

Full-time

Career Level

Senior

Number of Employees

1,001-5,000 employees