Principal Site Reliability Engineer

Nemetschek
91d
$153,800 - $211,400

About The Position

At Bluebeam, we empower people to advance the way the world is built. We create smart software solutions that make construction sites more efficient, connected, and safe, and that improve the lives of design and construction professionals everywhere. A Principal Site Reliability Engineer (SRE) drives architectural reliability, scalability, and operational excellence for shared services across our entire engineering landscape. This role sets technical vision, champions systemic reliability strategies, and leads transformative initiatives in partnership with multiple product, infrastructure, and security teams. The Principal SRE embodies thought leadership, mentors senior and staff engineers, and ensures our platforms are robust and future-ready.

Requirements

  • BS in Computer Science (or equivalent); MS/PhD preferred.
  • Deep expertise in AWS, Kubernetes, serverless, distributed databases, and related cloud-native technologies.
  • Demonstrated leadership in cross-team architecture, cloud governance, security, and compliance disciplines.
  • Expert in multi-layered observability (Dynatrace, Grafana, OpenTelemetry), alerting design, and real-time reliability analytics.
  • Proven track record mentoring senior engineers, leading technical communities, and establishing best practices at scale.
  • Outstanding written and verbal communication skills.
  • 10+ years of hands-on experience in SRE/Platform/Production Engineering or closely related roles at scale.
  • 7+ years with Go and 5+ years with .NET in production.
  • 7+ years implementing/operating observability at scale.
  • 7+ years with Kubernetes (EKS) and Terraform in production.
  • 7+ years designing/maintaining CI/CD (GitHub/GitLab) and progressive delivery.

Nice To Haves

  • Large-scale cloud migration, service mesh deployment, multi-account AWS governance, and FinOps practices.
  • Leadership in architecture standards, cross-team reliability roadmaps, and defining organizational SRE metrics.
  • Experience contributing to regulatory compliance, policy-as-code, and cloud security automation.

Responsibilities

  • Architect, govern, and continuously improve highly reliable, cost-effective, and secure AWS and hybrid-cloud environments.
  • Define and evolve global reliability strategy: lead high-impact initiatives on observability, service level objectives, automated operations, and incident management standards.
  • Develop and champion enterprise-wide SRE best practices, including error budget adoption, complex system reviews, and mission-critical runbooks.
  • Advance organization-wide observability, including telemetry ingestion pipelines, OpenTelemetry adoption, deep distributed tracing, anomaly detection, and actionable alerting.
  • Lead deep technical investigations, large-scale incident responses, and root-cause analysis for critical issues.
  • Mentor engineers and foster technical growth through workshops, architecture reviews, and targeted coaching.
  • Partner with product, platform, security, and DevOps leaders to influence roadmap, compliance, cloud governance, and early-stage preparations for certifications.
  • Spearhead infrastructure expansion, cloud migration, reliability automation, and service mesh architectures for multi-region, multi-account, and compliance-sensitive deployments.
  • Own strategic improvements to CI/CD, blue-green/canary deployments, and resilience patterns.

Benefits

  • 100% paid medical premiums for employees, 80% paid for dependents.
  • Fully vested 401(k) from the day you start.
  • Generous PTO, including sick/mental health & volunteer days.
  • Free & unlimited access to BetterUp Care, a well-being platform.
  • Opportunity for continuous professional development.
  • Free & unlimited access to LinkedIn Learning.
  • Up to $5K annual education reimbursement (after 1 year of tenure).