Staff SRE Engineer

News Corporation
Austin, TX
29d

About The Position

Recognized as the No. 1 site trusted by real estate professionals, Realtor.com has been at the forefront of online real estate for over 25 years, connecting buyers, sellers, and renters with trusted insights and expert guidance to find their perfect home. Through its robust suite of tools, Realtor.com makes a significant impact not only on the real estate industry at large but also on consumers navigating the biggest purchase of their lives, by providing a user experience that is easy to use, easy to understand, and, most of all, easy to make decisions with. Join us on our mission to empower more people to find their way home by breaking barriers to entry, making the right connections, and building confidence through expert guidance.

About the Role

We are seeking a Staff Site Reliability Engineer to join our newly formed Operations Excellence organization, reporting to the Director, Operations Excellence. This foundational role will shape the reliability, observability, and operational excellence of our platform infrastructure serving millions of users.

As a Staff SRE, you will be a technical leader and mentor who establishes best practices, drives architectural decisions, and enables our 600+ engineers to deliver exceptional customer experiences. You will work on critical platform systems including EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack, while establishing chaos engineering practices and driving cost optimization initiatives with measurable ROI.

Requirements

  • 8+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering with proven track record improving system reliability
  • Bachelor's degree or equivalent experience
  • 5+ years hands-on experience with AWS (EKS, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes including multi-cluster management
  • Strong programming skills (Python, Go, or Java) with infrastructure automation and Infrastructure as Code experience (Terraform, CloudFormation)
  • Production experience with observability tools (NewRelic, Datadog, Prometheus, Grafana, Splunk) and distributed systems architecture
  • Experience with CI/CD platforms and GitOps workflows (CircleCI, Argo CD, Jenkins); on-call rotation and high-severity incident response

Nice To Haves

  • Chaos engineering tools, API Gateway technologies (Tyk/Kong), GraphQL federation (Apollo), cost optimization initiatives with measurable ROI, FinOps principles

Responsibilities

  • Design and maintain highly available AWS infrastructure including EKS clusters, Fargate (ECS), and multi-region architectures
  • Own reliability of critical services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and supporting infrastructure
  • Establish SLIs, SLOs, and error budgets for Tier 1/2/3 systems; lead architectural reviews for reliability and cost-efficiency
  • Drive adoption of reliability patterns including circuit breakers, graceful degradation, and automated failover

Observability & Cost Optimization

  • Build comprehensive observability using NewRelic for APM, distributed tracing, metrics, and logging for rapid troubleshooting
  • Create actionable dashboards and alerts that reduce MTTD and MTTR; establish observability standards across teams
  • Analyze infrastructure spend and implement FinOps practices including rightsizing, reserved capacity, and resource lifecycle management
  • Drive cost-conscious architecture decisions and optimize CI/CD spend (CircleCI, Argo CD optimization)

Chaos Engineering & Incident Response

  • Design chaos engineering experiments to identify system weaknesses; build frameworks for safe production testing
  • Lead game day exercises and disaster recovery simulations; create runbooks and automation for resilience
  • Participate in on-call rotation for critical systems; lead post-incident reviews and drive systemic improvements
  • Mentor engineers on incident response, communication, and escalation; contribute to System Health Scorecard

Technical Leadership

  • Serve as technical leader and mentor for the growing Operations Excellence team; establish SRE principles and culture
  • Partner with Platform Engineering, Quality Engineering, and product teams on reliability initiatives
  • Support security initiatives including AWS Secrets Manager migration and compliance requirements (SOC 2, PCI, GDPR)
  • Contribute to Developer Experience metrics and platform adoption goals

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Industry: Broadcasting and Content Providers
Number of Employees: 5,001-10,000
