Principal Cloud Reliability Engineer

T. Rowe Price, Owings Mills, MD
Hybrid

About The Position

At T. Rowe Price, we identify and actively invest in opportunities to help people thrive in an evolving world. As a premier global asset management organization with more than 85 years of experience, we provide investment solutions and a broad range of equity, fixed income, and multi-asset capabilities to individuals, advisors, institutions, and retirement plan sponsors. We take an active, independent approach to investing, offering our dynamic perspective and meaningful partnership so our clients can feel more confident. We believe doing the right thing for our clients and our associates is good business.

With a career at the firm, you can expect opportunities to create real impact at work and in your community. You’ll enjoy resources to support your career path, as well as compensation, benefits, and flexibility to enrich your life. Here, you’ll find a collaborative culture that respects and values differences and colleagues who share a spirit of generosity. Join us for the opportunity to grow and make a difference in ways that matter to you.

Role Summary

Cloud Reliability operates as a center of excellence for cloud and reliability engineering, delivering secure, scalable, and resilient cloud platforms to teams across the firm. The Principal Cloud Reliability Engineer is a senior individual contributor responsible for defining, building, and operating enterprise cloud foundations, with a strong emphasis on reliability, observability, and operational excellence.

This role combines enterprise technical authority with hands‑on execution, ensuring that cloud platforms, particularly AWS Landing Zone (ALZ), enable teams to deliver end‑to‑end applications that meet the firm’s availability, resilience, and risk expectations. The Principal serves as a design authority, SRE leader, and escalation point for complex cloud and reliability challenges.

Requirements

  • Bachelor's degree, or an equivalent combination of education and relevant experience, and 10+ years of experience designing and operating cloud infrastructure with senior‑level impact.
  • Deep hands‑on experience with AWS.
  • Expert knowledge of cloud infrastructure, container platforms, and serverless deployments; reliability engineering concepts (availability, resilience, observability); operating systems, networking, and identity and access management; and infrastructure automation, CI/CD pipelines, and DevSecOps practices.
  • Proven ability to design, build, and operate enterprise‑scale, highly reliable cloud platforms.
  • Strong troubleshooting skills across infrastructure, platform, and reliability layers.
  • Experience mentoring engineers and influencing teams without formal line management.

Nice To Haves

  • Cloud or SRE‑related certifications.
  • Working knowledge of Azure.
  • Experience with SolarWinds DPA.

Responsibilities

  • Lead the design, architecture, and evolution of enterprise cloud platforms, including AWS Landing Zone (ALZ).
  • Ensure cloud platforms are designed for high availability, fault tolerance, and operational resilience.
  • Design and implement cloud solutions using AWS services such as EC2, S3, ECS, EKS, ELB, RDS, Route 53, Lambda, and API Gateway.
  • Design and automate advanced AWS networking solutions including VPCs, Transit Gateway, VPC Peering, PrivateLink, and Direct Connect.
  • Build and maintain Infrastructure as Code (IaC) using Terraform, CloudFormation, Ansible, and Git‑based workflows.
  • Be accountable for Site Reliability Engineering (SRE) outcomes across cloud platforms and drive adoption of SRE best practices.
  • Standardize reusable modules and patterns that promote consistent, reliable deployments at scale.
  • Define and enforce reliability standards, including availability targets, recovery expectations, and resilience patterns.
  • Ensure instrumentation, monitoring, logging, and alerting are embedded into platforms and services.
  • Act as an escalation point for complex incidents, driving root‑cause analysis and long-term remediation.
  • Design and implement guardrails that enable secure, reliable, self‑service cloud adoption.
  • Enable teams to own end‑to‑end services while meeting reliability and operational standards.
  • Participate in an Agile delivery model, contributing to stories and epics, tracking work in Jira, and supporting sprint releases.
  • Partner closely with Application Development, Development Services, Enterprise Architecture, and Enterprise Security teams.
  • Mentor engineers while promoting a culture of operational excellence and tech modernization to drive client value.
  • Apply a deep understanding of how reliability and availability affect business outcomes and client experience.
  • Balance delivery speed with risk, resilience, and operational sustainability.
  • Draw on experience operating in regulated, risk‑aware environments with strong security and compliance requirements.
  • Make decisions aligned with enterprise technology strategy while improving MTTR, reducing incidents, and strengthening platform stability.

Benefits

  • Competitive compensation
  • Annual bonus eligibility
  • A generous retirement plan
  • Hybrid work schedule
  • Health and wellness benefits, including online therapy
  • Paid time off for vacation, illness, medical appointments, and volunteering days
  • Family care resources, including fertility and adoption benefits