Manager, Site Reliability Engineering

FHLBank Chicago
Chicago, IL
$125,825 - $221,275
Hybrid

About The Position

We are building a new Site Reliability Engineering function and seeking a leader who can establish SRE practices across the organization while developing a team of engineers new to the discipline. This is a unique opportunity to shape how reliability engineering is practiced at FHLBank Chicago from the ground up. The SRE team operates as a guiding and consultative partner to application and development teams rather than owning systems directly. Success in this role requires technical credibility, strong influencing skills, and the ability to drive change through collaboration and education rather than direct authority.

This role is accountable for building and leading the SRE team, including hiring, performance management, coaching, and development of engineers transitioning into SRE practices. The manager establishes the SRE operating model (how SRE engages with application and development teams), ensures a sustainable on-call and learning culture, and drives adoption of reliability standards through collaboration and influence.

Requirements

  • 2+ years of experience in Site Reliability Engineering or a directly related primary function
  • 5+ years of experience in infrastructure, operations, or DevOps
  • 2+ years of people management experience, with demonstrated ability to develop and grow technical talent
  • Strong technical foundation in systems administration, networking, and infrastructure
  • Proficiency in at least one programming or scripting language (.NET preferred, Python also valuable)
  • Experience with monitoring, observability, and alerting tools and practices
  • Proficiency with agentic AI tools such as GitHub Copilot, Claude Code, or Codex
  • Demonstrated ability to influence outcomes without direct authority
  • Strong written and verbal communication skills, including ability to explain technical concepts to non-technical stakeholders
  • Experience conducting or leading incident response and postmortem processes
  • Outstanding communication (verbal, written, and listening) skills
  • Proven ability to consistently navigate crucial conversations
  • Critical thinking: using logic and reasoning to identify the strengths and weaknesses of alternative solutions, conclusions, or approaches to problems
  • Systems thinking: approaching situations with the understanding that they involve a complex web of interdependencies between components and other systems, many of which are unclear at first
  • Ability to present ideas in business-friendly and user-friendly language
  • Attention to detail
  • Comfort with high levels of ambiguity and shared responsibility
  • Pleasant demeanor and a good-natured, cooperative attitude
  • Experience with Agile methods and concepts
  • Knowledge of cloud computing principles, particularly as they apply to Amazon Web Services (AWS)

Nice To Haves

  • Experience implementing SRE practices in an organization new to the discipline
  • Background in financial services or other regulated industries
  • Experience defining and implementing SLIs, SLOs, and error budget policies
  • Experience building or transforming teams through organizational change
  • Knowledge of ITIL, DevOps, or related frameworks

Responsibilities

  • Build and develop a team of engineers transitioning from traditional operations and systems administration backgrounds into SRE practices
  • Create psychological safety that enables learning, experimentation, and honest discussion of failures
  • Establish career development paths and growth opportunities within the SRE discipline
  • Foster a culture of blameless postmortems and continuous improvement
  • Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget policies across critical services
  • Establish pager budgets and on-call practices that are sustainable and effective
  • Lead tuning and optimization of monitoring, alerting, and observability tooling
  • Drive reduction of system disruptions through automation, tooling, and process improvement
  • Develop and maintain incident management processes, including severity classification and escalation procedures
  • Participate in FHLBank’s Disaster Recovery process and testing, coordinating and executing regularly scheduled DR exercises
  • Participate in troubleshooting meetings and production incidents, providing expert guidance and recommendations
  • Partner with application owners, product owners, and development teams to improve system reliability
  • Ensure deep technical analysis is performed for significant reliability issues, provide escalation support as needed, and coach the team in translating findings into actionable recommendations
  • Advocate for reliability investments and help teams prioritize reliability work against feature development
  • Build relationships that enable SRE to influence architectural and operational decisions without direct ownership
  • Secure and maintain executive sponsorship and governance mechanisms required for SLO and error budget practices (including defined decision rights when reliability thresholds are breached)
  • Communicate the value and principles of SRE to leadership, helping secure sustained support and appropriate resource allocation
  • Develop metrics and reporting that demonstrate SRE impact on business outcomes
  • Navigate organizational dynamics to build credibility and trust for a new function
  • Align SRE practices with existing compliance, risk management, and regulatory requirements

Benefits

  • Highly competitive compensation and bonus package
  • Comprehensive benefits program
  • Retirement program (401(k) and pension)
  • Medical, dental and vision insurance
  • Lifestyle Spending Account
  • Competitive PTO plan
  • 11 paid holidays per year
  • Buddy Program
  • Professional development and training opportunities
  • Upskilling
  • Mentorship programs
  • Tuition reimbursement
  • Remote days