Software Engineering Manager-Site Reliability Center-Twilight

PNC•Pittsburgh, PA

1d•$100,100 - $204,490•Onsite

About The Position

At PNC, our people are our greatest differentiator and competitive advantage in the markets we serve. We are all united in delivering the best experience for our customers. We work together each day to foster an inclusive workplace culture where all of our employees feel respected and valued and have an opportunity to contribute to the company's success.
As a Software Engineering Manager for PNC's Site Reliability Engineering Center, you will work within PNC's Information Technology Group, manage the twilight shift, and be located at one of our IT Hubs: Cleveland, Ohio; Birmingham, Alabama; Pittsburgh, Pennsylvania; Dallas, Texas; Denver, Colorado; or Phoenix, Arizona.
The Site Reliability Center (SRC) is focused on establishing a culture of operational excellence by ensuring infrastructure, platforms, and applications adhere to SRC onboarding standards that improve reliability, enable proactive issue resolution, and reduce customer impact. This role supports the vision of building a collaborative technology organization across application, infrastructure, and security teams to deliver a stable, reliable, and secure environment. Key responsibilities include driving customer-centric service improvements, implementing proactive and preventative reliability practices, fostering cross-functional collaboration, enhancing monitoring and observability capabilities, promoting a blameless culture of continuous learning, and reducing operational toil through automation. The ideal candidate will help improve service performance, strengthen operational resiliency, and advance automation and observability initiatives that enhance the overall customer experience.
As a Software Engineering Manager, Site Reliability Engineering (SRE), you will lead a team responsible for ensuring the reliability, scalability, and operational excellence of mission-critical platforms that power PNC's digital experiences. This role blends technical leadership, hands-on problem solving, and people management, driving both production stability and continuous improvement across complex distributed systems.

Requirements

5+ years of related experience and 3+ years of management experience.
Strong experience in Site Reliability Engineering, Production Support, or DevOps.
Proven ability to lead teams in high-availability, enterprise environments.
Deep understanding of incident, problem, and change management frameworks.
Hands-on knowledge of monitoring tools, cloud/infrastructure platforms, and automation.
Experience improving system reliability, observability, and operational maturity.
Strong communication skills with the ability to lead during high-pressure situations.
Bachelor's degree.
Roles at this level typically require a university/college degree, with 5+ years of industry-relevant experience.
At least 3 years of prior management experience is typically required.
In lieu of a degree, a comparable combination of education, job specific certification(s), and experience (including military service) may be considered.

Nice To Haves

Infrastructure experience with OCP (alongside Linux/Windows), database experience with MongoDB and Cassandra (alongside Oracle and SQL), and working knowledge of Elasticsearch, Redis, MQ, and Kafka is a plus.

Responsibilities

Manage SRE and related teams; lead, coach, and develop a team of site reliability engineers; set clear goals, drive accountability, and foster a culture of ownership and excellence; partner with cross-functional stakeholders to align technology and business objectives; support talent development, performance management, and succession planning; encourage innovation, continuous learning, and DevOps/SRE best practices.
Lead incident management & remediation; manage and actively participate in end-to-end incident response for major (P1/P2) incidents; guide real-time triage, diagnostics, and troubleshooting across application, infrastructure, and network layers; ensure rapid execution of remediation actions and service restoration; provide clear, timely communication to stakeholders during incidents; oversee post-incident analysis, reporting, and documentation to drive improvements.
Provide technical leadership in production support; serve as an escalation point for complex production issues; guide troubleshooting across applications, infrastructure (Linux/Windows), databases (Oracle, SQL), middleware, and integrations; ensure efficient log, metric, and system analysis; oversee batch/ETL monitoring and recovery processes; foster strong collaboration across engineering, infrastructure, and vendor teams.
Drive problem management & root cause resolution; lead root cause analysis (RCA) efforts for major and recurring incidents; ensure ownership and resolution of problem records; drive permanent fixes and systemic improvements to eliminate repeat issues; identify trends and patterns to reduce risk and improve stability; partner with engineering teams to resolve code defects and system gaps; promote knowledge sharing via runbooks, knowledge articles, and error catalogs.
Oversee change management & release execution; ensure safe and compliant execution of production changes and releases; validate change readiness, testing, rollback strategies, and risk assessments; represent the team in CAB reviews, providing technical risk evaluation; oversee post-implementation reviews (CPIR) and ensure follow-through; drive improvements in change success rate and reductions in production defects.
Advance monitoring, alerting & observability; lead efforts to build and optimize monitoring, dashboards, and alerting frameworks; champion use of tools such as Dynatrace, BigPanda, Logscale, and enterprise platforms; improve signal-to-noise ratio through alert tuning; enable proactive issue detection before customer impact; strengthen event management and observability practices.
Champion resiliency, stability & availability; lead efforts to ensure high availability of critical systems; oversee disaster recovery, failover, and continuity testing; identify and eliminate single points of failure; drive improvements in MTTR, uptime, and service reliability.
Enable scalability & performance optimization; guide capacity planning and performance tuning strategies; ensure systems scale effectively under peak demand; partner with development teams for performance-driven design improvements; optimize system configurations to improve efficiency and throughput.
Lead a 24x7 production support model; manage team participation in a 24x7 on-call rotation; oversee engagement in incident bridges, war rooms, and escalations; support pod-based operating models aligned to key applications; ensure seamless handoffs and global support continuity.
Drive automation & operational efficiency; identify and prioritize opportunities to reduce manual effort through automation; implement automation across incident remediation, monitoring and alerting, and deployment and validation; promote standardized runbooks and automation frameworks; improve operational metrics and reduce toil.
Ensure governance, risk & compliance; maintain adherence to enterprise policies and regulatory standards; support audits, vulnerability remediation, and risk controls; ensure accurate documentation and operational procedures; champion security, access management, and data governance practices.

Benefits

medical/prescription drug coverage (with a Health Savings Account feature)
dental and vision options
employee and spouse/child life insurance
short- and long-term disability protection
401(k) with PNC match
pension and stock purchase plans
dependent care reimbursement account
back-up child/elder care
adoption, surrogacy, and doula reimbursement
educational assistance, including select programs fully paid
a robust wellness program with financial incentives
maternity and/or parental leave
up to 11 paid holidays each year
9 occasional absence days each year, unless otherwise required by law
between 15 and 25 vacation days each year, depending on career level and years of service