SOFTWARE ENGINEERING DIRECTOR I, Production Support Operations

Truist • Richmond, VA

Onsite

About The Position

The Director of Production Support leads teams responsible for ensuring the stability, resilience, and operational excellence of critical technology platforms supporting core lines of business. This role owns end-to-end production support operations while driving maturity toward engineering-first, site reliability–focused practices. The Director identifies and resolves complex technical, operational, risk, and organizational challenges, while building high-performing, accountable teams across onshore and offshore locations. This position carries full people management responsibility, including hiring, coaching, performance management, and disciplinary actions, and serves as a key partner to Technology, Risk, and Business leadership.

Requirements

Bachelor’s degree or an equivalent combination of advanced education and experience, which could include any combination of 8 years of experience in IT software engineering, 5 years of relevant business experience (i.e., making technology-related decisions on the business side), 5 years of project management experience, and at least 2 years of management experience.
Broad and in-depth knowledge of technology trends, competitive environment, regulatory requirements and trends, and IT strategies employed to continually meet the demands of clients and regulators.
Ability to translate enterprise-level strategic planning information into software and data management needs, create business plans, and turn them into effective business solutions.
Executive-level communication skills, including strong negotiation, facilitation, and presentation skills, and experience negotiating with vendors for relevant products and services.
Ability to lead projects of significant complexity and risk exposure, particularly with enterprise-wide implications.
Ability to exercise judgment in solving technical, operational, and organizational challenges in the context of complex business objectives and priorities.
Ability to lead and manage the performance of multiple teams against a set of financial and operational objectives.
Fluency in English (required).
Able to access and interpret client information on a computer and to hear and speak with individuals in person and on the phone.
Able to operate standard office equipment, including PC keyboard and mouse, copy/fax machines, and printers.
Able to work all hours scheduled, including overtime as directed by manager/supervisor and required by business need.

Nice To Haves

Understanding of multiple approaches to production support and software engineering delivery.
Full understanding of Agile methodology.
Experience leading teams in an Agile organization, particularly those practicing Site Reliability Engineering.
Experience using AI agents in day-to-day activities, particularly in regard to enabling software delivery and production support operations.
Banking or financial services experience.
Bachelor’s degree and twelve years of experience in software development and production support, including five years of management experience.

Responsibilities

Own end-to-end production support operations for multiple mission-critical applications supporting key lines of business, ensuring availability, stability, and performance meet defined SLAs and SLOs.
Provide accountable, visible leadership for 24x7 operational support, including on-call models, escalation paths, and incident response effectiveness.
Act as the senior escalation point for major incidents, ensuring swift recovery, accurate root cause analysis, and durable remediation.
Lead cross-functional incident recovery efforts in partnership with Incident Management, engineering teams, infrastructure, and business stakeholders.
Ensure timely root cause analysis (RCA), post-incident reviews, and corrective actions that prevent recurrence.
Establish and mature a production knowledge base, documenting known issues, recovery procedures, and architectural insights.
Drive adoption of Site Reliability Engineering (SRE) and lean engineering principles, including reduction of toil through automation; engineering-based reliability metrics (error budgets, SLIs/SLOs); and proactive resilience and failure-prevention practices.
Champion automation of repetitive and manual operational tasks, including incident detection, response, validation, and recovery where feasible.
Promote a culture of preventative engineering, partnering with development teams to improve system reliability upstream.
Implement and continuously improve real-time monitoring, alerting, and observability across applications and infrastructure.
Measure and optimize the effectiveness of monitoring and alerting to eliminate noise and reduce mean time to detect (MTTD) and mean time to recover (MTTR).
Leverage AI and advanced analytics to correlate telemetry data (logs, metrics, traces) and proactively identify emerging risks and root causes.
Champion the safe and responsible use of AI within production operations by adhering to enterprise guardrails and protecting sensitive data and system integrity.
Oversee operational readiness across releases, disaster recovery and failover testing, and certificate and dependency lifecycle management.
Ensure production support is actively embedded in change planning, minimizing risk from releases and infrastructure changes.
Lead one or more Agile teams (Scrum, Kanban), including onshore and offshore engineers, fostering high performance and accountability.
Manage workforce vendors and partners, setting expectations, reviewing performance, and ensuring delivery quality.
Own budget and staffing plan aligned to application criticality, operational risk, and business growth objectives.
Act as the first line of defense in production operations by proactively identifying and mitigating technology, operational, and resiliency risks.
Partner effectively with second-line Risk, Audit, and Regulatory teams, ensuring findings are addressed and controls are continuously improved.
Ensure compliance with internal policies, regulatory requirements, and external audit expectations.
Own and drive remediation plans for risk, audit, and regulatory findings, ensuring timely, effective, and sustainable resolution.
Lead responses to audit and regulatory inquiries, including providing evidence, clarifying controls, and appropriately challenging findings based on documented compliance.
Serve as a trusted advisor to senior Technology and Business leaders, communicating operational health, risk posture, and improvement roadmaps.
Lead or contribute significantly to large-scale initiatives, platform transformations, or regulatory-driven efforts.
Continuously assess organizational maturity and lead initiatives to improve reliability, efficiency, and talent capability.
Hold full people management accountability, including hiring and succession planning; coaching and performance management; compensation input and talent development; and disciplinary action and terminations as necessary.
Act as an Agile and DevOps champion, embedding production support within fast-moving delivery models.
Balance “keep-the-lights-on” operational excellence with continuous engineering improvement.
Drive measurable outcomes such as improved uptime, reduced incident volume, faster recovery, and improved customer experience.