Production Services Specialist II

Bank of America•Chandler, AZ

9d•Onsite

About The Position

This job is responsible for providing front-line support to end users, responding to issues related to incidents and problem management governance for multiple applications, and leading triage activities on all business-impacting incidents. Key responsibilities include ensuring compliance with incident management and problem management policies and procedures, serving as a focal point for the customer, client, and associate experience, restoring complex production incidents under tight Service Level Agreements, and pursuing root cause and problem resolution follow-ups.

Incident Leadership (Command & Control)
Lead major incident bridge calls and take command of triage activities
Own engagement strategy, ensuring the right teams are mobilized quickly
Direct troubleshooting efforts across multiple network domains
Make real-time decisions on escalation, prioritization, and recovery actions
Maintain clear control of incident flow, ensuring focused and efficient resolution

Technical Execution
Drive coordinated troubleshooting across technologies including routing, switching, firewalls, load balancing, and network security
Identify service impact and validate findings with technical teams
Anticipate failure scenarios and guide mitigation strategies

Communication & Business Alignment
Translate technical issues into clear business impact statements
Provide accurate, timely updates to stakeholders and leadership
Ensure consistency and clarity in all incident communications
Maintain alignment between technical actions and business priorities

Governance, Quality & Continuous Improvement
Ensure all incident records are complete, accurate, and meet enterprise standards
Enforce adherence to incident management processes and controls
Identify patterns, recurring issues, and systemic risks
Drive follow-ups that improve network stability and prevent repeat incidents
Maintain and enhance documentation, playbooks, and knowledge artifacts

The TRS operates at the center of incident response, leading high-severity network events where speed, clarity, and decisive leadership are essential. Acting as the single point of technical authority during incidents, this role directs cross-functional teams, determines escalation paths, and ensures all actions are aligned to business impact. This role requires the ability to lead under pressure, make decisions with incomplete data, and communicate clearly to both technical teams and senior leadership.

This role operates within a 24x7 follow-the-sun Global Network Operations environment and requires flexibility to support continuous technical and operational coverage. Work schedules and shift patterns will be aligned to regional business and operational needs and may include weekends and public holidays. The role is expected to provide technical leadership coverage during assigned shift hours, lead or support major network incident triage and escalation when needed, and partner closely with peer leaders across regions to ensure effective technical handoff, restoration continuity, and sustained service stability.

Requirements

Proven experience leading or directing major incident triage
Deep troubleshooting expertise across core network technologies (routing, switching, firewalls, load balancing, WAN/LAN)
Ability to interpret data from monitoring and logging platforms
Strong analytical thinking and structured problem-solving
Excellent verbal and written communication skills across technical and executive audiences
Ability to operate effectively in high-pressure, time-critical situations

Nice To Haves

Exposure to SDN and cloud networking (e.g., Cisco ACI, NSX, SD-WAN)
Experience with automation (Python, Ansible, APIs)
Familiarity with Agile tools and workflows (JIRA, Confluence)
Knowledge of configuration and network modeling tools (e.g., HPNA, Forward Networks)
Experience in financial services or other highly regulated environments

Responsibilities

Leads production support triage efforts, manages bridge line troubleshooting, engages in technical research, and escalates issues to leadership as needed
Ensures all impacts are accurately recorded and documented in the system of record, oversees that documents and wikis are updated and available for use during triage, and supports the documentation of application flows, upstream/downstream impacts during outages, the customer experience, and contacts for support needs
Identifies and/or validates business impacts through interpretation of monitors, dashboards, and logs to communicate with leadership and vendors
Manages activities to identify incident root cause, resolution, preventative actions, and change requests, and reports on incident data quality
Promotes and enforces production governance during triage/testing and identifies production failure scenarios, vulnerabilities, and opportunities for improvement
Serves as a subject matter expert for applications within a portfolio, leveraging extensive knowledge of application functionalities and application flows
Assesses and prioritizes research requests, ad hoc reports, and offline incidents at the direction of senior team members and delegates work as needed to team members and peers