Principal AI Site Reliability Engineer, EI Production Services

Fidelity • Westlake, TX

Hybrid

About The Position

The EI Production Services organization at Fidelity is seeking a strategic and proactive Principal AI Site Reliability Engineer (SRE). In this role, you will drive operational excellence, observability, and intelligent automation for mission-critical contact center applications supporting Wealth and Workplace Investing business units. You will lead efforts to reduce manual toil, enhance associate experience, and improve system reliability by leveraging AI-driven automation and industry best practices. This position requires a self-starter with strong communication skills, capable of identifying opportunities, leading cross-functional initiatives, and delivering measurable improvements in stability and efficiency.

Your work will transform the support model for critical contact center applications, reducing downtime and improving associate productivity. By driving AI-powered automation, observability, and proactive operations, you will enable faster triage, improved resiliency, and a superior experience for associates and customers.

Requirements

10+ years in technology operations, systems engineering, or production support leadership.
Proven ability to deliver complex improvement initiatives in large-scale, high-availability environments.
Deep expertise in IT Service Management (ITSM), incident/problem management, and operational process optimization.
Advanced knowledge of observability and monitoring tools (OpenTelemetry, Splunk, Datadog, Prometheus, Grafana).
Experience leveraging AI and automation to drive efficiency and reliability.
Proficiency in scripting and automation (Python, Bash, PowerShell, or similar).
Strong understanding of on-premises and public cloud (AWS/Azure/GCP) environments.
Familiarity with networking, load balancing, and security fundamentals.
Agile and DevOps mindset with experience in CI/CD and operational automation.
Exceptional communication, collaboration, and stakeholder management skills.
Data-driven approach to problem-solving and progress tracking.
Leadership excellence: ability to inspire, mentor, and guide teams toward operational excellence.

Nice To Haves

Certifications: ITIL, AWS, or SRE-related credentials.

Responsibilities

Lead initiatives to advance observability, automation, and operational efficiency for critical associate-facing applications.
Drive proactive monitoring and AI-powered telemetry to minimize reactive incident response and accelerate resolution.
Collaborate with engineering and business leaders to prioritize and resolve issues impacting associate experience.
Implement automation and self-service capabilities to reduce manual intervention and improve reliability.
Establish and track SLIs/SLOs to measure and optimize system performance.
Communicate progress, outcomes, and technical concepts clearly to senior leadership and stakeholders.
