Digital - Principal SRE (AI Engineer)

Huntington National Bank•Columbus, OH

49d•Hybrid

About The Position

The Digital - Principal SRE (AI Engineer) role blends expertise in artificial intelligence, machine learning, and reliability engineering. This professional is responsible for designing, deploying, and maintaining AI-driven solutions while ensuring the reliability, scalability, and performance of digital platforms and services. The ideal candidate will work closely with Digital SRE engineers, data scientists, DevOps, and operations teams to deliver robust, efficient, and automated systems that support business goals.

Job Description Summary: The IS Technical Specialist provides technical and consultative support on the most complex technical matters.

This role typically reports to the Head of Digital SRE and may involve on-call responsibilities. The position provides opportunities to work on cutting-edge AI solutions, collaborate with cross-segment teams, and drive reliability for mission-critical digital services.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field.
5+ years of experience in AI/ML engineering, SRE, DevOps, or related roles.
5+ years of programming experience in Python, Java, or similar languages, including developing and deploying machine learning models.
5+ years of hands-on experience with cloud platforms (e.g., AWS, GCP) and containerization technologies (Docker, Kubernetes).
Familiarity with observability tools (Prometheus, Grafana, ELK stack) and incident management platforms such as ServiceNow.
Solid understanding of SRE principles: monitoring, alerting, SLOs, error budgets, and automation.
5+ years of experience with infrastructure-as-code (Terraform, Ansible) and CI/CD pipelines.

Nice To Haves

Excellent problem-solving skills, attention to detail, and ability to work in a fast-paced, collaborative environment.
Strong communication and documentation abilities.
Experience operationalizing large language models (LLMs) or generative AI systems in production settings.
Background in MLOps, data engineering, and/or cloud-native AI deployment.
Knowledge of security best practices for AI and cloud infrastructure.
Contributions to open-source AI/SRE projects or relevant technical communities.

Responsibilities

Design, develop, and implement AI-driven systems and automation tools to enhance the reliability and efficiency of digital platforms.
Monitor the health, availability, and performance of AI-enabled applications and infrastructure using SRE best practices.
Collaborate with cross-functional teams to integrate machine learning models into production environments, ensuring seamless deployment and operation.
Establish and enforce service-level objectives (SLOs), error budgets, and incident response procedures for AI-driven services.
Identify, troubleshoot, and resolve complex incidents related to AI systems, leveraging observability and monitoring tools.
Drive continuous improvement by analyzing post-incident reviews, automating manual tasks, and optimizing system performance.
Stay up to date with advancements in AI, SRE, and cloud technologies, recommending innovative solutions to enhance digital reliability.
Document processes and runbooks for operational transparency and knowledge sharing.
Develop abstraction layers across AI providers (Google, OpenAI, etc.) to enable seamless integration.
Conduct design workshops, POCs, and code-with sessions to shape data-driven agent workflows with stakeholders, fostering trust and adoption.
Define and use key metrics, test harnesses, and evaluation plans to measure agent accuracy, latency, safety, and cost-effectiveness.
Craft reusable patterns, documentation, and best practices to influence internal assets and client roadmaps.