AI Support Engineer (II+)

Global Payments Inc., Corpus Christi, TX

About The Position

We are looking for a detail-oriented and technically strong AI Support Engineer to join our AI Operations team. In this critical role, you will monitor, diagnose, and resolve production incidents across our AI solutions. You’ll work closely with AI engineering, platform, and governance teams to ensure the stability, reliability, and performance of deployed models and agentic solutions across the enterprise. You will join a dynamic team passionate about learning, applying cutting-edge and cost-effective technologies, and innovating to deliver high-quality AI solutions.

Requirements

  • 4+ years of experience in production support, software engineering, site reliability engineering (SRE), or DevOps, preferably supporting GenAI and/or ML systems.
  • Strong understanding of cloud infrastructure (AWS, GCP) and AI observability tools (e.g., Fiddler AI, Arize AI, IBM watsonx.governance).
  • Experience with LLM and GenAI systems (OpenAI, Azure OpenAI, Bedrock, Vertex AI, or similar).
  • Familiarity with modern orchestration and agentic frameworks such as LangChain, LangGraph, AutoGen, or CrewAI.
  • Proficiency in Python or shell scripting for automation and troubleshooting.
  • Strong analytical, communication, and incident management skills.
  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 1+ years of experience in AI/ML engineering, with a focus on Generative AI.
  • Proficiency in programming languages such as Python.
  • Strong understanding of Generative AI models (e.g., GPT, Transformer architectures) and experience in distilling, tuning and training them.
  • Familiarity with Retrieval Augmented Generation (RAG) techniques and their implementation.
  • Experience with agentic AI concepts and developing autonomous AI workflows.
  • Hands-on experience with GCP Vertex AI, AWS Bedrock and SageMaker, and Snowflake Cortex platforms and their AI/ML capabilities.
  • Experience building production-grade AI/ML systems at scale.
  • Knowledge of MLOps practices, including model deployment and lifecycle management.
  • Excellent problem-solving and analytical skills.
  • Excellent communication and collaboration skills.

Nice To Haves

  • Familiarity with Prompt Engineering, RLHF, and model evaluation techniques.
  • Understanding of AI governance, safety, and responsible AI principles.
  • Understanding of reinforcement learning and its application in agentic AI.
  • Familiarity with big data technologies (Apache Spark, Kafka).
  • Experience with CI/CD tools and automation for AI/ML workflows.
  • Experience with real-time data processing and streaming analytics.

Responsibilities

  • Serve as the first line of defense for production AI incidents, ensuring rapid triage, root cause analysis, and resolution.
  • Monitor system health and performance of deployed AI applications, agentic and RAG-based solutions, MCPs, and orchestration platforms.
  • Track and investigate issues related to latency, failures, model drift, hallucination, prompt misbehavior, or broken integrations, escalating to the AI engineering group where appropriate.
  • Collaborate with AI and platform engineers to implement observability, logging, and alerting best practices for all AI services.
  • Build diagnostic tools, runbooks, and automated workflows to improve incident response time and reduce manual intervention.
  • Maintain knowledge bases and playbooks for repeatable troubleshooting and knowledge transfer.
  • Partner with governance and compliance teams to ensure incidents are documented and remediated in line with internal policy.
  • Contribute to postmortems and continuous improvement efforts to harden production systems.
  • Participate in the on-call rotation and provide support.