ML Ops Engineer

Southern Company, Atlanta, GA
1d

About The Position

The ML Ops Engineer will design and operate the production backbone for Southern Company’s AI Hub, ensuring AI and machine learning systems are deployed, monitored, and governed at scale. This role drives the enterprise-wide MLOps framework—establishing standards, lifecycle governance, and observability—while delivering secure, resilient production services and reusable AI products that accelerate innovation across operating companies. Success requires balancing rapid iteration with the reliability, safety, and compliance expected of a critical infrastructure enterprise.

Requirements

  • Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or related field.
  • Proven experience (5+ years) in cloud engineering or DevOps, with 2+ years in MLOps, AI infrastructure, data engineering, ML engineering, or a similar role.
  • Experience operating machine learning and AI systems in regulated or mission-critical environments.
  • Strong understanding of ML lifecycle management, including experimentation, validation, deployment, monitoring, and retirement.
  • Familiarity with agentic AI runtime patterns, including orchestration, tool execution, and human-in-the-loop controls.
  • Knowledge of enterprise AI governance, observability, and maturity models.
  • Operational mindset with strong ownership and bias toward reliability and automation.
  • Ability to troubleshoot complex, distributed AI systems under production constraints.
  • Clear communicator who can translate operational risks into actionable improvements.
  • Continuous improvement orientation, balancing speed, safety, and cost.
  • Hands-on expertise with CI/CD and MLOps tooling (e.g., GitHub Actions, Azure DevOps, Terraform).
  • Experience deploying and operating LLMs, agents, and inference services using containers and orchestration platforms (e.g., Kubernetes).
  • Proficiency in observability stacks for AI systems (logging, tracing, metrics, evaluation pipelines).
  • Strong grounding in cloud security and identity, including secrets management, network isolation, and least-privilege access.
  • Experience with enterprise model registries, feature stores, vector databases, and automated testing for AI workflows.
  • Deep expertise in Python, and experience with machine learning frameworks and libraries such as PyTorch or scikit-learn.
  • Experience with ML lifecycle tools like MLflow.
  • Experience with cloud computing services (Azure and GCP preferred) and their machine learning tools.

Nice To Haves

  • Relevant certifications in AI, ML, or data engineering.
  • Experience in the energy sector is a plus.
  • Experience in a multi-cloud environment is a plus.
  • Experience designing reusable AI products, agents, and services in a multi-business environment.

Responsibilities

  • Operationalize AI and agentic systems: build and maintain CI/CD pipelines for models, prompts, tools, and multi-agent workflows, enabling consistent promotion from experimentation to production.
  • Implement AI observability and reliability: establish monitoring for agent behavior, model performance, drift, cost, and safety outcomes using logs, traces, metrics, and evaluators.
  • Enforce governance through automation: embed guardrails, approvals, and policy-as-code into deployment pipelines, enabling compliant AI delivery without manual bottlenecks.
  • Manage model and agent lifecycle: own versioning, rollout strategies (canary, shadow, rollback), and decommissioning for models, agents, and supporting tools.
  • Ensure platform resilience and scalability: design runtime patterns that meet availability, latency, and fail-safe requirements, including degraded-mode and read-only behaviors for sensitive use cases.
  • Support multi-vendor and multi-cloud execution: enable portable deployments across hyperscalers and model providers, minimizing lock-in while maintaining consistent operational controls.
  • Partner with engineering and data teams: work closely with AI Architects, data engineers, and product squads to resolve production issues and continuously improve developer experience.