Manager AI & ML Engineering

ERCOT, Taylor, TX

About The Position

At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market utilizing the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today’s and tomorrow’s energy challenges while learning new skills and growing your career.

ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.

Job Summary

Leads the team responsible for developing, deploying, and operating machine learning models, generative AI applications, autonomous agents, and related AI solutions across ERCOT’s enterprise platforms. Oversees MLOps standards, production support, platform reliability, and governance for ML and GenAI assets. Balances delivery of new AI capabilities with operational excellence and ensures compliance with AI governance and model lifecycle controls. Partners closely with Data Operations, Data Engineering, Governance, Security, and business stakeholders to ensure safe, reliable, and efficient AI systems.

Job Duties

Responsible for hiring, coaching, training, and performance management of staff. Frequently interacts with reporting supervisors, customers, and/or functional peer group managers, normally involving matters between functional areas or customers. Responsible for the management of subordinate staff within a department. Typically has individual contributors as direct reports, but could have supervisory direct reports. Has full responsibility for direct reports. Generally provides input to budgeting and financial decisions that impact the department. Requests approval for financial actions beyond a limited scope.

Requirements

  • 8+ years in ML operations, MLOps engineering, AI/ML development, data engineering, or software engineering with an ML/AI focus
  • 2+ years leading ML operations teams or technical teams in ML/AI environments
  • Demonstrated experience with enterprise-scale ML deployment, operations, and GenAI application development
  • Bachelor’s Degree: Computer Science, Data Science, or related field (Required)

Nice To Haves

  • Experience building and deploying ML/GenAI solutions using platforms like Azure ML, Azure AI, Databricks ML, and Azure OpenAI.
  • Strong background in LLMs, RAG/semantic search, and AI agent or multi‑agent architectures.
  • Proven MLOps expertise, including CI/CD for ML, model serving, monitoring, and production support.
  • Leadership experience guiding technical teams and aligning engineering work with business and governance needs.
  • Proficiency in Python and modern data/AI engineering practices, with familiarity in cloud infrastructure, vector databases, or AI observability tools.
  • Master’s Degree: Computer Science, Data Science, or related field (Preferred)

Responsibilities

  • Responsible for hiring, coaching, training, and performance management of staff.
  • Responsible for the management of subordinate staff within a department.
  • Oversee end‑to‑end delivery of AI/ML and GenAI solutions, from design through deployment, ensuring enterprise‑ready quality, reliability, and security.
  • Set technical direction and architectural standards for ML models, GenAI applications, autonomous agents, RAG systems, multimodal solutions, and vector/semantic search capabilities.
  • Own and govern MLOps standards, including CI/CD automation, deployment pipelines, monitoring, evaluation frameworks, and model lifecycle controls for both ML and GenAI assets.
  • Lead and develop the AI & ML Engineering team, including hiring, onboarding, coaching, performance management, and establishing clear skill ladders and growth pathways.
  • Manage production ML/GenAI operations and Level 3 support, leading root‑cause investigations, incident command, post‑incident reviews, and long‑term problem management.
  • Ensure compliance with ERCOT model governance and GenAI‑specific controls, including risk tiering, documentation, lineage, prompt management, safety guardrails, and regulatory requirements.
  • Guide platform engineering for AI/ML infrastructure, including Azure ML, Databricks ML, vector databases, LLM orchestration frameworks, and ML/GenAI observability tooling.
  • Plan and prioritize intake, releases, and roadmaps for ML and GenAI initiatives in partnership with Product Owners and Data Operations leadership.
  • Oversee vendor and contractor contributions to ensure quality, maintain architectural integrity, and achieve knowledge transfer into ERCOT’s internal teams.
  • Collaborate across Data Engineering, Architecture, Governance, Security, and business stakeholders to align AI/ML solutions with enterprise needs and regulatory responsibilities.
  • Review and approve high‑risk deployments and exceptions, ensuring compensating controls are in place for ML and GenAI systems.
  • Establish and track performance, reliability, and cost metrics for ML infrastructure, LLM usage, GenAI applications, and overall MLOps health.
  • Communicate operational status, risks, and trade‑offs to executive stakeholders and technical partners with clarity and accountability.

Benefits

  • ERCOT offers an excellent benefits package, which includes health, dental, vision, life insurance, long/short-term disability insurance, long-term care insurance, Section 125 Flexible Spending Account, and a Retirement Savings Plan.
  • Additionally, 401(k) plans are available to help employees plan for the future.