Engineer, AI/ML Operations

Comcast•Philadelphia, PA

1d•Onsite

About The Position

Make your mark at Comcast, a Fortune 30 global media and technology company. From the connectivity and platforms we provide to the content and experiences we create, we reach hundreds of millions of customers, viewers, and guests worldwide. Become part of our award-winning technology team that turns big ideas into cutting-edge products, platforms, and solutions that our customers love. We create space to innovate, and we recognize, reward, and invest in your ideas, while ensuring you can proudly bring your authentic self to the workplace. Join us. You'll do the best work of your career right here at Comcast.

Xfinity Data Platform (XDP) is an intelligence warehousing platform that provides centralized information and control capabilities over subscribers' local area networks, and that acquires, stores, and aggregates data related to end-user devices connected to Comcast infrastructure. In addition, its modular intelligence logic allows granular visibility into and control of subscribers' networks, providing device-centric value-added capabilities: creating and managing content access control policies, creating and managing performance policies, and issuing presence notifications to the Comcast data ecosystem.

Requirements

AI Ops
Automation
AWS Elastic Kubernetes Service (EKS)
Data Optimization
Machine Learning Operations
Platform Operations
Troubleshooting
Bachelor's Degree (or some combination of coursework and experience, or extensive related professional experience)
5-7 Years Relevant Work Experience

Responsibilities

Maintain and evolve the XDP platform, which underpins key services across Connected Living.
Support new infrastructure and operational requirements for Archetype, MLOps, and AIOps initiatives.
Ensure the platform remains reliable, secure, and scalable during a period of rapid expansion.
Provide deep specialization in Kubernetes operations and performance tuning.
Provide deep specialization in cluster stability and security.
Provide deep specialization in autoscaling, networking, and observability for mission-critical components.
Manage deployment pipelines across the full ML lifecycle.
Handle model monitoring, retraining, and governance.
Operationalize AIOps patterns that reduce toil and elevate automation maturity.
Operate and troubleshoot critical distributed systems, including EKS, Spark, Kafka, Redis, and ZooKeeper.
Operate and troubleshoot multi-layered data and compute pipelines that support real-time and batch workloads.
Operate and troubleshoot high-availability and disaster-recovery architecture.
Build high‑efficiency automation frameworks.
Reduce manual operational work using Python or Golang.
Accelerate engineering velocity with AI‑assisted tooling such as GitHub Copilot.
Manage Dev, Staging, and Production workloads across multiple regions globally.
Ensure environment parity.
Handle compliance and regional configuration differences.
Support 24/7 global infrastructure.
Reduce compute and storage costs.
Eliminate waste in Kubernetes clusters.
Optimize data pipelines and observability tooling.

Benefits

We believe that benefits should connect you to the support you need when it matters most, and should help you care for those who matter most. That's why we provide an array of options, expert guidance and always-on tools that are personalized to meet the needs of your reality—to help support you physically, financially and emotionally through the big milestones and in your everyday life. Please visit the benefits summary on our careers site for more details.
