Principal Platform Engineer, Kubernetes

Comcast • Willistown Township, PA

About The Position

Make your mark at Comcast -- a Fortune 30 global media and technology company. From the connectivity and platforms we provide, to the content and experiences we create, we reach hundreds of millions of customers, viewers, and guests worldwide. Become part of our award-winning technology team that turns big ideas into cutting-edge products, platforms, and solutions that our customers love. We create space to innovate, and we recognize, reward, and invest in your ideas, while ensuring you can proudly bring your authentic self to the workplace. Join us. You’ll do the best work of your career right here at Comcast.

(In most cases, Comcast prefers to have employees on-site collaborating unless the team has been designated as virtual due to the nature of their work. If a position is listed with both office locations and virtual offerings, Comcast may be willing to consider candidates who live greater than 100 miles from the office for the remote option.)

Job Summary

As a Principal Platform Engineer, you will be responsible for designing, implementing, and maintaining our Kubernetes infrastructure. You will work closely with development teams to ensure the platform is reliable, scalable, and efficient. Your expertise will drive our cloud-native strategy and enhance our ability to deploy and manage containerized applications.

Job Description

Requirements

Bachelor's degree in computer science or a related field, or equivalent experience; typically 12 years in a DevOps or Systems Engineering role.
Familiarity with container technologies such as Docker and Kubernetes.
Experience implementing continuous integration and continuous delivery (CI/CD) tools and systems.
Proficiency in programming and scripting languages such as Python, Java, and shell (Bash).
Experience automating with tools such as Ansible (playbooks).
Experience deploying infrastructure with Terraform
Strong understanding of networking fundamentals, including TCP/IP, DNS, IPv4/IPv6, load balancing, and common network protocols
Familiarity with CNCF ecosystem tools and emerging trends in platform engineering
Experience designing, building, deploying, and maintaining infrastructure, including Kubernetes clusters
Experience upgrading Kubernetes clusters with little or no downtime.
Experience configuring service mesh, network policy controls, and multi-tenancy in Kubernetes
Strong Kubernetes, cloud-native, and containerization expertise in a hybrid-cloud enterprise environment, including in a solution architect capacity
Strong Spark skills
Excellent analytical and problem-solving skills with the ability to effectively communicate complex technical information
Strong written communication skills are essential, as well as the ability to create clear and informative documentation
Ability to work effectively across internal and external organizations
Flexibility to work off-hours for on-call duties

Nice To Haves

Relevant certifications, such as Certified Kubernetes Administrator (CKA)

Responsibilities

Serve as the technical lead and owner for the Compute platform (on-premises and EKS)
Deploy containerized applications across a cluster of bare-metal servers using tools such as Ansible and Terraform
Facilitate automatic scaling of containerized applications based on demand
Manage utilization of compute resources such as CPU, memory, and storage across the cluster
Implement monitoring solutions (e.g., Prometheus, Grafana) to track the health and performance of bare-metal clusters and infrastructure components
Set up alerting mechanisms to detect and respond to issues proactively
Recover from failures by restarting failed containers or reallocating workloads to healthy nodes
Work closely with development and engineering teams to establish CI/CD pipelines for automating the deployment and rollout of Kubernetes services
Support seamless rolling updates allowing new versions to be deployed gradually while maintaining application availability
Identify performance bottlenecks in containerized environments and optimize resource utilization through capacity planning, auto-scaling, and performance tuning
Document processes, procedures, and best practices related to platform operations, and share knowledge with team members
Partner with SREs to define platform SLAs, uptime targets, resilience benchmarks, and alerting/monitoring
Lead incident response and root cause analysis, automating recovery workflows and improving platform resiliency