Site Reliability Engineer – UDF

F5•Seattle, WA

Hybrid

About The Position

This role is a new position on our Unified Demo Framework (UDF) platform team, supporting the launch and management of the F5 Guardrails and Redteam product lines in UDF. The role focuses on designing, deploying, and supporting Kubernetes environments that serve a wide variety of use cases across many F5 teams. As a technical expert, the SRE will work closely with cross-functional teams to instantiate AI features, optimize system performance, and ensure reliability in production environments. The ideal candidate will have deep expertise in Kubernetes orchestration and containerized architectures, and will build and run systems with an operational excellence mindset. This individual will play a critical role in advancing the operational maturity and scalability of the UDF platform and in ensuring our ability to incorporate new F5 product lines and features.

Requirements

Bachelor’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent experience).
4+ years of experience in Site Reliability Engineering (SRE), DevOps, or similar roles with a focus on container management and AWS usage.
Strong expertise in managing Kubernetes clusters and containerized workloads in production environments.
Hands-on experience deploying and managing Kubernetes environments in AWS, especially using EKS, as well as in self-hosted ecosystems such as on-premises data centers.
Proficiency in monitoring and observability tools, including CloudWatch, Grafana, Fluentd, Datadog, or equivalent platforms.
Expertise with Infrastructure-as-Code (IaC) tools such as Terraform, Helm, or CloudFormation, and CI/CD frameworks.
Solid understanding of networking, storage, and compute infrastructure within containerized environments.
Proficiency in coding and scripting languages, including Python, Go, or Bash, with a focus on automation and system integration.
Expertise in applying security best practices to Kubernetes environments, including data protection and resource access controls.
Familiarity with GPU-based workloads in Kubernetes environments and optimization strategies for AI-based workloads.
Experience orchestrating, troubleshooting, and optimizing complex network environments in AWS and GCP VPCs in line with best practices.
Experience working with hypervisors in GCP VPCs.

Nice To Haves

Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).
Relevant cloud certifications, such as AWS Certified Solutions Architect or GCP Cloud Architect.
Familiarity with advanced Kubernetes tools and techniques such as service mesh technologies (Istio, Linkerd) or Kubernetes operators for machine learning workflows.
Knowledge of distributed computing concepts and experience supporting large-scale AI workloads.
Practical experience integrating observability and monitoring into pipelines for inference engines and machine learning models.

Responsibilities

Design, deploy, and manage Kubernetes clusters and ensure efficient container orchestration to support AI workloads.
Implement and maintain Kubernetes-based deployment pipelines.
Optimize resource allocation within Kubernetes clusters, while reducing costs and maximizing performance.
Develop and maintain high-availability and fault-tolerant Kubernetes architectures to ensure service continuity.
Design and implement observability pipelines for real-time monitoring of Kubernetes clusters, including metrics collection for scaling, resource utilization, and system health.
Leverage tools such as CloudWatch, Datadog, Grafana, or similar platforms to ensure visibility into Kubernetes-managed workloads.
Establish logging, tracing, and alerting strategies to enable proactive identification and resolution of performance or reliability issues.
Automate infrastructure management tasks to support the efficient deployment and operation of AI functionalities, including upgrades, scaling, and provisioning.
Support Infrastructure-as-Code (IaC) methodologies for the provisioning and configuration of environments, leveraging tools such as Terraform or Helm.
Contribute to the development of CI/CD workflows tailored for automatic scaling and effective change management practices.
Collaborate with product teams and sales engineering to integrate F5 products into the UDF platform and ensure effective utilization by the sales organization.
Support root cause analysis (RCA) processes for issues affecting the UDF platform, driving long-term corrective actions to improve system reliability.
Provide technical expertise to design operational workflows and procedures that improve the agility and stability of the UDF platform.