Sr. DevOps

Archer, San Jose, CA

About The Position

As a Senior DevOps Engineer, you will be a key contributor to our infrastructure strategy, focusing on automation, stability, and performance across both cloud and on-premises environments. You will drive best practices in CI/CD, configuration management, and monitoring, with a specific focus on optimizing the deployment and operation of large language models (LLMs) and related technologies.

Requirements

  • 5+ years of professional experience in a DevOps, SRE, or infrastructure engineering role.
  • Deep expertise in containerization and orchestration, specifically Kubernetes (design, deployment, and troubleshooting) and Docker.
  • Strong proficiency in managing infrastructure in both cloud (e.g., AWS, GCP, Azure) and on-premises environments.
  • Expert-level administration skills in Linux and strong working knowledge of Windows Server environments.
  • Proven experience with Infrastructure as Code (IaC) and Configuration Management tools (e.g., Terraform, Ansible).
  • High proficiency in scripting and automation using Python and Bash.
  • Extensive experience with monitoring and observability platforms, especially Datadog or comparable tools such as Prometheus/Grafana or New Relic.
  • Hands-on experience deploying and managing technologies related to large language models (LLMs), such as using LiteLLM or OpenRouter, or setting up and managing LLM serving endpoints.

Nice To Haves

  • Experience with specific Kubernetes distributions (e.g., K3s, Rancher, OpenShift).
  • Familiarity with network configuration, firewalls, and security best practices for hybrid environments.
  • Experience in MLOps workflows and related tools (e.g., MLflow, Kubeflow).
  • Certifications such as CKA, CKAD, or relevant cloud provider certifications.

Responsibilities

  • Design, deploy, and manage highly available, scalable infrastructure using Kubernetes and Docker across public cloud (e.g., AWS, GCP, Azure) and on-premises data centers.
  • Develop and maintain robust Infrastructure as Code and configuration management solutions (e.g., Terraform, Ansible) for consistent environment provisioning and management.
  • Implement and manage CI/CD pipelines to facilitate rapid, reliable, and automated software releases.
  • Administer and troubleshoot operating systems, encompassing both Linux and Windows environments.
  • Implement and optimize observability practices using monitoring tools like Datadog for logging, tracing, and alerting.
  • Spearhead the operational deployment, scaling, and maintenance of LLM infrastructure, leveraging tools like LiteLLM, OpenRouter, or similar LLM orchestration/gateway technologies.
  • Automate repetitive tasks and system operations using scripting languages, primarily Bash and Python.
  • Collaborate closely with development, MLOps, and security teams to ensure infrastructure supports product requirements and compliance standards.
  • Participate in an on-call rotation to ensure service reliability and responsiveness to incidents.

What This Job Offers

Job Type: Full-time

Career Level: Mid Level

Education Level: No Education Listed

Number of Employees: 501-1,000 employees
