Senior Site Reliability Engineer (SRE) – CloudVision as a Service (CVaaS)

Arista Networks, Santa Clara, CA
$101,000 - $161,000

About The Position

We’re looking for Site Reliability Engineers to join Arista’s growing CloudVision-as-a-Service (CVaaS) global SRE team. SREs at Arista combine a strong software engineering background and systems architecture knowledge with a passion for operating production systems at scale. We are responsible for our global CloudVision service fleet, ensuring its scalability, reliability, and stability. You’ll gain firsthand experience as part of a rapidly growing product, working with a passionate group of engineers who unapologetically put product reliability and customer experience first. We firmly believe in building highly automated, self-sustaining environments and prioritize safe, efficient operations that leverage cutting-edge technologies and tools.

Arista’s CloudVision is an enterprise network management and streaming telemetry SaaS offering. The CloudVision stack is entirely Kubernetes-native. Familiarity with GCP (Google Cloud Platform) and GKE (Google Kubernetes Engine) is preferred. Our technical stack includes, but is not limited to, Golang, Python, Ansible/Pulumi, and Bash. You will be expected to develop, operate, and work with many different types of databases, both running directly on Kubernetes and as managed database products. We integrate with many Open Source Software (OSS) projects that power our microservices stack, our monitoring infrastructure, and much more.

Requirements

  • BS/MS degree in Computer Science or a related subject, or relevant experience.
  • 5+ years of software engineering experience.
  • Experience developing or managing deployments of distributed database systems or scale-out applications in a SaaS environment.
  • Proficiency in Python, Golang, and/or other languages.
  • Comfort with Bash and/or other scripting languages.

Responsibilities

  • Data Platform (NetDL) Architecture and Performance
  • Capacity Planning
  • Autoscaling
  • Disaster Recovery
  • Observability
  • Change Management - CI/CD
  • Service Network Architecture
  • Cost Optimization
  • Infrastructure and Cloud-First Application Security
  • Continuously improve operational processes by adding automation
  • Lead sustainable incident response and blameless postmortems