Arista Networks, posted 3 months ago
Austin, TX
1,001-5,000 employees
Computer and Electronic Product Manufacturing

This is not a traditional operations role. You will inherit a set of critical, manual, hands-on operational responsibilities that are essential to our customers' success, and we need you to help lead the effort to systematically dismantle that burden through automation, tooling, and systems. You will work with a collaborative team of excellent engineers, and with a counterpart in the same role, on both the manual toil and the systems we need to engineer. In the short term, the work is manual deployments, reactive troubleshooting, and on-call escalations. The goal is a system in which programmatic solutions have replaced human intervention. You must have the pragmatism to manage the current reality and the systematic impatience to build its replacement.

Success in this role requires a dual mindset. You must be a skilled incident leader who can stabilize a crisis and a deliberate systems architect who can prevent the next one. You will work closely with our internal tools, platform, and product engineering teams to channel your direct operational knowledge into durable, long-term solutions.

  • Phase 1: Stabilize and Map (First 3-6 Months). You will embed with the team, taking ownership of the existing operational workload alongside the product engineers and the other customer SRE, who covers the India time zone. This workload includes customer deployments, upgrades, and incident response. Your initial goal is to achieve stability while mapping the landscape of our operational toil.
  • Phase 2: Automate and Influence (Months 6-18). Armed with your map of toil, you will begin to automate. You will write code, build tooling, and deploy declarative infrastructure to eliminate the most critical operational burdens. For larger projects, you will act as a primary stakeholder, providing clear requirements to our internal tooling and platform teams and ensuring their solutions meet the operational need. Your success will be measured by a demonstrable reduction in overall support effort: fewer pages, fewer support escalations, and fewer manual tasks.
  • Phase 3: Architect and Evangelize (Year 2+). With the most acute operational pains addressed, your focus will shift to architectural concerns. You will define and implement Service Level Objectives (SLOs), influence the design of new products for operability, and help instill SRE principles throughout the engineering organization.
  • Strong background in Site Reliability Engineering or a closely related DevOps function.
  • Strong command of Linux systems administration and understanding of networking fundamentals (TCP/IP, DNS, routing).
  • Experience working directly with external customers to solve difficult technical problems.
  • Production experience with a major cloud provider, preferably AWS, and proficiency in its core concepts and services (VPC, EC2, IAM, S3).
  • Experience building and managing infrastructure as code with tools like Terraform.
  • Hands-on experience instrumenting applications and managing telemetry pipelines for metrics, logs, and traces.
  • Proficient in writing code to automate operational tasks, with expertise in a high-level language like Python or Go, and strong shell scripting skills (e.g., Bash).
  • Proficiency with Kafka, Postgres, nginx, systemd, etc.
  • Proficiency in Nix and NixOS.
  • Exposure to or proficiency in functional programming languages and paradigms.
  • Diversity of thought and perspectives.
  • Inclusive environment fostering creativity and innovation.
  • Recognition for excellence in engineering and diversity.