Sr. SRE / DevOps Engineer - Sunnyvale, CA (Local Candidates Only)

Donato Technologies
Sunnyvale, CA
Onsite

About The Position

We are looking for a Sr. SRE / DevOps Engineer for our Sunnyvale, California location. As a Site Reliability Engineer, you will work closely with cross-functional teams to automate operations, optimize infrastructure, implement security, and solve issues in an exciting, fast-paced environment. You will play a vital role in ensuring that our systems are reliable, scalable, and high-performing.

Requirements

  • 8+ years of experience in DevOps and Site Reliability Engineering.
  • Hands-on with containerization and orchestration: Docker, Kubernetes/EKS.
  • Proficiency in infrastructure as code tools: Terraform, Ansible, or CloudFormation.
  • Experience setting up and managing services running on Kubernetes.
  • In-depth understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
  • In-depth knowledge of monitoring and observability tools such as Splunk
  • Knowledge of Linux operating system principles, networking fundamentals, and systems management
  • Demonstrable fluency in at least one of the following languages: Java or Python
  • Ability to identify and communicate technical and architectural problems, while working with partners and their team to iteratively find solutions.
  • Experience building and managing CI/CD pipelines: gatekeeping production deployments, developing and implementing Git branching strategies and branch protection rules, defining network policies, and scaling workloads up and down on AWS.
  • Strong problem-solving and analytical skills
  • Ability to solve performance and scalability issues in the system.
  • Excellent communication and collaboration skills
  • Ability to propose and implement improvements in the system
  • Ability to work with cross-functional stakeholders
  • Adaptability and a willingness to learn new technologies and techniques.
  • Proactive approach to issues and the ability to provide prompt resolutions

Responsibilities

  • Ensure system reliability and availability
  • Monitor for system issues, create strategies to detect and address them, design automated troubleshooting systems, and write and review post-mortems.
  • Mitigate operational risks
  • Collaborate with development teams and other stakeholders to identify potential risks, perform risk assessments, implement risk mitigation strategies, and continuously monitor and review the effectiveness of those strategies.
  • Monitor system health.
  • Minimize emergency response time (MTTR).
  • Maintain CI/CD pipelines.
  • Drive continuous improvement by collaborating with various teams.
  • Automate processes.