Site Reliability Engineer (Hybrid)

Vitaver & Associates, Austin, TX

About The Position

14539 – Site Reliability Engineer (Hybrid) – Austin, TX

  • Start Date: ASAP
  • Type: Temporary Project
  • Estimated Duration: 12+ months with possible extensions
  • Work Setting: Hybrid. The position is 3 days remote, with 2 days (Mondays and Thursdays) required onsite.

Only candidates able to relocate as required should apply; others may be removed from future consideration.

Requirements

  • Experience in systems engineering, DevOps, or site reliability engineering roles (8+ years)
  • Experience with Linux/Unix systems and system internals (8+ years)
  • Experience in one or more programming/scripting languages (Python, Go, Java, Bash) (8+ years)
  • Experience designing and operating highly available, distributed systems (8+ years)
  • Experience with cloud platforms (AWS or GCP) and cloud-native services (8+ years)
  • Experience with containerization and orchestration (Docker, Kubernetes) (8+ years)
  • Experience with monitoring, alerting, and logging concepts (8+ years)
  • Experience defining and managing SLIs, SLOs, and error budgets (8+ years)
  • Experience with incident management, root cause analysis (RCA), and postmortems (8+ years)
  • Experience integrating security and compliance into operational workflows (8+ years)

Nice To Haves

  • Experience with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk) (4+ years)
  • Experience operating 24x7 production environments with on-call rotations (4+ years)
  • Experience with chaos engineering and resiliency testing (4+ years)
  • Experience with feature flags, canary deployments, and progressive delivery (4+ years)
  • Experience documenting runbooks, dashboards, and operational standards (4+ years)

Responsibilities

  • Ensure system reliability and performance by designing, implementing, and maintaining highly available and scalable production systems
  • Collaborate with development teams to build resilient, observable, and automated platforms that meet defined Service Level Objectives (SLOs)
  • Develop automation tools and scripts (using Python, Go, Bash, etc.) to streamline operational tasks, deployments, and monitoring setups
  • Implement monitoring and alerting solutions using observability tools (e.g., Prometheus, Grafana, Datadog) to proactively detect and resolve issues
  • Conduct incident management and root cause analysis (RCA) to improve system resilience and prevent recurring outages
  • Integrate security and compliance requirements into infrastructure and operational workflows to ensure secure and compliant system performance
  • Continuously optimize infrastructure and processes by performing cost-benefit analyses, evaluating alternative solutions, and innovating with new technologies