14539 – Site Reliability Engineer (Hybrid) – Austin, TX

Start Date: ASAP
Type: Temporary Project
Estimated Duration: 12+ months with possible extensions
Work Setting: Hybrid. The position is 3 days remote, with 2 days (Mondays and Thursdays) required onsite. To avoid removal from future consideration, only candidates able to relocate as required should apply.

Required:
Experience in systems engineering, DevOps, or site reliability engineering roles (8+ years);
Experience with Linux/Unix systems and system internals (8+ years);
Experience in one or more programming/scripting languages (Python, Go, Java, Bash) (8+ years);
Experience designing and operating highly available, distributed systems (8+ years);
Experience with cloud platforms (AWS or GCP) and cloud-native services (8+ years);
Experience with containerization and orchestration (Docker, Kubernetes) (8+ years);
Experience with monitoring, alerting, and logging concepts (8+ years);
Experience defining and managing SLIs, SLOs, and error budgets (8+ years);
Experience with incident management, root cause analysis (RCA), and postmortems (8+ years);
Experience integrating security and compliance into operational workflows (8+ years).

Preferred:
Experience with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk) (4+ years);
Experience operating 24x7 production environments with on-call rotations (4+ years);
Experience with chaos engineering and resiliency testing (4+ years);
Experience with feature flags, canary deployments, and progressive delivery (4+ years);
Experience writing documentation for runbooks, dashboards, and operational standards (4+ years).

Responsibilities include but are not limited to the following:
Ensure system reliability and performance by designing, implementing, and maintaining highly available and scalable production systems;
Collaborate with development teams to build resilient, observable, and automated platforms that meet defined Service Level Objectives (SLOs);
Develop automation tools and scripts (using Python, Go, Bash, etc.) to streamline operational tasks, deployments, and monitoring setups;
Implement monitoring and alerting solutions using observability tools (e.g., Prometheus, Grafana, Datadog) to proactively detect and resolve issues;
Conduct incident management and root cause analysis (RCA) to improve system resilience and prevent recurring outages;
Integrate security and compliance requirements into infrastructure and operational workflows to ensure secure and compliant system performance;
Continuously optimize infrastructure and processes by performing cost-benefit analyses, evaluating alternative solutions, and innovating with new technologies.

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed
Number of Employees: 11-50 employees