Senior Site Reliability Engineer

Allied Consultants•Austin, TX

Hybrid

About The Position

Allied Consultants, Inc. is a proudly Austin-based firm with over 34 years of experience delivering top-tier technical and business professionals to Texas State Agencies. We are currently seeking an experienced Senior Site Reliability Engineer to play a key role within a high-impact technical services team.

At Allied Consultants, we value our consultants and are committed to providing an exceptional experience, including:

Highly competitive pay rates
Local support staff for responsive, personal service
Comprehensive benefits package, including medical insurance (with employer cost sharing), life insurance, a 401(k) plan with company match, and flexible spending through a cafeteria plan

Candidates selected for interviews will be subject to a criminal background check and may be required to pass a drug screening, in compliance with federal and state regulations. All offers of employment are contingent upon successful completion of these checks.

Allied Consultants is proud to be an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. The engineer partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).

Location of job: Hybrid, 3 days remote with 2 days (Mondays and Thursdays) onsite. Candidates must be local to Austin, TX.

Requirements

8 or more years of experience. Relies on experience and judgment to plan and accomplish goals, and independently performs a variety of complicated tasks. A wide degree of creativity and latitude is expected.
8 years: Experience in systems engineering, DevOps, or site reliability engineering roles
8 years: Strong experience with Linux/Unix systems and system internals
8 years: Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
8 years: Experience designing and operating highly available, distributed systems
8 years: Strong knowledge of cloud platforms (AWS or GCP) and cloud-native services
8 years: Experience with containerization and orchestration (Docker, Kubernetes)
8 years: Strong understanding of monitoring, alerting, and logging concepts
8 years: Experience defining and managing SLIs, SLOs, and error budgets
8 years: Familiarity with incident management, root cause analysis (RCA), and postmortems
8 years: Experience integrating security and compliance into operational workflows

Nice To Haves

4 years preferred: Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk)
4 years preferred: Experience operating 24x7 production environments with on-call rotations
4 years preferred: Experience with chaos engineering and resiliency testing
4 years preferred: Experience with feature flags, canary deployments, and progressive delivery
4 years preferred: Strong documentation skills for runbooks, dashboards, and operational standards

Responsibilities

Understands business objectives and problems, identifies alternative solutions, and performs studies and cost/benefit analyses of alternatives.
Analyzes user requirements, procedures, and problems to automate processing or to improve existing computer systems. Confers with personnel of the organizational units involved to analyze current operational procedures, identify problems, and learn specific input and output requirements, such as forms of data input, how data is to be summarized, and formats for reports.
Writes detailed descriptions of user needs, program functions, and steps required to develop or modify computer programs.
Reviews computer system capabilities, specifications, and scheduling limitations to determine whether a requested program or program change is possible within the existing system.
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations.
Partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).