Site Reliability Engineer

Cayuse Holdings • Cedar Park, TX

Hybrid

About The Position

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, scalability, and performance of the organization’s production systems. This role combines software engineering and systems engineering practices to automate and improve infrastructure operations, reduce manual work, and enable rapid response to incidents. The SRE partners with development, operations, and business teams to drive continuous improvement, implement resilient systems, and meet well-defined service level objectives (SLOs). This position aligns with Cayuse’s core values of Innovation, Excellence, Collaboration, Adaptability, and Integrity by fostering technical solutions that meet customer needs, promoting teamwork, and prioritizing quality in deliverables.

Requirements

8 years of experience in systems engineering, DevOps, or site reliability engineering roles.
8 years of experience with Linux/Unix systems and system internals.
8 years of experience programming or scripting in one or more languages (e.g., Python, Go, Java, Bash).
8 years of experience designing and operating highly available, distributed systems.
8 years of experience with cloud platforms (such as AWS or GCP) and cloud-native services.
8 years of experience with containerization and orchestration (e.g., Docker, Kubernetes).
8 years of experience applying monitoring, alerting, and logging concepts.
8 years of experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
8 years of experience with incident management, root cause analysis (RCA), and postmortems.
8 years of experience integrating security and compliance into operational workflows.
Must be able to pass a background check. Additional background checks may be required by projects and/or clients at any time during employment.
Exceptional interpersonal skills with the ability to communicate in a clear, professional, and articulate manner.
Exceptional verbal and written communication skills.
Excellent organizational, analytical, and problem-solving skills with high-level attention to detail.
Ability to analyze systems and procedures.
Strong multitasking skills with the ability to manage multiple design streams across concurrent work efforts.
Must be self-motivated and able to work well independently as well as on a multi-functional team.
Ability to handle sensitive and confidential information appropriately.

Nice To Haves

4 years of experience with observability tools such as Prometheus, Grafana, Application Insights, Datadog, or Splunk.
4 years of experience operating 24x7 production environments, including participation in on-call rotations.
4 years of experience with chaos engineering and resiliency testing.
4 years of experience with feature flags, canary deployments, and progressive delivery strategies.
4 years of experience creating runbooks, dashboards, and operational standards, with strong documentation skills.

Responsibilities

Understand business objectives and operational challenges, and identify alternative technical solutions.
Perform studies and cost/benefit analyses to evaluate potential solutions.
Analyze user requirements, operational procedures, and workflow problems to identify opportunities for automation or improvement of computer systems.
Consult with personnel from different departments to understand current procedures, identify issues, and gather specific input and output requirements (e.g., data entry forms, reporting formats).
Write detailed descriptions of user needs, desired program functions, and the steps required to develop or modify computer programs.
Review computer system capabilities, technical specifications, and scheduling limitations to assess the feasibility of requested program changes.
Ensure the reliability, availability, performance, and scalability of production systems using software engineering practices.
Collaborate closely with development teams to design, build, and maintain resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Develop and implement automation tools to streamline manual and repetitive operational tasks.
Document processes, workflows, and system configurations to support ongoing operations and future enhancements.
Continuously monitor production systems, proactively addressing incidents and performance issues.
Participate in capacity planning and ongoing improvements to system resilience and scalability.
Maintain effective communication with executive management, business stakeholders, and cross-functional technical teams.
Stay current with emerging site reliability engineering practices, tools, and technologies.
Other duties as assigned.