Site Reliability Engineer - 7 Month Contract

Orion Health • Dallas, TX

About The Position

As a Site Reliability Engineer, you will play a pivotal role in ensuring the reliability, availability and performance of our cloud infrastructure and operating systems in mission-critical client solutions around the globe. To do this, you will design, manage and execute the upgrade and maintenance schedule for a defined list of clients. You will continually automate infrastructure processes, implement best practices and introduce new approaches and tools that enhance our software delivery pipeline and the reliability and performance of live client solutions.

Success means working proactively to predict client needs, increase efficiency, raise customer satisfaction and reduce the number and severity of support incidents. In practice, that means exceeding our SLAs and SLOs. You will produce upgrade and maintenance plans for all clients under your responsibility, and work with your team and client contacts to deliver those plans on time. You will implement and review infrastructure monitoring and observability tools, identifying, planning and delivering initiatives that deliver business and client value and reduce risk.

The Orion Health Tech Ops group exists to exceed client expectations in the maintenance and improvement of their Orion Health solutions. The Operations division succeeds together, so you will collaborate closely with all other roles across the Tech Ops and Service Management groups. You will also develop key internal relationships with the Product, Delivery and Solutions teams.

Requirements

Minimum of 3 years' proven experience in a similar infrastructure delivery or maintenance/management role
Mid- to senior-level skills in AWS cloud tools (provisioning, backup, monitoring, FinOps, etc.)
Intermediate-level skills in automation and Infrastructure as Code (IaC) tools (e.g. Ansible, CloudFormation)
Mid- to senior-level skills in Linux operating system administration
Mid- to senior-level skills in infrastructure monitoring and observability tools (experience with Grafana and Prometheus is a bonus)
Networking (DNS, DHCP, firewalls, routing, and Cisco switches)
VMware virtualization and SAN storage
IT security management
Excellent verbal and written communication skills in English
Ability to think logically and analytically in a problem-solving environment

Nice To Haves

Experience with Grafana and Prometheus

Responsibilities

Design, implement, and maintain systems and infrastructure so that the reliability and availability of our applications exceed our SLAs and SLOs.
Develop and implement strategies to mitigate and prevent system failures.
Respond promptly to infrastructure incidents escalated from our Tier 2 Support Analysts, diagnose root causes, and implement corrective actions to minimise downtime and ensure service continuity.
Conduct post-incident reviews to analyse and improve system reliability.
Implement robust monitoring solutions to track system health, performance, and potential issues.
Set up alerting mechanisms to proactively address potential problems before they impact users.
Automate routine operational tasks to improve efficiency and reduce manual intervention.
Implement Infrastructure as Code principles to ensure consistency and scalability.
Collaborate with development and Ops teams to optimise solution performance and resource utilisation.
Identify and address bottlenecks to improve overall system efficiency.
Conduct capacity assessments and plan for scalable infrastructure to meet growing business needs.
Work with cross-functional teams to ensure infrastructure supports current and future requirements.
Plan, coordinate, and execute infrastructure and operating system upgrades for client solutions, ensuring timely and high-quality delivery.