Senior Platform Reliability Engineer

CDS Global•Dallas, TX

4d•Hybrid

About The Position

Platform Reliability Engineers (PREs) at Homecare Homebase ensure that our most critical healthcare services remain reliable, resilient, and high-performing at scale. Blending software engineering with systems operations, PREs focus on automation, observability, incident response, and the continuous reduction of toil across complex distributed platforms. This role calls for confident execution in high-stakes, high-visibility scenarios—particularly during major incidents—alongside proactive efforts to harden existing systems and improve service health over time. Ideal candidates are those who thrive in complex environments, take ownership of production reliability, and find purpose in creating systems that recover gracefully and support exceptional care delivery. Platform Reliability Engineers work closely with HCHB’s Architects, Product & Development teams, System Administrators, Platform Engineers, DBAs, and Product Support in the execution of their responsibilities.

Requirements

Bachelor’s degree in Computer Science, Systems Engineering, Math, or a related field required (equivalent experience considered).
3+ years of experience in a 24x7 enterprise-class production environment as an SRE or in a comparable role.
3+ years of Kubernetes administration/support in a production environment.
3+ years of Azure or AWS PaaS, IaaS, and resource administration/support in a production environment.
Demonstrated composure and effectiveness in situations requiring rapid analysis, clear prioritization, and decisive action – particularly in incidents with significant business or customer impact.
Excellent problem-solving and analytical skills, with attention to detail and the ability to drive issues to resolution.
Experience solving problems via automation using orchestration platforms such as Ansible, Azure Automation, and ServiceNow Flows.
Proficient with scripting languages (multiple preferred): Bash, PowerShell, Python, and JavaScript.
Proficient with data-tier languages: T-SQL and GraphQL.
Proficient with the following monitoring solutions (multiple preferred): Datadog, Splunk, Prometheus/Grafana, Application Insights, Azure Monitor, and Microsoft SCOM.
Proficient with modern SRE and observability concepts (e.g., OpenTelemetry, service level management).

Nice To Haves

Academic coursework in Algorithms, Data Structures, Distributed Systems, and Information Security.
1+ year(s) serving as incident commander for major incidents.
Proficient with networking and network troubleshooting (e.g., addressing, routing, DNS, load balancing, mesh networking).
Ability to debug and optimize infrastructure-as-code pipelines using Ansible, Terraform, and Azure Resource Manager (ARM) templates.
Proficient with ITSM/ITIL practices such as service management, change management, incident management, and problem management, particularly in ServiceNow.
Experience designing large-scale distributed systems.
Experience designing and developing software oriented towards systems or network automation.
Proficient with administration, automation, and orchestration of large-scale Windows and Linux environments using configuration management solutions such as DSC and Ansible.
Experience operating in large SQL databases with complex business logic.
Experience using ML/AI technologies to accelerate your work.
Experience with healthcare industry HIPAA regulations (experience in a similarly regulated environment, e.g., PCI or SOX, considered).
Experience working in an Agile and/or SAFe environment.
ITIL Foundation
Configuration: RHCE-Ansible
Kubernetes: CKA, KCSP
Linux: RHCE, CompTIA Linux+, GCUX, LPI
Microsoft/AWS: Administrator, DevOps Engineer

Responsibilities

Deliver solutions that enhance the overall reliability of the platform and/or reduce toil.
Establish and implement modern observability patterns.
Monitor overall platform health and manage uptime and availability.
Evangelize best practices and industry standards.
Plan and implement modern SRE practices.
Develop and align SLOs/SLIs, error budgets, and capacity models to fulfill business needs.
Operationalize services, including system testing, instrumentation, monitoring, capacity model development, training, and transition to operations teams.
Participate in the full project lifecycle, from planning through implementation and operational readiness to decommissioning.
Manage deployments of major releases.
Lead and coordinate resolution efforts during major incidents by serving as the incident commander.
Participate in an equitable 24x7 on-call rotation—serving as first responder for production alerts and escalation point for other teams.
Understand the impact of technical implementations and processes on the business.
Work with business owners to define SLAs in contracts.
Present new designs and plans to the Architectural Advisory Board for feedback.
Plan and manage the team's projects.
Build relationships with peers, leads, and managers.
Act as a technical leader who serves as a point of escalation and provides mentorship and technical direction.