Site Reliability Engineer (SRE) - Data Center & Infrastructure

Exegy • St. Louis, MO

About The Position

Exegy is seeking a highly motivated and detail-oriented Site Reliability Engineer (SRE) to support and enhance the reliability, scalability, and performance of our global data center and hybrid infrastructure environments. This role blends software engineering, systems engineering, automation, and operational rigor to ensure high-availability services powering Exegy's mission-critical market data products and internal platforms. As an SRE, you will own and improve operational processes, expand automation, strengthen observability, support capacity planning, and design systems that gracefully handle failure with minimal business impact. You will collaborate across Infrastructure, Network Engineering, Security, and DevOps teams to deliver resilient, secure, and scalable platforms.

Requirements

Bachelor’s degree in Computer Science or Engineering, or equivalent experience
5+ years in Site Reliability Engineering, Systems Engineering, or Infrastructure Operations
Hands-on experience with VMware, Hyper-V, or similar virtualization technologies
Strong Linux and Windows server administration background
Experience with on-prem data centers, hardware lifecycle, and networking
Proficiency in automation and scripting (PowerShell, Bash, Python, Ansible, Terraform)
Experience with monitoring, logging, and observability platforms
Ability to participate in an on-call rotation and support critical incidents

Nice To Haves

Familiarity with AWS, Azure, or GCP in hybrid environments

Responsibilities

Maintain and improve uptime across core systems including compute, storage, virtualization, load balancers, and data center network infrastructure
Support production services across on-prem data centers, co-locations, and hybrid cloud environments
Participate in the 24×7 on-call rotation, major incident response, and post-mortems
Lead root cause analysis (RCA) and drive long-term remediation plans
Identify system failure patterns and implement hardening strategies
Develop and maintain automation using Ansible, Terraform, PowerShell, Python, Puppet, or similar tools
Automate operational workflows, configuration management, deployments, and failover testing
Implement and improve Infrastructure-as-Code (IaC) for consistency and reduced drift
Build and enhance monitoring across systems, networks, and applications (Prometheus, Grafana, Datadog, New Relic, SolarWinds, Splunk, etc.)
Improve alert fidelity, create health dashboards, and expand log aggregation
Conduct proactive performance tuning across hardware, virtualization, and OS layers (Windows/Linux)
Support physical and virtual data center infrastructure including racking/stacking, cabling, hardware lifecycle, and capacity planning
Own patching, firmware upgrades, refresh cycles, and vendor coordination
Support DR/BCP testing, multi-site failover architecture, and replication strategies
Maintain secure baseline configurations aligned with CIS Benchmarks, NIST, and ISO standards
Partner closely with Network, Security, DevOps, and Application Engineering teams to improve reliability end-to-end
Influence architecture decisions regarding capacity, resiliency, and scalability
Create and maintain runbooks, playbooks, standards, and operational documentation
Implement and maintain security controls including MFA, encryption, logging, PAM, and patch compliance
Support audit requirements for SOC 2, ISO 27001, CIS Controls, and internal governance obligations
Participate in vulnerability remediation efforts and system hardening