Site Reliability Engineer II

CME Group•New York, NY

Hybrid

About The Position

CME Group is seeking a Site Reliability Engineer II (SRE II) to help build, operate, and scale systems within our Markets portfolio. This role focuses on products and applications related to CME’s Globex trading platform, known for its low-latency performance and high reliability during peak trading days. The SRE II will collaborate with senior engineers to improve system observability, monitoring, automation, and overall production service reliability. The role places growing emphasis on integrating Artificial Intelligence (AI) and Machine Learning (ML) to support predictive reliability and reduce manual operational work.

Requirements

A keen interest in SRE, automation, and intelligent operations (AIOps).
Experience with Linux-based systems.
Programming and scripting skills (Python, Bash, etc.).
Strong problem-solving and analytical abilities.
Excellent communication and teamwork skills.
Eagerness to learn and adapt in a fast-paced trading environment.

Nice To Haves

Demonstrated hands-on experience applying AI/ML techniques to improve operational efficiency, reliability, or observability.
Experience using AIOps platforms such as Dynatrace, New Relic, Moogsoft, BigPanda, or integrating open-source tools (e.g., Prometheus with ML models).
Experience with LLMs for operations, incident management, or log analysis (e.g., using LangChain, LlamaIndex, or tools like PagerDuty AIOps).
Experience with cloud-based platforms; Google Cloud Platform (GCP), including GCE and/or GKE, is a strong bonus.
Experience with metrics & monitoring tools like OpenTelemetry, Splunk, Prometheus, and Grafana.
Experience with Kubernetes and knowledge of working with distributed systems.
Basic knowledge of networking (HTTP/TCP/UDP/IP) and message-oriented middleware.
Experience in financial markets and experience working in an Agile environment.

Responsibilities

Work alongside product teams and senior engineers to help build out observability, monitoring, and alerting for key services.
Implement AI-driven reliability solutions, including anomaly detection, predictive alerting, and root cause analysis in production environments.
Collaborate with engineers and product teams to ensure requirements are understood, planned carefully, and implemented safely.
Participate in on-call rotation and assist in incident response under guidance from senior engineers.
Write scripts and tools to reduce toil and improve velocity, including building or integrating intelligent auto-remediation and capacity forecasting systems.
Leverage LLMs and Generative AI to enhance incident management, automate runbooks, and streamline log analysis.
Contribute to disaster recovery (DR) and systems resiliency testing & improvements.
Support the migration of markets applications to Google Cloud Platform (GCP).
Collaborate with cross-functional teams to improve system performance and operational efficiency.

Benefits

Competitive compensation and benefits package.
Annual target bonus opportunity.
Opportunity to become an owner in the company through our broad-based equity program.
Comprehensive health coverage.
Retirement package that includes both a 401(k) and an active pension plan.
Highly competitive education reimbursement provisions.
Paid time off.
Mental health benefit.
