Site Reliability / DevOps Engineer

eClerx•Raleigh, NC

12h•$120,000 - $137,500•Onsite

About The Position

eClerx is seeking a motivated SRE/DevOps Engineer with strong observability experience to join our growing Platform Engineering team. This team is responsible for managing cloud infrastructure, advancing DevOps practices, improving platform reliability, and supporting highly available enterprise applications. The ideal candidate will have a deep understanding of cloud-native architectures, distributed systems, CI/CD automation, observability frameworks, and site reliability engineering principles. This individual will play a key role in improving platform resilience, operational efficiency, and system performance across a modern cloud-based technology ecosystem.

Requirements

5+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
5+ years of work experience with public cloud platforms (Azure preferred, or AWS).
3+ years of hands-on experience with observability platforms such as Datadog, Elasticsearch, Grafana, or similar solutions.
5+ years of experience with scripting languages such as Python, Bash, or PowerShell.
3+ years of experience with containerization and orchestration technologies, including Docker and Kubernetes.
2+ years of experience developing and managing CI/CD pipelines using tools such as Azure DevOps, GitLab CI/CD, GitHub Actions, Jenkins, or similar.
2+ years of experience with Infrastructure-as-Code (IaC) tools such as Terraform, Azure Bicep, AWS CloudFormation, or equivalent technologies.
1+ years of experience using site reliability and resilience testing tools such as Gremlin, Chaos Mesh, or similar platforms.
Proven experience leveraging observability best practices, end-user monitoring, application performance monitoring, and infrastructure monitoring solutions.
Experience with event streaming and messaging platforms such as Kafka or Azure Event Hubs.
Strong understanding of Linux operating systems and administration.

Nice To Haves

Kubernetes certification.
Cloud platform certifications (Azure, AWS, or GCP).
Experience working in Azure environments and/or Azure DevOps.
Experience implementing and managing Datadog or other modern observability platforms.
Experience supporting enterprise-scale applications within financial services, capital markets, fintech, or other highly regulated industries.

Responsibilities

Design, implement, and enhance system observability and monitoring solutions.
Monitor system performance, create incident response plans, and implement observability practices to gain deeper insights into system behavior.
Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Improve platform reliability, scalability, and resiliency.
Conduct post-incident reviews and implement corrective actions to prevent recurring issues.
Partner with engineering teams to implement observability tooling and leverage telemetry data to troubleshoot and resolve incidents.
Utilize observability and event management capabilities to improve key operational metrics, including Mean Time to Detect (MTTD) and Mean Time to Restore (MTTR).
Continuously optimize infrastructure, architecture, automation, CI/CD processes, and operational workflows.
Collaborate closely with software engineers to ensure applications are designed and deployed following DevOps and reliability best practices.
Participate in a rotating on-call schedule, including support for production releases and critical incidents outside normal business hours when required.