Senior Site Reliability Engineer

Qode • Arlington, TX

About The Position

Incedo is a global AI and data transformation firm helping organizations drive measurable business impact from digital investments. We operate at the intersection of business and technology, combining AI, data, and digital engineering to deliver scalable, high-impact solutions.

With over 4,000 professionals across the U.S., Canada, Latin America, and India, Incedo partners with Fortune 500 and high-growth organizations across banking, payments, wealth management, telecom, and life sciences.

Role Overview

We are seeking a Senior Site Reliability Engineer (SRE) to drive reliability, observability, and performance across business-critical distributed systems.

This is a hands-on engineering role with strong ownership, focused on building and scaling observability platforms, improving transaction visibility, and enhancing system resilience. You will work closely with engineering, platform, and infrastructure teams to ensure high availability, performance, and operational excellence across microservices, APIs, and cloud-native systems.

The ideal candidate combines deep technical expertise in SRE practices with a passion for automation, monitoring, and continuous improvement.

Requirements

7–10+ years of experience in Site Reliability Engineering or Production Support Engineering
Strong hands-on experience with observability tools (Dynatrace, Datadog, Splunk, ELK, Grafana, OpenTelemetry, Jaeger)
Experience supporting cloud-native environments (AWS, Azure, or GCP)
Deep understanding of microservices architecture and distributed systems
Proficiency in scripting/programming (Python, Go, Java, or similar)
Experience with monitoring, alerting, and incident management in production environments

Nice To Haves

Experience implementing OpenTelemetry at scale
Background in chaos engineering and resiliency testing
Familiarity with AIOps or intelligent monitoring platforms
Experience in financial services, banking, or wealth management environments
Dynatrace certification (Associate or Professional)

Responsibilities

Design, implement, and maintain observability solutions across distributed systems
Build and optimize logging, metrics, and tracing pipelines using tools like Dynatrace, Datadog, Splunk, ELK, Grafana, and OpenTelemetry
Enable end-to-end transaction tracing across microservices and APIs
Develop dashboards and alerting strategies for proactive issue detection
Own service reliability, uptime, and operational performance for critical systems
Lead incident response, root cause analysis (RCA), and postmortems
Reduce MTTD and MTTR through automation and improved observability
Create and maintain runbooks and incident response playbooks
Monitor and optimize system performance (latency, throughput, error rates)
Partner with application and database teams to troubleshoot bottlenecks
Use distributed tracing and telemetry data to identify and resolve issues
Implement performance testing and tuning strategies
Build and maintain fault-tolerant, highly available systems
Implement resiliency patterns (failover, retries, circuit breakers, self-healing)
Drive chaos engineering practices to validate system reliability
Automate operational tasks using scripting (Python, Go, etc.)
Define and enforce SLOs, SLIs, and error budgets aligned to business goals
Promote SRE principles across engineering teams
Partner with DevOps and platform teams to improve CI/CD reliability
Contribute to building a culture of operational excellence and accountability