Senior Consultant - SRE Architect

Qode • Arlington, TX

About The Position

Incedo is a global AI and data transformation specialist empowering companies to realize sustainable business impact from their digital investments by delivering ROI from AI@Scale. As a long-term partner for strategy to execution, we operate at the intersection of business and technology. Our integrated services and platforms are built on the foundation of AI & Data, digital engineering, and operations transformation, bringing deep domain expertise and full stack capabilities together. With over 4,000 people in the US, Canada, Latin America and India and a large, diverse portfolio of Fortune 500 enterprises and fast-growing clients worldwide, we work across banking & payments, wealth management, telecom, hi-tech and life sciences.

We are seeking a highly experienced Senior Consultant / SRE Architect to lead the strategy, design, and implementation of enterprise-wide observability and reliability frameworks supporting business-critical transaction flows across distributed systems. In this role, you will act as a thought leader and architect, driving end-to-end transaction visibility, resilience, and performance optimization across microservices, APIs, databases, and third-party integrations. You will partner with engineering, architecture, and business stakeholders to define standards, influence technical direction, and implement scalable observability solutions.

This is a high-impact role focused on transforming SRE maturity, improving advisor experience, and enabling proactive, data-driven operations through modern observability practices. The ideal candidate is passionate about SRE, observability, and system design, with a proven ability to drive large-scale transformation initiatives.

Requirements

10+ years of experience in SRE, Observability, or related roles, with a strong focus on architecture and strategy
Deep hands-on expertise with observability platforms such as Dynatrace, ELK, Datadog, Splunk, OpenTelemetry, and Jaeger
Proven experience designing observability solutions in cloud environments (AWS, Azure, GCP)
Strong understanding of microservices architecture, APIs, and distributed systems
Proficiency in programming/scripting (e.g., Python, Go, Java) for automation and integration
Demonstrated ability to lead cross-functional initiatives and influence technical direction

Nice To Haves

Dynatrace Associate or Professional Certification
Experience implementing OpenTelemetry standards at scale
Strong background in chaos engineering and resiliency testing
Familiarity with AIOps platforms and intelligent automation solutions
Consulting experience or prior role as an architect / technical advisor

Responsibilities

Define and lead the enterprise observability strategy for end-to-end transaction traceability across distributed systems
Architect scalable solutions leveraging tools such as Dynatrace, OpenTelemetry, ELK, Grafana, Datadog, Splunk, and Jaeger
Establish standardized frameworks for logging, metrics, tracing, and telemetry collection
Design and implement dependency mapping and service topology visualization across complex ecosystems
Provide architectural guidance for monitoring latency, throughput, and error rates across critical transaction paths
Lead root cause analysis using distributed tracing and telemetry data to resolve systemic performance issues
Partner with application and database teams to optimize system performance and scalability
Drive adoption of performance engineering best practices across teams
Define and implement resiliency strategies for business-critical transaction flows
Architect fault-tolerant systems, including failover, redundancy, and self-healing mechanisms
Lead and design chaos engineering initiatives to validate system resilience
Establish and govern Service Level Objectives (SLOs) and Service Level Indicators (SLIs) aligned to business outcomes
Act as a trusted advisor to engineering teams, architects, and leadership on observability and SRE best practices
Define and enforce standards, policies, and governance models for monitoring and tracing
Lead cross-functional initiatives to drive adoption of observability frameworks
Mentor engineers and SRE teams, fostering a culture of continuous improvement and operational excellence
Drive measurable improvements, including: a 30% reduction in MTTD and MTTR within the first year; ≥70% root cause identification within 1 hour; ≥90% proactive issue detection via monitoring systems
Develop executive-level reporting on system health, reliability trends, and performance metrics
Build reusable frameworks, accelerators, and playbooks for incident management and observability adoption
Establish comprehensive documentation for transaction flows, system dependencies, and observability architectures
Develop and standardize incident response playbooks and runbooks
Lead training and enablement initiatives to scale observability expertise across teams