Senior Consultant - SRE Architect

Qode · Arlington, TX

About The Position

Incedo is a global AI and data transformation specialist empowering companies to realize sustainable business impact from their digital investments by delivering ROI from AI@Scale. As a long-term partner for strategy to execution, we operate at the intersection of business and technology. Our integrated services and platforms are built on the foundation of AI & Data, digital engineering, and operations transformation, bringing deep domain expertise and full stack capabilities together. With over 4,000 people in the US, Canada, Latin America and India and a large, diverse portfolio of Fortune 500 enterprises and fast-growing clients worldwide, we work across banking & payments, wealth management, telecom, hi-tech and life sciences.

We are seeking a highly experienced Senior Consultant / SRE Architect to lead the strategy, design, and implementation of enterprise-wide observability and reliability frameworks supporting business-critical transaction flows across distributed systems. In this role, you will act as a thought leader and architect, driving end-to-end transaction visibility, resilience, and performance optimization across microservices, APIs, databases, and third-party integrations. You will partner with engineering, architecture, and business stakeholders to define standards, influence technical direction, and implement scalable observability solutions.

This is a high-impact role focused on transforming SRE maturity, improving advisor experience, and enabling proactive, data-driven operations through modern observability practices. The ideal candidate is passionate about SRE, observability, and system design, with a proven ability to drive large-scale transformation initiatives.

Requirements

  • 10+ years of experience in SRE, Observability, or related roles, with a strong focus on architecture and strategy
  • Deep hands-on expertise with observability platforms such as Dynatrace, ELK, Datadog, Splunk, OpenTelemetry, and Jaeger
  • Proven experience designing observability solutions in cloud environments (AWS, Azure, GCP)
  • Strong understanding of microservices architecture, APIs, and distributed systems
  • Proficiency in programming/scripting (e.g., Python, Go, Java) for automation and integration
  • Demonstrated ability to lead cross-functional initiatives and influence technical direction

Nice To Haves

  • Dynatrace Associate or Professional Certification
  • Experience implementing OpenTelemetry standards at scale
  • Strong background in chaos engineering and resiliency testing
  • Familiarity with AIOps platforms and intelligent automation solutions
  • Consulting experience or prior role as an architect / technical advisor

Responsibilities

  • Define and lead the enterprise observability strategy for end-to-end transaction traceability across distributed systems
  • Architect scalable solutions leveraging tools such as Dynatrace, OpenTelemetry, ELK, Grafana, Datadog, Splunk, and Jaeger
  • Establish standardized frameworks for logging, metrics, tracing, and telemetry collection
  • Design and implement dependency mapping and service topology visualization across complex ecosystems
  • Provide architectural guidance for monitoring latency, throughput, and error rates across critical transaction paths
  • Lead root cause analysis using distributed tracing and telemetry data to resolve systemic performance issues
  • Partner with application and database teams to optimize system performance and scalability
  • Drive adoption of performance engineering best practices across teams
  • Define and implement resiliency strategies for business-critical transaction flows
  • Architect fault-tolerant systems, including failover, redundancy, and self-healing mechanisms
  • Lead and design chaos engineering initiatives to validate system resilience
  • Establish and govern Service Level Objectives (SLOs) and Service Level Indicators (SLIs) aligned to business outcomes
  • Act as a trusted advisor to engineering teams, architects, and leadership on observability and SRE best practices
  • Define and enforce standards, policies, and governance models for monitoring and tracing
  • Lead cross-functional initiatives to drive adoption of observability frameworks
  • Mentor engineers and SRE teams, fostering a culture of continuous improvement and operational excellence
  • Drive measurable improvements, including a 30% reduction in MTTD and MTTR within the first year, ≥70% root cause identification within 1 hour, and ≥90% proactive issue detection via monitoring systems
  • Develop executive-level reporting on system health, reliability trends, and performance metrics
  • Build reusable frameworks, accelerators, and playbooks for incident management and observability adoption
  • Establish comprehensive documentation for transaction flows, system dependencies, and observability architectures
  • Develop and standardize incident response playbooks and runbooks
  • Lead training and enablement initiatives to scale observability expertise across teams