Senior Site Reliability Engineer

Order.co
$175,000 - $200,000
Hybrid

About The Position

As a Senior Site Reliability Engineer on the Platform team, you will ensure that software systems are reliable, scalable, performant, and operationally efficient. You blend software engineering skills with infrastructure and operations expertise to keep critical systems running smoothly while enabling rapid product development.

Requirements

  • Strong foundation in computer science fundamentals: data structures, algorithms, and system design
  • Familiarity with building production-grade applications and services using Ruby and Ruby on Rails
  • Deep expertise with Linux systems administration and production troubleshooting
  • Strong experience operating cloud infrastructure at scale, particularly within AWS environments
  • Experience with Kubernetes, container orchestration, and cloud-native infrastructure patterns
  • Proficiency with infrastructure as code tools such as Terraform or CloudFormation
  • Expertise designing and operating CI/CD pipelines and deployment automation systems
  • Deep understanding of observability tooling including Datadog, OpenTelemetry, or similar platforms
  • Strong knowledge of distributed systems reliability patterns including redundancy, failover, autoscaling, rate limiting, and graceful degradation
  • Experience building automation and operational tooling using languages such as Python, Go, Bash, or Ruby
  • Strong understanding of networking fundamentals including DNS, load balancing, TLS, VPNs, firewalls, and service discovery
  • Hands-on experience with incident response, root-cause analysis, and production operations in high-availability environments
  • Familiarity with SRE methodologies including SLOs, SLIs, error budgets, capacity planning, and operational maturity modeling
  • Experience implementing secure infrastructure and cloud security best practices including IAM, secrets management, and vulnerability remediation
  • Proven ability to design scalable, resilient, and maintainable platform systems and APIs
  • Experience supporting distributed microservices architectures and event-driven systems
  • Strong understanding of operational excellence principles including automation-first engineering and toil reduction
  • Experience using AI-assisted engineering tools (e.g., Claude, GitHub Copilot) as force multipliers while applying sound operational and engineering judgment
  • Excellent debugging and systems thinking skills across infrastructure, networking, application, and platform layers
  • You are motivated by accountability — you own outcomes, not just tasks
  • You are results-oriented and measure success by shipped, working software
  • You are motivated by correctness in code that touches money — the consequences of a bug land on real customer balances, and you take that seriously
  • You love helping people on your team grow and improve
  • Writing tests is an integral part of your development process, not an afterthought
  • You know how to design and build software incrementally — you don't need a complete spec to make progress
  • Collaborating with the people around you to achieve a goal motivates you
  • You are collaborative, open-minded, and actively developing your craft
  • You are curious and pragmatic about AI-driven solutions — you apply them where they add real value and stay skeptical where they don't
  • Familiarity with AI-assisted development tools — you understand how they work, where they help, and where they fail. Prior hands-on use is a plus; intellectual curiosity and the instinct to evaluate AI output critically are what matter

Nice To Haves

  • Reliable delivery of complex work: you consistently ship multi-part solutions on time with low defect rates
  • Low defects in owned areas: you proactively monitor and improve the quality of the systems you own, which means incident-free quarters in code paths that move funds and clean reconciliation against vendor reports
  • Measurable mentorship impact: engineers around you write better code because of your reviews and guidance

Responsibilities

  • Design, build, and operate highly available, scalable, and fault-tolerant infrastructure and platform services
  • Own reliability, availability, latency, and operational excellence for critical production systems and services
  • Define and maintain service level objectives (SLOs), service level indicators (SLIs), and error budgets across platform systems
  • Lead incident response efforts for complex production outages; drive root-cause analysis and long-term remediation actions
  • Build resilient systems that gracefully handle failures, traffic spikes, dependency degradation, and regional outages
  • Continuously improve system reliability through automation, observability, performance tuning, and capacity planning
  • Develop infrastructure automation and self-service tooling to reduce operational toil and improve engineering velocity
  • Build and maintain CI/CD pipelines, deployment automation, and release engineering workflows
  • Implement infrastructure as code (IaC) practices using tools such as Terraform and CloudFormation, alongside container orchestration
  • Improve developer experience by building reliable internal platforms, operational tooling, and standardized deployment patterns
  • Drive adoption of GitOps, immutable infrastructure, and automated remediation patterns
  • Design and maintain comprehensive monitoring, logging, tracing, and alerting systems for distributed services
  • Establish actionable alerting standards that reduce noise while improving incident detection and response times
  • Analyze production trends, system bottlenecks, and failure patterns to proactively prevent incidents
  • Lead operational readiness reviews, disaster recovery planning, and game-day exercises
  • Improve mean time to detect (MTTD) and mean time to recovery (MTTR) through tooling, automation, and process refinement
  • Participate actively in architecture and infrastructure design reviews
  • Propose scalable and reliable platform designs that account for multi-region deployment, redundancy, failover, and security considerations
  • Evaluate trade-offs between reliability, scalability, operational complexity, and engineering velocity
  • Identify systemic risks and operational gaps before they become production incidents
  • Partner with engineering teams to ensure services are designed with operability, observability, and resilience in mind from day one
  • Approach infrastructure and operational practices with a strong security mindset
  • Implement and maintain secure cloud networking, secrets management, IAM policies, and infrastructure hardening standards
  • Partner with Security and Compliance teams to ensure systems meet organizational and regulatory requirements
  • Drive operational best practices around vulnerability management, patching, and production access controls
  • Scope and estimate infrastructure and reliability initiatives accurately
  • Coordinate production rollouts, maintenance events, and reliability improvements across teams
  • Communicate operational risks, dependencies, and incident impacts clearly to technical and non-technical stakeholders
  • Collaborate closely with Software Engineering, Security, Product, and Operations teams to improve platform reliability and scalability
  • Serve as a trusted escalation point during critical production incidents
  • Mentor junior and mid-level engineers on reliability engineering principles, operational excellence, and infrastructure best practices
  • Raise the operational maturity of the engineering organization through documentation, reviews, and technical guidance
  • Influence technical decisions through credibility, operational expertise, and strong engineering judgment

Benefits

  • Competitive compensation including base salary, bonus, and equity
  • Employer-sponsored 401(k) with match
  • Comprehensive medical, dental, and vision coverage
  • Flexible time off and hybrid work environment