Senior Site Reliability Engineer

OfficeSpace Software

About The Position

You own the performance, reliability, and cost efficiency of OfficeSpace’s production platform at scale. As a Senior Site Reliability Engineer, you shape how our systems run—fast, resilient, and predictable—while leading the shift from manual operations to AI-assisted reliability engineering. We provide the platform. You make it perform.

Requirements

7+ years operating and evolving large-scale production systems.
Deep Linux systems expertise with hands-on performance tuning across CPU, memory, disk, and networking.
Strong Python skills for automation, tooling, and AI-assisted systems workflows.
Production experience with Ruby/Rails ecosystems, including Puma and Sidekiq.
Proven ability to diagnose and resolve complex database performance issues (MySQL/MariaDB or PostgreSQL).
Advanced Kubernetes experience—workload sizing, scheduling, and multi-tenant operations.
Infrastructure-as-code mastery using Terraform and Terragrunt.
Experience with configuration management tools such as Puppet or Ansible.
Strong observability instincts across metrics, logs, and traces using tools like Prometheus, Grafana, Datadog, or ELK.
AI fluency—comfortable supervising AI agents for analysis, testing, and reporting, and validating their outputs.
A builder mindset. You move fast, take ownership, and raise standards.

Nice To Haves

Scaling and refactoring monolithic applications under real production load
Extracting databases or stateful components from monoliths
Apache and Nginx tuning at scale
Redis performance optimization and operational management
CI/CD systems and GitOps workflows, including ArgoCD
Cloud cost optimization and FinOps-aligned operational practices

Responsibilities

Drive measurable improvements in latency, throughput, and availability across a large-scale production environment.
Own system performance—from Linux internals to Kubernetes scheduling—and eliminate bottlenecks before customers feel them.
Define and enforce SLIs, SLOs, and error budgets that balance speed, reliability, and growth.
Partner with application engineers to profile code paths, improve execution efficiency, and harden services under real load.
Lead database performance optimization across queries, indexing, replication, and workload isolation.
Design and oversee AI-assisted load testing, stress testing, and capacity planning workflows.
Guide the migration from monolithic deployments to multi-tenant Kubernetes platforms.
Reduce infrastructure spend through architectural decisions, right-sizing, and intelligent scaling strategies.
Build and supervise automation for infrastructure provisioning, configuration management, and observability.
Set clear operational standards for reliability, performance, and incident response—and raise the bar for how we run production.

Benefits

Competitive Benefits and Rewards: OfficeSpace offers comprehensive benefits packages globally, designed to support our team’s health, well-being, and financial security. We invest in our people so they can excel.
