Order Management System (OMS) Staff Engineer

Levi Strauss•San Francisco, CA

1d•Hybrid

About The Position

As a Staff Engineer on the Order Management System (OMS) team, you will be an important technical voice, shaping how we build, operate, and evolve the mission-critical platform that powers commerce at Levi Strauss & Co. You will bring deep software engineering fundamentals to bear on hard problems: designing systems built for scale and untangling complexity so our platform can move faster with confidence. You operate at the intersection of engineering craft and real-world production ownership: you build it, you run it, you make it better.

Requirements

10+ years of experience in software engineering with a focus on backend systems, distributed architectures, and platform/product engineering at scale.
Deep, practical experience designing and modeling complex distributed systems—you articulate trade-offs and make well-reasoned architectural choices under constraints.
You have experience operating in a "you build it, you run it" engineering culture.
You've been on-call for systems you've built, responded to incidents, and used that experience to make better engineering decisions.
Experience building and running systems at scale: you've handled high-throughput, high-availability systems and have the scars and lessons to show for it.
Expert-level understanding of observability: you can instrument a system from scratch, build meaningful dashboards, tune alerting, and use telemetry data as a primary tool for engineering decisions.
A systematic, data-driven approach to diagnosing production issues: you stay calm and lead others when systems are on fire.
Demonstrated experience decoupling tightly-coupled systems—whether migrating a monolith, extracting a shared service, or replacing implicit temporal dependencies with well-defined async contracts.
Experience with event-driven architecture, domain-driven design, and modern API design patterns; you know where these patterns add value and where they add unnecessary complexity.
Mastery of CI/CD, automated testing, and DevOps practices; you view them as engineering fundamentals, not optional add-ons.
You can translate technical complexity for non-technical partners and write for engineering audiences—design docs, ADRs, incident reports, and code reviews all reflect your thinking.
Experience working with geographically distributed teams and navigating the complexities of multi-time zone collaboration.

Nice To Haves

Experience with Order Management Systems (OMS), fulfillment pipelines, or commerce platforms is a meaningful plus—familiarity with the domain accelerates your impact, but is not a prerequisite for the right engineer.

Responsibilities

Lead the design and domain modeling of complex, distributed systems within the OMS ecosystem, producing clear, well-reasoned service boundaries, data contracts, and event-driven interaction patterns that stand up to scrutiny and scale.
Champion domain-driven design (DDD) principles, working with product and engineering peers to identify bounded contexts, eliminate implicit coupling, and surface shared language across teams.
Guide decomposition of monolithic or tightly-coupled components into well-defined, independently deployable services—reducing blast radius, improving team autonomy, and promoting faster iteration.
Author architecture decision records (ADRs) and technical design documents that communicate the "why" alongside the "what," helping teams make decisions over time.
Write, review, and guide production-quality code with an emphasis on clarity, testability, and long-term maintainability—setting the bar for engineering craft on the team.
Apply modern software engineering practices: CI/CD pipelines, automated testing strategies, feature flagging, progressive delivery, and trunk-based development.
Identify and eliminate technical debt systematically, balancing short-term velocity with long-term system health through well-argued, incremental improvement plans.
Establish and promote coding standards, patterns, and best practices across the OMS team that are practical, enforceable, and grounded in production experience.
Operate with full production ownership: you design with failure in mind, participate in on-call rotations, and take accountability for the health and reliability of the systems you ship.
Embed reliability engineering into the development lifecycle—defining SLOs, error budgets, and reliability targets upfront rather than as an afterthought.
Treat runbooks, strategies, and operational documentation as first-class engineering artifacts, keeping them accurate, applicable, and tightly coupled to the systems they describe.
Design and implement comprehensive observability strategies—structured logging, distributed tracing, and metrics—so that you can localize any failure mode in production.
Develop dashboards that give engineers, on-call responders, and partners genuine operational insight into system health: not just uptime pings, but meaningful golden signals and business-relevant goals.
Define and tune alerting strategies that are signal-rich and noise-poor—ensuring you wake on-call engineers for relevant events, not symptoms of unrelated upstream noise.
Champion observability as a design constraint, ensuring new services are instrumented and telemetry quality is part of every code review and launch checklist.
Design systems that can sustain peak commercial volumes—seasonal traffic spikes, flash sales, and global expansion—without degraded experience or unplanned downtime.
Apply scalability patterns: asynchronous messaging, event sourcing, CQRS, caching strategies, database sharding, and graceful degradation, selecting the right tool for each problem.
Conduct and lead capacity planning exercises, load testing, and performance profiling—translating production data into informed infrastructure and architectural decisions.
Be the senior technical resource during complex production incidents—methodically narrowing hypotheses, leading war rooms, and restoring service while preserving forensic evidence for root cause analysis.
Facilitate blameless post-incident reviews (PIRs) that produce durable improvements—not just immediate fixes, but systemic changes that reduce the likelihood or impact of recurrence.
Develop institutional troubleshooting knowledge: document failure modes, known issues, and diagnostic techniques so the entire team grows more capable with each incident.
Partner with product managers, architects, and other engineers to translate our requirements into clear, achievable technical roadmaps—bridging the gap between strategy and implementation.
Mentor and level up mid-level engineers through hands-on code review, design feedback, pairing sessions, and direct coaching—building engineering depth across the OMS team.
Stay current with industry trends in distributed systems, event-driven architecture, and operational tooling, bringing informed perspectives on when to adopt new approaches versus doubling down on existing patterns.