Senior AI-Native DevOps / Operations Engineer (AMER)

Valency Systems•Berkeley, CA

Hybrid

About The Position

Valency Systems is seeking an AI-native DevOps / Operations Engineer to build and operate the platform behind Valency. The role involves designing and hardening production systems, improving CI/CD and release quality, improving reliability and response times, and creating the observability, analytics, and guardrails needed for a rapidly evolving platform. The position sits at the intersection of platform engineering, cloud infrastructure, production operations, and AI-era software delivery, with the aim of closing the loop from agentically written software to reliable, performant production systems. It is ideal for someone who has scaled high-growth SaaS systems, enjoys building from first principles, and wants to replicate that growth in a new environment. The team operates on a hybrid model, with 3 days in-person and 2 days remote.

Requirements

8+ years of progressively increasing responsibility operating important production systems
Demonstrated success shipping and running high-reliability systems in production
Deep AWS experience in real production environments
Strong background in software engineering and testing, not just infrastructure administration
Experience designing or significantly improving CI/CD systems and release processes
Experience building or operating logging, monitoring, alerting, and observability systems
Experience improving production reliability, performance, and operational response
Comfort with container-based systems and orchestration platforms
Strong hands-on ability in several of: Python, Go, Elixir, AWS CDK
Strong judgment around guardrails, operational safety, and change management
Ability to work in ambiguity and build systems that do not yet fully exist
Candidates must be legally authorized to work in the United States.

Nice To Haves

Startup experience, especially in fast-scaling environments
Experience at high-scale SaaS companies that have gone through periods of rapid growth
Experience owning or materially influencing platform engineering functions
Experience with cost engineering / FinOps in AWS-heavy environments
Experience designing systems for compliance-oriented environments
Experience with SOC 2, ISO 27001, or FedRAMP-related operational requirements
Experience evaluating or implementing modern observability and workflow tracing stacks
Experience creating human-in-the-loop approval systems for sensitive production workflows

Responsibilities

Design, build, and improve the production platform powering Valency
Tighten CI/CD processes so that changes ship tested, gated, observable, and safe
Improve production reliability, latency, deployment safety, and incident response
Build operational feedback loops that let engineering and product teams act on production behavior
Establish logging, analytics, tracing, alerting, and workflow instrumentation as the platform scales
Define and implement guardrails for agent-involved software delivery and operations
Introduce human-in-the-loop approval flows where autonomous actions require stronger controls
Improve cost efficiency across cloud infrastructure and platform operations
Help shape security, compliance, and auditability foundations for SOC 2, ISO 27001, and FedRAMP-oriented environments
Contribute to long-term platform engineering direction
Own production operations and operational excellence
Set and uphold expectations for incident response
Establish the operating model for broader team scaling
Own and improve CI/CD pipelines, release controls, and deployment workflows
Build and maintain highly reliable AWS-based production systems
Improve observability across logs, metrics, traces, events, and workflow state
Instrument platform behavior so that system issues, regressions, and slowdowns are quickly visible and actionable
Create operational analytics to close the loop between engineering, product, and customer experience
Drive cost engineering and infrastructure efficiency
Build safer operating patterns for agent-assisted code changes and operational actions
Implement testing, validation, approval, and rollback mechanisms to reduce operational risk
Improve batch, queue, cache, and job-processing reliability and monitoring
Support incident response, root cause analysis, postmortems, and follow-through
Work with external vendors and partners
Help define platform standards, reliability practices, and operational maturity