Senior AI-Native DevOps / Operations Engineer (AMER)

Valency Systems, Berkeley, CA
Hybrid

About The Position

Valency Systems is seeking a Senior AI-Native DevOps / Operations Engineer to build and operate the platform behind Valency. You will design and harden production systems, improve CI/CD and release quality, strengthen reliability and response times, and build the observability, analytics, and guardrails that a rapidly evolving platform needs. The position sits at the intersection of platform engineering, cloud infrastructure, production operations, and AI-era software delivery, with the goal of closing the loop from software written by AI agents to reliable, performant production systems. It is ideal for someone who has scaled high-growth SaaS systems, enjoys building from first principles, and wants to repeat that growth in a new environment. The team works a hybrid schedule: 3 days in person and 2 days remote.

Requirements

  • 8+ years operating critical production systems, with progressively increasing responsibility
  • Demonstrated success shipping and running high-reliability systems in production
  • Deep AWS experience in real production environments
  • Strong background in software engineering and testing, not just infrastructure administration
  • Experience designing or significantly improving CI/CD systems and release processes
  • Experience building or operating logging, monitoring, alerting, and observability systems
  • Experience improving production reliability, performance, and operational response
  • Comfort with container-based systems and orchestration platforms
  • Strong hands-on ability in one or more of: Python, Go, Elixir, or AWS CDK
  • Strong judgment around guardrails, operational safety, and change management
  • Ability to work in ambiguity and build systems that do not yet fully exist
  • Legal authorization to work in the United States

Nice To Haves

  • Startup experience, especially in fast-scaling environments
  • Experience at high-scale SaaS companies that have gone through periods of rapid growth
  • Experience owning or materially influencing platform engineering functions
  • Experience with cost engineering / FinOps in AWS-heavy environments
  • Experience designing systems for compliance-oriented environments
  • Experience with SOC 2, ISO 27001, or FedRAMP-related operational requirements
  • Experience evaluating or implementing modern observability and workflow tracing stacks
  • Experience creating human-in-the-loop approval systems for sensitive production workflows

Responsibilities

  • Design, build, and improve the production platform powering Valency
  • Tighten CI/CD processes so that changes ship tested, gated, observable, and safe
  • Improve production reliability, latency, deployment safety, and incident response
  • Build operational feedback loops for engineering and product teams to act on production behavior
  • Establish logging, analytics, tracing, alerting, and workflow instrumentation as the platform scales
  • Define and implement guardrails for agent-involved software delivery and operations
  • Introduce human-in-the-loop approval flows for autonomous actions that require stronger controls
  • Improve cost efficiency across cloud infrastructure and platform operations
  • Help shape security, compliance, and auditability foundations for SOC 2, ISO 27001, and FedRAMP-oriented environments
  • Contribute to long-term platform engineering direction
  • Own production operations and operational excellence
  • Set and lead incident response expectations
  • Establish the operating model as the broader team scales
  • Own and improve CI/CD pipelines, release controls, and deployment workflows
  • Build and maintain highly reliable AWS-based production systems
  • Improve observability across logs, metrics, traces, events, and workflow state
  • Instrument platform behavior so that system issues, regressions, and slowdowns are visible and actionable quickly
  • Create operational analytics to close the loop between engineering, product, and customer experience
  • Drive cost engineering and infrastructure efficiency
  • Build safer operating patterns for agent-assisted code changes and operational actions
  • Implement testing, validation, approval, and rollback mechanisms to reduce operational risk
  • Improve batch, queue, cache, and job-processing reliability and monitoring
  • Support incident response, root cause analysis, postmortems, and follow-through
  • Work closely with external vendors and partners
  • Help define platform standards, reliability practices, and operational maturity

Benefits

  • Competitive salary
  • Benefits
  • Meaningful equity