Staff Engineer, CI/CD & Cloud Infrastructure

Foresite Labs (Stealth Co), San Diego, CA

About The Position

We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software spanning Python, C/C++, and CUDA on Linux is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.

This is a senior, hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.

Requirements

  • 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles
  • Strong, hands-on Linux expertise (administration, debugging, performance tuning)
  • Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)
  • Proven experience managing complex AWS infrastructure at scale
  • Strong knowledge of Docker including multi-stage builds, registries, and orchestration
  • Experience with infrastructure as code using Terraform
  • Experience with Kubernetes and Helm for container orchestration
  • Solid understanding of versioning strategies, artifact management, and release engineering
  • Experience integrating agentic AI into DevOps workflows and CI/CD pipelines
  • Proficiency in Python and shell scripting for automation and tooling
  • Ability to read, debug, and build C/C++ and CUDA applications on Linux
  • Experience integrating build systems (CMake, Make) into CI pipelines
  • Familiarity with package management and dependency resolution across languages
  • Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services
  • Experience managing on-premises Linux HPC infrastructure alongside cloud resources
  • Experience designing for high availability, failover, and disaster recovery
  • Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)
  • Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)
  • Understanding of tiered storage strategies and data lifecycle management
  • Knowledge of cost management, tagging strategies, and infrastructure governance
  • Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)
  • Understanding of build and artifact traceability practices
  • Experience with structured logging and distributed tracing concepts

Nice To Haves

  • Experience deploying software to embedded or lab-operated instruments
  • Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments
  • Experience with CUDA build toolchains and GPU-accelerated workloads
  • Familiarity with Azure or GCP in addition to AWS
  • Experience in regulated or reliability-sensitive environments
  • Experience with GitOps workflows and progressive delivery strategies
  • Familiarity with secrets management (Vault, AWS Secrets Manager)

Responsibilities

  • Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms
  • Manage build systems for Python, C/C++, and CUDA codebases on Linux
  • Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines
  • Implement robust versioning, tagging, and artifact management strategies
  • Ensure full traceability of builds, test results, and artifacts from commit to deployment
  • Manage Docker-based build environments including base images, caching, and reproducibility
  • Maintain and optimize build performance, parallelism, and reliability
  • Architect and manage complex AWS infrastructure, including:
      • IAM roles, policies, and access management
      • Storage services (S3, EBS, EFS) with tiered lifecycle policies
      • Databases (RDS, DynamoDB, or similar) with backup and failover strategies
      • Data workflow and pipeline engines (Step Functions, Airflow, or similar)
      • Compute services (EC2, ECS, EKS, Lambda) scaled to workload requirements
  • Implement infrastructure as code using Terraform
  • Manage Kubernetes clusters and Helm charts for containerized workloads
  • Design for scalability, high availability, and disaster recovery
  • Manage cost optimization, resource tagging, and infrastructure governance
  • Support multi-account and multi-region strategies as needed
  • Draw on familiarity with Azure and GCP to support secondary or hybrid requirements
  • Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing
  • Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration
  • Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)
  • Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute
  • Design and operate hybrid burst-to-cloud infrastructure — provision and manage AWS compute resources that extend local HPC capacity on demand
  • Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements
  • Manage OS patching, driver updates, and GPU runtime environments across HPC nodes
  • Monitor HPC cluster health, utilization, and capacity to inform scaling decisions
  • Design and operate data ingestion pipelines for high-volume experiment data from lab instruments
  • Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost
  • Deploy and manage search infrastructure (Elasticsearch or OpenSearch) to make experiment data universally discoverable and queryable
  • Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing
  • Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data
  • Design data lifecycle policies including retention, archival, and compliance requirements
  • Ensure data pipelines are reliable, idempotent, and observable with clear error handling and retry logic
  • Work with engineering and science teams to define data schemas, access patterns, and query requirements
  • Own deployment workflows for software delivered to embedded instruments in our central lab
  • Manage release processes for a small number of complex, high-value lab-operated instruments
  • Design deployment strategies that account for rollback, validation, and minimal downtime
  • Coordinate versioned releases across multiple software components and dependencies
  • Support development, staging, and production environment parity
  • Implement centralized log collection and aggregation across cloud and on-site systems
  • Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)
  • Ensure structured, searchable logging with clear correlation across services
  • Build dashboards and alerting for infrastructure health, pipeline status, and deployment state
  • Establish traceability standards linking builds, tests, artifacts, and deployments
  • Support diagnostics and post-mortem analysis for production incidents
  • Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting
  • Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks
  • Design guardrails and human-in-the-loop controls for AI-driven automation in production environments
  • Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling
  • Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability