Staff Engineer, CI/CD & Cloud Infrastructure

Foresite Labs (Stealth Co), San Diego, CA

About The Position

We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software spanning Python, C/C++, and CUDA on Linux is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.

This is a senior, hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.

Requirements

  • 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles
  • Strong, hands-on Linux expertise (administration, debugging, performance tuning)
  • Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)
  • Proven experience managing complex AWS infrastructure at scale
  • Strong knowledge of Docker including multi-stage builds, registries, and orchestration
  • Experience with infrastructure as code using Terraform
  • Experience with Kubernetes and Helm for container orchestration
  • Solid understanding of versioning strategies, artifact management, and release engineering
  • Experience integrating agentic AI into DevOps workflows and CI/CD pipelines
  • Proficiency in Python and shell scripting for automation and tooling
  • Ability to read, debug, and build C/C++ and CUDA applications on Linux
  • Experience integrating build systems (CMake, Make) into CI pipelines
  • Familiarity with package management and dependency resolution across languages
  • Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services
  • Experience managing on-premises Linux HPC infrastructure alongside cloud resources
  • Experience designing for high availability, failover, and disaster recovery
  • Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)
  • Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)
  • Understanding of tiered storage strategies and data lifecycle management
  • Knowledge of cost management, tagging strategies, and infrastructure governance
  • Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)
  • Understanding of build and artifact traceability practices
  • Experience with structured logging and distributed tracing concepts

Nice To Haves

  • Experience deploying software to embedded or lab-operated instruments
  • Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments
  • Experience with CUDA build toolchains and GPU-accelerated workloads
  • Familiarity with Azure or GCP in addition to AWS
  • Experience in regulated or reliability-sensitive environments
  • Experience with GitOps workflows and progressive delivery strategies
  • Familiarity with secrets management (Vault, AWS Secrets Manager)

Responsibilities

  • Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms
  • Manage build systems for Python, C/C++, and CUDA codebases on Linux
  • Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines
  • Implement robust versioning, tagging, and artifact management strategies
  • Ensure full traceability of builds, test results, and artifacts from commit to deployment
  • Manage Docker-based build environments including base images, caching, and reproducibility
  • Maintain and optimize build performance, parallelism, and reliability
  • Architect and manage complex AWS infrastructure, including:
      • IAM roles, policies, and access management
      • Storage services (S3, EBS, EFS) with tiered lifecycle policies
      • Databases (RDS, DynamoDB, or similar) with backup and failover strategies
      • Data workflow and pipeline engines (Step Functions, Airflow, or similar)
      • Compute services (EC2, ECS, EKS, Lambda) scaled to workload requirements
  • Implement infrastructure as code using Terraform
  • Manage Kubernetes clusters and Helm charts for containerized workloads
  • Design for scalability, high availability, and disaster recovery
  • Manage cost optimization, resource tagging, and infrastructure governance
  • Support multi-account and multi-region strategies as needed
  • Draw on familiarity with Azure and GCP to support secondary or hybrid requirements
  • Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing
  • Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration
  • Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)
  • Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute
  • Design and operate hybrid burst-to-cloud infrastructure — provision and manage AWS compute resources that extend local HPC capacity on demand
  • Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements
  • Manage OS patching, driver updates, and GPU runtime environments across HPC nodes
  • Monitor HPC cluster health, utilization, and capacity to inform scaling decisions
  • Design and operate data ingestion pipelines for high-volume experiment data from lab instruments
  • Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost
  • Deploy and manage search infrastructure (Elasticsearch or OpenSearch) to make experiment data universally discoverable and queryable
  • Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing
  • Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data
  • Design data lifecycle policies including retention, archival, and compliance requirements
  • Ensure data pipelines are reliable, idempotent, and observable with clear error handling and retry logic
  • Work with engineering and science teams to define data schemas, access patterns, and query requirements
  • Own deployment workflows for software delivered to embedded instruments in our central lab
  • Manage release processes for a small number of complex, high-value lab-operated instruments
  • Design deployment strategies that account for rollback, validation, and minimal downtime
  • Coordinate versioned releases across multiple software components and dependencies
  • Support development, staging, and production environment parity
  • Implement centralized log collection and aggregation across cloud and on-site systems
  • Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)
  • Ensure structured, searchable logging with clear correlation across services
  • Build dashboards and alerting for infrastructure health, pipeline status, and deployment state
  • Establish traceability standards linking builds, tests, artifacts, and deployments
  • Support diagnostics and post-mortem analysis for production incidents
  • Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting
  • Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks
  • Design guardrails and human-in-the-loop controls for AI-driven automation in production environments
  • Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling
  • Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability