Lawrence Berkeley National Laboratory • Posted 23 days ago
Full-time • Mid Level
Hybrid • Berkeley, CA
101-250 employees

Berkeley Lab’s (LBNL) Environmental Genomics and Systems Biology (EGSB) Division is looking for a DevOps Software Engineer to join the US Department of Energy’s (DOE) Systems Biology Knowledgebase (KBase) team! In this exciting role, you will contribute directly to an open-source platform that is transforming how biologists and data scientists collaborate, share data, and accelerate discovery. KBase integrates massive biological datasets and powerful computational tools into a unified, extensible system that supports transparent, reproducible science.

This position will contribute to the core infrastructure of the system, supporting the operation and evolution of an advanced platform that integrates cloud-native software, on-premise hardware, and high-performance computing hosted in National Lab data centers. This role is responsible for proactively identifying and resolving complex issues to ensure the platform's stability, performance, and scalability. This position has an anticipated start date of February 2, 2026.

We’re here for the same mission: to bring science solutions to the world. Join our team and YOU will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!

Why join Berkeley Lab? We invest in our employees by offering a total rewards package you can count on:

  • Exceptional health and retirement benefits, including pension or 401K-style plans
  • A culture where you’ll belong - we are invested in our teams!
  • In addition to accruing vacation and sick time, we also have an annual Winter Holiday Shutdown
  • Parental bonding leave (for both mothers and fathers)
  • Pet insurance

  • Develop and implement automation to deploy, configure, and support on-premise compute resources and services (e.g., databases, microservices, LLMs, monitoring systems, object storage such as MinIO, and high-performance computing (HPC)).
  • Design, implement, and support robust monitoring, alerting, and logging solutions for infrastructure and platform services.
  • Ensure the security, reliability, and performance of KBase's on-premise hardware and software stack by documenting, hardening, and continuously improving its security posture in adherence with National Lab and DOE security standards.
  • Develop and maintain comprehensive documentation for infrastructure designs, configurations, and operational procedures.
  • Implement DevSecOps pipelines, best practices, and security scanning (SCA/SAST) for infrastructure and software components.
  • A Bachelor’s Degree (or equivalent knowledge/training) in Computer Science, Engineering, or a related field and a minimum of 5 years of relevant experience as a Software Infrastructure Engineer, DevOps Engineer, Site Reliability Engineer (SRE), or similar role, or an equivalent combination of education and experience.
  • Experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible), containerization technologies (e.g., Docker), and container orchestration platforms (e.g., Kubernetes).
  • Experience with containerization (Docker) and Kubernetes orchestration, including Helm, operators, and resource management for data-intensive workloads.
  • Experience with version control systems (e.g., Git), CI/CD pipelines, monitoring, and observability tools (e.g., Prometheus, Grafana, ELK stack or similar).
  • Experience with the deployment and management of relational and/or NoSQL databases.
  • Expert-level knowledge of Linux operating systems, system administration, and proficiency in scripting languages (e.g., Python, Bash, Go).
  • Proficiency in Python, with the ability to write modular, production-ready software and integrate it into cloud-native workflows.
  • Demonstrated understanding of core DevOps and software engineering principles for on-premise distributed systems, microservices, and HPC architectures.
  • Familiarity with object storage systems such as MinIO or AWS S3 and understanding of data lifecycle management in distributed storage.
  • Familiarity with Apache Spark (PySpark, SparkSQL, or Structured Streaming) and distributed data processing frameworks.
  • Excellent oral and written communication skills, including experience organizing and presenting information to technical and non-technical audiences.
  • Strong analytical skills, including experience identifying and solving complex technical problems.
  • Demonstrated interpersonal skills, including experience collaborating with a variety of scientific, operations, and technical teams.
  • Must be available to come onsite as required to access the server room for maintenance or troubleshooting.
  • A Master’s Degree (or equivalent knowledge/training) in Computer Science, Engineering, or a related discipline.
  • Experience with Computational or Systems Biology within an academic or research environment.
  • Experience with virtualization technologies (e.g., KVM), distributed messaging or search systems (e.g., Kafka, Elasticsearch), and MLOps practices and tools (e.g., MLflow, Kubeflow, model-serving infrastructure).
  • Experience with HPC environments and workload managers/schedulers (e.g., Slurm, HTCondor, PBS).