Senior DevOps/Platform Engineer III - Richland, WA

Pacific Northwest National Laboratory•Richland, WA

Onsite

About The Position

We are seeking a Senior DevOps/Platform Engineer to join PNNL's advanced AI engineering initiatives, contributing to next-generation systems spanning agentic AI platforms, large-scale data orchestration, and real-time intelligence processing. In this role, you'll apply your expertise in scalable system design and AI/ML engineering to build mission-critical capabilities while developing your technical leadership and establishing yourself as a key contributor to our engineering community.

You're an accomplished engineer with strong foundations in DevOps, scalable system design, AI/ML development, and production software engineering. You're ready to take on increasing technical responsibility, leading components of complex systems while mentoring junior team members. You excel at translating technical requirements into working solutions, selecting appropriate approaches for challenging problems, and contributing meaningfully to technical direction and project success.

Requirements

Demonstrated proficiency in Python and working knowledge of at least one additional language (C#/.NET, Go, C++) for infrastructure automation and tooling development
Knowledge of Infrastructure as Code principles and tools including Terraform, CloudFormation, Pulumi, or ARM templates with emphasis on modular, reusable code patterns
Ability to design, implement, and maintain sophisticated CI/CD pipelines across multiple environments using tools such as Jenkins, GitLab CI, GitHub Actions, or Azure DevOps
Proficiency with version control workflows (Git), GitOps methodologies, automated testing frameworks for infrastructure code, and policy-as-code practices with consistent use of AI assist tools (e.g., Claude, GitHub Copilot) to accelerate automation and troubleshooting
Demonstrated experience designing and managing infrastructure across cloud platforms (AWS, Azure, or GCP) with multi-cloud experience highly valued
Strong expertise with containerization technologies (Docker) and container orchestration platforms (Kubernetes, EKS, AKS, or GKE) including advanced concepts like operators, custom resources, and cluster management
Ability to design and implement event-driven architectures using cloud-native services (AWS EventBridge, Azure Event Grid, Pub/Sub) and messaging systems with understanding of service mesh technologies (Istio, Linkerd) and API gateway patterns
Knowledge of networking concepts in cloud and containerized environments including CNI plugins, ingress controllers, load balancing, and service discovery with familiarity in edge computing deployments and hybrid cloud architectures
Ability to implement comprehensive observability solutions including metrics collection (Prometheus, CloudWatch), distributed tracing (Jaeger, Tempo), and centralized logging (ELK Stack, Loki, Splunk)
Understanding of Site Reliability Engineering (SRE) principles including SLOs, SLIs, error budgets, and incident response with ability to design and implement chaos engineering practices to improve system resilience
Experience implementing security best practices including secrets management (Vault, AWS Secrets Manager), vulnerability scanning, and DevSecOps tooling
Knowledge of disaster recovery strategies, backup automation, and business continuity planning with understanding of compliance frameworks and ability to implement automated compliance controls
Understanding of cloud-native data pipeline architectures and ETL/ELT orchestration (AWS Glue, Azure Data Factory, Airflow, Prefect) with ability to build and maintain infrastructure supporting ML pipelines, model training workflows, and MLOps practices
Knowledge of deploying and operating cloud-based data storage systems and platforms (S3, Redshift, Delta Lake, PostgreSQL, MongoDB, OpenSearch, Snowflake)
Understanding of distributed data processing frameworks (Spark/Databricks, Kafka, Flink) with experience operating Kubernetes-based platforms for data workloads including Spark on K8s, Ray clusters, or Kubeflow
Ability to implement infrastructure supporting large-scale data systems with appropriate monitoring, cost optimization, and performance tuning including storage tiering, data lifecycle management, and compute resource optimization
Strong problem-solving abilities with experience troubleshooting complex distributed systems spanning applications, infrastructure, and data layers
Excellent communication skills to collaborate effectively with software engineers, data scientists, security teams, and business stakeholders with ability to create clear, comprehensive documentation for infrastructure designs, runbooks, and disaster recovery procedures
Demonstrated capacity to manage multiple infrastructure initiatives simultaneously while maintaining high availability and reliability standards with proven ability to mentor team members on DevOps practices and operational excellence
Experience participating in on-call rotations, incident response, and post-mortem processes with ability to balance tactical operational needs with strategic infrastructure improvements
U.S. Citizenship
Ability to obtain and maintain a federal security clearance.
Background Investigation: Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements for access to classified matter in accordance with 10 CFR 710, Appendix B.
Drug Testing: All Security Clearance positions are Testing Designated Positions, which means that the applicant selected for hire is subject to pre-employment drug testing, and post-employment random drug testing. In addition, applicants must be able to demonstrate non-use of illegal drugs, including marijuana, for the 12 consecutive months preceding completion of the requisite Questionnaire for National Security Positions (QNSP).
Note: Applicants will be considered ineligible for security clearance processing by the U.S. Department of Energy if non-use of illegal drugs, including marijuana, for 12 months cannot be demonstrated.
This position is a Testing Designated Position (TDP). The candidate selected for this position will be subject to pre-employment and random drug testing for illegal drugs, including marijuana, consistent with the Controlled Substances Act and the PNNL Workplace Substance Abuse Program.
As a national laboratory, PNNL is responsible for adhering to the Homeland Security Presidential Directive 12 (HSPD-12) and Department of Energy (DOE) Order 473.1A, which require new employees to obtain and maintain an HSPD-12 Personal Identity Verification (PIV) Credential. To obtain this credential, new employees must successfully complete the applicable tier of federal background investigation post hire and receive a favorable federal adjudication. The tier of federal background investigation will be determined by job duties and national security or public trust responsibilities associated with the job. All tiers of investigation include a declaration of illegal drug activities, including use, supply, possession, or manufacture within the last 1 to 7 years (depending on the applicable tier of investigation). Illegal drug activities include marijuana and cannabis derivatives, which are still considered illegal under federal law, regardless of state laws.
For foreign national candidates: If you have not resided in the U.S. for three consecutive years, you are not eligible for the PIV credential and instead will need to obtain a favorable Local Site Specific Only (LSSO) Federal risk determination to maintain employment. Once you meet the three-year residency requirement thereafter, you will be required to obtain a PIV credential to maintain employment. The tier of federal background investigation required to obtain the PIV credential will be determined by job duties at the time you become eligible for the PIV credential.
The Department of Energy (DOE) prohibits DOE employees and contractors from having any affiliation with the foreign government of a country DOE has identified as a “country of risk” without explicit approval by DOE and Battelle. If you are offered a position at PNNL and currently have any affiliation with the government of one of these countries, you will be required to disclose this information and either recuse yourself from that affiliation or receive approval from DOE and Battelle prior to your first day of employment.

Nice To Haves

Degree in computer science, software engineering, or related field
Experience contributing to technical direction and independently structuring complex problems into actionable work, in collaboration with senior engineers and cross-functional teams
Expertise in Python and proficiency in at least one other language (C#/.NET, C++, Go)
3-5 years of hands-on DevOps, Platform Engineering, Site Reliability Engineering, or Infrastructure Engineering experience
Contributions to open-source infrastructure projects or active participation in DevOps communities

Responsibilities

Develop and deploy agentic AI systems with reasoning and decision-making capabilities
Build components of LLM orchestration frameworks using LangChain, LlamaIndex, and emerging platforms
Contribute to MLOps platforms including experiment tracking, model versioning, and deployment pipelines
Create developer tooling, utilities, and interfaces for AI-native frameworks
Integrate multi-modal data sources into cohesive processing pipelines
Develop microservices within distributed architectures handling high-throughput workloads
Build components of real-time streaming platforms and event-driven systems
Implement data pipelines for large-scale ETL, data processing, and analytics
Deploy containerized applications using Kubernetes and support CI/CD pipelines
Contribute to systems deployed in secure and edge environments
Deploy AI systems with appropriate monitoring, logging, and observability
Ensure code quality and adherence to security best practices and compliance standards
Build geospatial processing, time-series, and data fusion capabilities
Support system performance optimization and troubleshooting
Lead technical components of projects and tasks
Mentor junior staff and contribute to team knowledge sharing
Participate in design discussions and contribute to architectural decisions
Support proposal development with technical content and scoping
Build effective collaborations across teams and S&E domains