Netskope · posted about 2 months ago
Full-time • Mid Level
Santa Clara, CA
1,001-5,000 employees
Publishing Industries

We are a team of software engineers focused on improving the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our engineering stacks. If you are passionate about solving complex problems and developing cloud services at scale, we would like to speak with you. As an SRE, you will be critical to deploying and managing the cutting-edge infrastructure that underpins our AI/ML operations, and you will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. Your expertise will also extend to setting up and maintaining monitoring, logging, and alerting systems that oversee extensive training runs and client-facing APIs. You will ensure that training environments are highly available and efficiently managed across multiple clusters, enhancing our containerization and orchestration systems with tools such as Docker and Kubernetes.

  • Work closely with AI/ML engineers and researchers on the design and architecture of AI/ML applications for scale and reliability.
  • Design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.
  • Participate in production troubleshooting of AI/ML application code as well as infrastructure configurations.
  • Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.
  • Ensure training environments are consistently available and prepared across multiple clusters.
  • Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.
  • Operate and oversee large Kubernetes clusters with GPU workloads.
  • Improve the reliability, quality, and time-to-market of our suite of software solutions.
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
  • Provide primary operational support and engineering for multiple large-scale distributed software applications.
  • Model training
  • Hugging Face Transformers
  • PyTorch
  • LLM
  • TensorRT
  • Infrastructure as code tools like Terraform
  • Scripting languages such as Python or Bash
  • Cloud platforms such as Google Cloud, AWS or Azure
  • Git and GitHub workflows
  • Tracing and Monitoring
  • Familiar with high-performance, large-scale ML systems
  • Have a knack for troubleshooting complex systems and enjoy solving challenging problems
  • Proactive in identifying problems, performance bottlenecks, and areas for improvement
  • Take pride in building and operating scalable, reliable, secure systems
  • Familiar with monitoring tools such as Prometheus, Grafana, or similar
  • Are comfortable with ambiguity and rapid change
  • 8+ years building core infrastructure
  • Experience running inference clusters at scale
  • Experience operating orchestration systems such as Kubernetes at scale