Member of Technical Staff - Engineering Core Services, Infrastructure Shared Services

Everpure•Santa Clara, CA

Onsite

About The Position

We’re in an unbelievably exciting area of tech and are fundamentally reshaping the data storage industry. Here, you lead with innovative thinking, grow along with us, and join the smartest team in the industry. This type of work, work that changes the world, is what the tech industry was founded on. So, if you're ready to seize the endless opportunities and leave your mark, come join us.

The Role

As a key member of the Engineering Core Services team, you will champion the reliability and scalability of our global artifact management infrastructure. You’ll serve as the technical lead for our enterprise Artifactory clusters, ensuring the backbone of our CI/CD pipeline remains high-performing and resilient. In this high-impact role, you will collaborate across Infrastructure and Shared Services (ISS) to drive architectural excellence and bridge the gap between development and operations.

Requirements

Deep Infrastructure Expertise: Proven ability to manage complex applications across hybrid-cloud environments (AWS, Azure) and on-premises infrastructure, including VMware and Kubernetes.
Advanced Scripting and Programming: Proficiency in Python, Go, and Bash, applied to infrastructure automation, reliability engineering, and system integration.
Monitoring and Observability Mastery: Hands-on experience leveraging tools like Datadog, Elastic, or Prometheus to proactively identify and resolve system degradation before it impacts users.
Agile Operational Mindset: Experience working within Agile frameworks and Jira to manage high-priority tasks and drive issues to resolution in a fast-paced production environment.
Collaborative Leadership: Exceptional communication skills with a track record of taking ownership of core systems, applying constructive feedback, and leading cross-functional troubleshooting efforts.

Responsibilities

Architect and Optimize Production Systems: Design, operate, and scale distributed Artifactory clusters to ensure 24/7 availability and performance for mission-critical engineering workflows.
Drive Operational Excellence: Implement SRE and DevOps best practices by automating manual interventions and managing messaging and event streaming systems like RabbitMQ or Kafka to ensure seamless data flow.
Lead Rapid Incident Response: Own the health of production storage environments by participating in business-hour on-call rotations and resolving complex hardware/software bottlenecks.
Build Technical Knowledge Bases: Curate and maintain high-quality technical documentation, runbooks, and troubleshooting guides to empower the broader engineering organization.
Scale Infrastructure via Automation: Utilize Python, Go, and Bash to build self-healing systems and automate the provisioning of Linux, VMware, and Kubernetes environments.