About The Position

The Senior Manager, Site Reliability Engineering-FedRAMP at Splunk is responsible for leading a team that manages one of the largest cloud-scale, big data, and microservices platforms. This role focuses on ensuring the reliability and resiliency of services and infrastructure while driving initiatives in automation and infrastructure-as-code. The manager will mentor engineers, oversee reliability projects, and collaborate across the organization to deliver high-quality products that enhance customer experiences.

Requirements

  • 8+ years of experience in handling large-scale cloud-native microservices platforms.
  • 4+ years of strong hands-on management experience with teams deploying and monitoring large-scale Kubernetes clusters in AWS or GCP.
  • Experience in infrastructure automation and scripting using Python and/or Golang.
  • Experience managing remote teams.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack for observability in microservices deployments.
  • Experience with deployment and operations of large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, and Redis.

Nice To Haves

  • Familiarity with compliance environments such as HIPAA, GovCloud, State Government, Federal Government, SOC 2, or FedRAMP.
  • AWS Solutions Architect certification preferred.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications preferred.
  • Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, or ARM templates.
  • Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, or Argo.

Responsibilities

  • Lead a team of engineers focused on large-scale distributed systems for Splunk Cloud Observability in FedRAMP environments.
  • Manage cross-organizational efforts to deliver quality products that meet user needs.
  • Mentor and develop teams of engineers building a cloud-based environment for massive-scale data processing.
  • Collaborate with Talent Acquisition to recruit and hire top engineering talent for the SRE FedRAMP team.
  • Oversee reliability projects including HA, Business Continuity Planning, disaster recovery, and backup/restore.
  • Implement chaos engineering practices to enhance system reliability.
  • Manage application uptime and performance, including capacity management and planning.
  • Establish SLIs, SLOs, error budgets, and monitoring dashboards.
  • Oversee deployment and operations of large-scale distributed data stores and streaming services.
  • Create design patterns for monitoring and benchmarking.
  • Document production runbooks and guidelines for developers.
  • Drive tooling, toil reduction, and automation for production environments.
  • Manage incident response and improve MTTD/MTTR for services.
  • Optimize cloud costs.

Benefits

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • 401(k) plan with match
  • Paid time off
  • Flexible working arrangements
  • Incentive compensation
  • Equity or long-term cash awards

What This Job Offers

Job Type

Full-time

Career Level

Senior

Industry

Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

Education Level

Bachelor's degree
