Manager, SRE FedRAMP-33539

Cisco • Chicago, IL

About The Position

Splunk, a Cisco company, is building a safer and more resilient digital world with an end-to-end, full-stack platform made for a hybrid, multi-cloud world. Leading enterprises use our unified security and observability platform to keep their digital systems secure and reliable. Come help organizations be their best, while you reach new heights with a team that has your back.

Meet the Team

The Splunk Observability Cloud team provides full-fidelity monitoring and troubleshooting across infrastructure, applications, and user interfaces, in real time and at any scale, to help our customers keep their services reliable, innovate faster, and deliver great customer experiences.

Infrastructure Software Engineers at Splunk are cloud-native systems engineers who use infrastructure-as-code, microservices, automation, and efficient design to build, operate, and scale our products.

You will lead and manage one of the largest and most sophisticated cloud-scale, big data, and microservices platforms in the world. You will be responsible for managing engineers who operate highly available, scalable, and cost-efficient applications with low operational burden by managing and improving the reliability and resiliency of services and infrastructure. You thrive on driving initiatives in automation, infrastructure-as-code, and reliability engineering, and on eliminating tedious, manual tasks.

Lead a team of super smart engineers who are passionate about large-scale distributed systems for Splunk Observability Cloud in FedRAMP environments.
Manage across the organization to deliver quality products that delight Splunk's passionate users.
Mentor and grow teams of tight-knit engineers who are building a state-of-the-art, cloud-based environment for massive-scale data processing.
Partner with our Talent Acquisition team as we recruit, interview, and hire the best engineering talent to join Splunk's growing SRE FedRAMP team.
Manage engineers to achieve more than they thought possible.
You enjoy managing and driving teams to success and are fulfilled through the success of others.

Requirements

8+ years of experience in handling large-scale cloud-native microservices platforms.
2+ years of hands-on management experience leading teams that deploy, manage, and monitor large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
Experience with, and leading a team in, infrastructure automation and scripting using Python and/or Golang.
Experience managing remote teams.
Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, ELK stack, etc., to build observability for large-scale microservices deployments.
Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.

Nice To Haves

Familiarity with working in and/or managing teams in compliance environments such as HIPAA, GovCloud, State Government, Federal Government, SOC 2, or FedRAMP.
AWS Solutions Architect certification preferred.
Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications preferred.
Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
Bachelor's/Master's in Computer Science, Computer Engineering, or related technical field, or equivalent practical experience.

Responsibilities

Manage a team working on reliability projects, including:
HA, business continuity planning, disaster recovery, backup/restore, RTO, RPO
Chaos engineering
Application uptime and performance
Capacity management & planning
SLIs, SLOs, error budgets, and monitoring dashboards
Responsible for deployment and operations of large-scale distributed data stores and streaming services
Establishing design patterns for monitoring and benchmarking
Establishing and documenting production runbooks and guidelines for developers
Tooling, toil reduction, runbooks & automation to handle production environments
Incident management and improving MTTD/MTTR for services
Cloud cost optimization

Benefits

U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short and long-term disability coverage, and basic life insurance.
U.S. employees are eligible for paid time away as described below, subject to Cisco’s policies:
10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
1 paid day off for employee’s birthday, paid year-end holiday shutdown, and 4 paid days off for personal wellness determined by Cisco
Non-exempt employees receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees
Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations)
80 hours of sick time off provided on hire date and each January 1st thereafter, and up to 80 hours of unused sick time carried forward from one calendar year to the next
Additional paid time away may be requested to deal with critical or emergency issues for family members
Optional 10 paid days per full calendar year to volunteer
For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco’s policies.
