Lead Site Reliability Engineer

By Light HQ, Orlando, FL

About The Position

Lead a technical team of SREs to ensure maximum uptime of a large private and hybrid cloud environment. Oversight includes all core infrastructure, hypervisors, Kubernetes, and applications. Ensure continuity of operations for end users. In this role you will need to foster a relationship of trust with our customers and the current in-place technical staff, while expanding the team to ensure system availability. Proactively find system deficiencies, prioritize backlogs, and lead the team to correct issues before they impact users. Coordinate with development engineering teams and follow program processes to ensure controlled changes to production systems. Identify and implement automation, observability, and configuration updates throughout the system. Work as part of a fast-moving, highly diverse team across multiple projects simultaneously.

Requirements

  • Must have experience leading a technical team of engineers, including performing supervisory duties to ensure successful team outcomes.
  • Must be able to understand complex enterprise IT system architectures (hardware and software) with the ability to pinpoint the team’s focus areas and prioritizations to maximize impact.
  • Must be a self-starter in a fast-paced environment and able to work with a multi-faceted technical team of engineers with a diverse set of skills at differing levels of experience.
  • Establish processes and procedures for maintaining infrastructure configuration management across the enterprise to include automated system checks against the configuration managed baseline.
  • Must be able to effectively manage processes associated with the team’s use of Git repositories to ensure proper configuration management.
  • Analyze real-world problems and implement solutions according to corporate and government guidelines, procedures, and industry best practices.
  • Ensure system stability, accessibility, and proper configuration of assigned technical systems and components.
  • Be sensitive and flexible to the needs and requirements of the customer.
  • Must be comfortable with Linux system management, as this is the primary operating system, as well as other Red Hat enterprise services.
  • Experience with VMware Cloud Foundation and the overarching VMware hypervisor and management stack and services.
  • Operate and maintain physical switches, routers, and IPS/IDS devices.
  • Operate and maintain a VMware NSX-based SDN.
  • Understand Kubernetes and microservice-based applications.
  • Must be able to author and maintain scripts and automation in a managed Git repository.
  • Proactively identify, locate, mitigate, and resolve system issues.
  • Document and present performance records and results.
  • Document problems/issues via a Jira-based ticketing system/tracking log.
  • Review and update all information security management system processes and procedures (data/software).
  • Bachelor’s degree in a technical discipline such as computer science or information technology from an accredited college or university.
  • Will consider additional degrees with accompanying technical leadership experience.
  • Ten years of work experience preferred.
  • Security+ certification is required or must be completed within six months of hire.
  • Please note that pursuant to a government contract, this specific position requires U.S. citizenship and a Secret security clearance, with the ability to obtain a TS/SCI. The security clearance requirement will be specified in the Government's Task Order.

Nice To Haves

  • Experience with Ansible (or similar) infrastructure automation to deploy and configure standard baselines for physical devices and in a virtual environment (VMware and Linux).
  • Utilize scripting for automating tasks, administration, data collection and reporting. This job requires the ability to write, test, debug, deploy, and maintain scripts.
  • Ability to audit and report system and performance logs.
  • Ability to automate, administer and maintain the environment through configuration management, version control, and backups with the use of scripting.
  • Ability to respond professionally, effectively, and efficiently to service requests.
  • Ability to prioritize multiple tasks, projects, and demands.
  • Ability to research and/or implement new technologies.
  • Effective interpersonal and communications skills.
  • Professionally convey system-wide performance information routinely via tools such as PowerPoint, Excel, Visio, etc.
  • Train others to perform similar design and administrative tasks.
  • Interact with vendors/users/customers and developers to understand needs and operational requirements that will impact development and testing activities.
  • Examine any relevant change implementation, then report the changes to developers and testers welcoming feedback for future improvements.
  • Ability to solve technical problems involving a variety of integrated software and hardware platforms.
  • Knowledge of or experience with the DoD Risk Management Framework (RMF) and National Institute of Standards and Technology (NIST) or similar best practices and security guidelines.
  • Ability to assist others with getting certifications, such as providing guidance, mentoring, sandboxes, or cooperation.

Responsibilities

  • Ensure maximum uptime of a large private and hybrid cloud environment.
  • Oversight includes all core infrastructure, hypervisors, Kubernetes, and applications.
  • Ensure continuity of operations for end users.
  • Foster a relationship of trust with our customers and the current in-place technical staff, while expanding the team to ensure system availability.
  • Proactively find system deficiencies, prioritize backlogs, and lead the team to correct issues before they impact users.
  • Coordinate with development engineering teams and follow program processes to ensure controlled changes to production systems.
  • Identify and implement automation, observability, and configuration updates throughout the system.
  • Work as part of a fast-moving, highly diverse team across multiple projects simultaneously.

Benefits

  • Medical, Dental & Vision Coverage
  • Wellness Program
  • 401(k) Matching
  • Disability (Short Term & Long Term)
  • Employee Assistance Program
  • Education & Training
  • Generous Leave Policy (11 Federal Holidays, PTO, Military Leave, Bereavement and Jury Duty)