Site Reliability Engineering Specialist

BT Group • Iuka, IL

About The Position

Why this job matters

The Site Reliability Engineering Specialist independently executes activities that help ensure BT is in the best position to deliver the service performance, reliability and availability that internal and external customers expect, by enabling cross-team engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.

Requirements

Troubleshooting
Infrastructure Configuration
Service Assurance
Application Performance Monitoring & Alerting
Computer Networking
System Administration
Programming/Scripting
Artificial Intelligence Operations (AIOps)
Server Architecture
Cloud Computing
Continuous Integration/Continuous Deployment
Automation & Orchestration
Systems Integration
Project/Programme Management
Incident Management
Decision Making
Growth Mindset
Inclusive Leadership

Responsibilities

Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines (continuous integration/continuous delivery pipelines) whilst applying best practices with a focus on the re-use of application code, demonstrates consistent software delivery practices, and produces continuous integration/continuous delivery platform solutions using Amazon Web Services cloud, infrastructure as code (IaC), GitOps, and container technologies
Coordinates a diverse team and creates the initial test schedule to deliver all aspects of testing to time, budget and quality targets, producing outlines of solutions and defining the depth of testing required
Executes the implementation of automation technologies to ensure repeatability, eliminate toil, and reduce mean time to detection, resolution and repair of services
Proactively identifies and manages risk through regular assessment and diligent execution of controls and mitigations, raising any concerns
Leads scale testing to measure, tune and optimise system performance
Executes metric/monitoring analysis that creates stability, security, and performance improvements
Designs, analyses, develops and troubleshoots highly-distributed large-scale production systems spanning on-prem and cloud-based hosting
Executes approaches that scale systems sustainably through mechanisms like automation and evolves systems by pushing for changes that improve reliability and velocity
Writes and delivers infrastructure as code software to improve the availability, scalability, latency, and efficiency of services
Implements robust monitoring and alerting systems and performs root cause analysis and post-mortems with an eye towards future prevention
Inspects queues and support processing to provide early warning of support issues
Executes retrospective and preventive actions after each high-severity production incident
Analyses complex systems from a reliability and resilience perspective and identifies sources of instability in distributed systems
Champions, continuously develops, and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards
Mentors other site reliability engineers, helping to improve the team's abilities by acting as a technical resource