Site Reliability Engineer (SRE) – Azure

San R&D Business Solutions LLC
Brookhaven, GA

About The Position

We are seeking an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure and proven experience supporting environments within the Banking or Financial Services industry. This role is responsible for designing and maintaining reliable, scalable, and secure cloud infrastructure while ensuring high availability and optimal performance of mission-critical applications in a regulated environment. The ideal candidate brings a strong production support mindset and can effectively balance system reliability, automation, and delivery speed.

Requirements

  • 7–12 years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
  • Strong hands-on experience with Microsoft Azure services including VMs, AKS, App Services, Networking, Storage, and Azure AD (now Microsoft Entra ID).
  • Experience with Infrastructure as Code tools such as Terraform, ARM templates, or Bicep.
  • Expertise in CI/CD tools such as Azure DevOps, Jenkins, or GitHub Actions.
  • Strong scripting skills using PowerShell, Python, or Bash.
  • Experience with containerization and container orchestration using Docker and Kubernetes.
  • Prior experience working within Banking or Financial Services environments.
  • Solid understanding of security, compliance, and risk management in regulated industries.

Nice To Haves

  • Experience with monitoring and observability tools such as Azure Monitor, Prometheus, Grafana, or Splunk.
  • Knowledge of high availability and disaster recovery architectures.
  • Familiarity with ITIL processes and incident management frameworks.
  • Microsoft Azure certifications such as AZ-104, AZ-400, or equivalent.

Responsibilities

  • Design, deploy, and manage highly available, scalable cloud infrastructure on Microsoft Azure.
  • Enhance system reliability, performance, and uptime through automation and proactive monitoring.
  • Build, maintain, and optimize CI/CD pipelines for enterprise and cloud-native applications.
  • Define, track, and improve SLIs, SLOs, and SLAs.
  • Implement and manage observability solutions including logging, monitoring, and alerting.
  • Support incident response, perform root cause analysis, and drive post-incident improvements.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC) practices.
  • Ensure infrastructure and applications comply with banking security and regulatory standards.
  • Collaborate closely with DevOps, development, security, and operations teams.