Navy Federal Credit Union • Posted about 1 year ago
Full-time • Mid Level
Vienna, VA
Credit Intermediation and Related Activities

The Cloud Site Reliability Engineer (Azure) at Navy Federal Credit Union is responsible for supporting software development, operations, and maintenance within complex cloud infrastructure. This role focuses on improving performance, stability, and reliability through automated solutions, providing Tier 3 support for cloud applications and platforms. The ideal candidate will have hands-on experience with the software development lifecycle and a strong understanding of maintaining cloud services.

  • Set up and maintain Azure-native monitoring tools like Azure Monitor, Log Analytics, and Application Insights.
  • Build tailored dashboards for key metrics and configure proactive alerting mechanisms.
  • Utilize Azure Sentinel for security incident detection and response.
  • Implement end-to-end observability practices for containerized applications.
  • Design and maintain automation scripts using Python, PowerShell, or Bash.
  • Develop runbooks and automated workflows for incident response.
  • Create scripts for automatic system adjustments and recovery actions.
  • Utilize Terraform or ARM templates for cloud resource provisioning.
  • Lead the identification and troubleshooting of system performance issues.
  • Conduct thorough post-incident analyses and document root causes.
  • Keep incident response runbooks up to date with best practices.
  • Continuously monitor key performance indicators (KPIs) across cloud resources.
  • Propose strategies to improve cost-efficiency and performance of cloud services.
  • Work closely with architecture and development teams for robust cloud solutions.
  • Implement best practices for optimizing container performance within AKS clusters.
  • Provide feedback to development teams for reliability and scalability.
  • Advocate for best practices in reliability and incident management.
  • Collaborate with security teams to mitigate vulnerabilities in cloud infrastructure.
  • Create comprehensive documentation for monitoring configurations and incident response protocols.
  • Contribute to internal training resources for team members.
  • Analyze usage data for cost optimization opportunities.
  • Use Azure Cost Management + Billing to monitor expenses and track costs.
  • Work with architecture teams to design cost-effective solutions.
  • Develop automation scripts for dynamic resource allocation based on load.
  • Proficiency in Service Level Objectives, Service Level Indicators, and error budgeting.
  • Expertise in chaos engineering practices to improve system resiliency.
  • Deep knowledge of monitoring and observability tools like Prometheus and Grafana.
  • Strong troubleshooting abilities for distributed systems.
  • Experience implementing incident management frameworks.
  • Bachelor's Degree in Information Technology or equivalent experience.
  • Solid hands-on experience in a Site Reliability Engineer or DevOps Engineer role with a focus on Azure cloud services.
  • Proficiency in scripting languages such as Python, PowerShell, or Bash.
  • Extensive experience with Azure monitoring tools like Azure Monitor, Log Analytics, Application Insights, and Azure Sentinel.
  • Familiarity with AKS and best practices for monitoring containerized applications.
  • Proven track record of effective troubleshooting and resolution of cloud infrastructure issues.
  • Hands-on experience creating automated solutions using IaC tools like Terraform or ARM templates.
  • Strong interpersonal skills for effective collaboration.
  • Azure certifications such as Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.
  • Experience with Kusto Query Language (KQL) for data analysis.
  • Familiarity with integrating security best practices into monitoring and incident response.
  • Dynatrace experience.
  • Knowledge of DevOps and Agile Methodologies.
  • Experience in Microsoft Azure Technologies.
  • Experience in Tanzu Application/Container Services or equivalent platforms.
  • Experience using ServiceNow ITOM and ITSM.
  • Competitive pay
  • Generous benefits and perks
  • Hybrid workplace options
  • Employee referral program