Site Reliability Engineer

Umanist Staffing, Charlotte, NC

About The Position

Position: Site Reliability Engineer

Required skills and qualifications fall into five areas: Core SRE Skills, Cloud & Infrastructure (Must Have), Observability & Monitoring (Must Have), Development Skills, and FinOps & Cost Management. Key responsibilities cover Observability & Monitoring, Incident Management, Automation & Infrastructure, FinOps & Cost Optimization, Cross-Team Collaboration, and Full Stack Support. The individual requirements and responsibilities are listed in the sections that follow.

Requirements

  • 5+ years of experience in Site Reliability Engineering or DevOps roles
  • Strong understanding of SRE principles (SLIs, SLOs, error budgets, toil reduction)
  • Experience with incident management and post-mortem processes
  • Proven ability to design for reliability, scalability, and performance
  • Deep AWS experience: ECS Fargate, Lambda, EventBridge, Route 53, S3, Secrets Manager, CloudWatch
  • Terraform expertise for infrastructure as code
  • Container management with ECS Fargate
  • Linux shell scripting and automation
  • Experience with GitLab CI/CD pipelines
  • Understanding of networking, security, and cloud architecture patterns
  • Hands-on experience with Dynatrace for APM and infrastructure monitoring
  • Proficiency with AWS CloudWatch and X-Ray for distributed tracing
  • Experience with LogRocket or similar session replay tools
  • Power BI or similar BI tools for operational reporting and dashboards
  • Log aggregation and analysis (CloudWatch Logs, CloudWatch Logs Insights)
  • Strong programming skills in Java or Node.js
  • REST API and/or SOAP service development experience
  • Understanding of microservices and serverless architecture patterns
  • Experience with NoSQL databases (DynamoDB)
  • Ability to read, understand, and debug application code
  • Knowledge of modern design patterns
  • Experience with AWS cost optimization and FinOps practices
  • Ability to analyze cloud spending and identify optimization opportunities
  • Understanding of AWS pricing models and cost allocation strategies

Responsibilities

  • Implement and maintain comprehensive monitoring using Dynatrace, CloudWatch, LogRocket, and X-Ray
  • Leverage AI/ML capabilities in observability tools to detect anomalies and predict potential issues
  • Set up intelligent alerts and dashboards for critical Deposits functionality
  • Create Power BI reports for operational metrics, SLI/SLO tracking, and executive visibility
  • Monitor critical applications to minimize downtime and ensure availability
  • Establish golden signals monitoring (latency, traffic, errors, saturation)
  • Assist in triage during production incidents and outages
  • Respond to incidents and participate in post-mortem analysis
  • Document incident timelines, root causes, and remediation actions
  • Implement preventive measures based on post-mortem findings
  • Collaborate closely with developers on issue resolution
  • Establish and improve incident response runbooks
  • Automate platform and infrastructure provisioning using Terraform
  • Automate operational tasks to reduce manual toil
  • Contribute to continuous integration and continuous delivery pipelines
  • Implement infrastructure as code best practices
  • Build self-healing systems and automated remediation workflows
  • Manage AWS cost optimization for the Deposits application stack
  • Implement cost monitoring and alerting mechanisms
  • Identify and eliminate waste in cloud resources
  • Right-size infrastructure based on usage patterns
  • Create cost allocation reports and dashboards
  • Collaborate with teams on cost-effective architecture decisions
  • Work with Ops, Sustain, and Development teams during incidents and issues
  • Partner with Full Stack developers on reliability requirements
  • Collaborate with Security teams on compliance and governance
  • Engage with Product teams to understand business requirements
  • Support the team's transition to a Full Stack development model
  • Understand application code and architecture across all layers
  • Contribute to code reviews with reliability and performance focus
  • Assist developers with debugging production issues