Site Reliability Engineer

Umanist Staffing • Charlotte, NC

About The Position

Position: Site Reliability Engineer

Required skills and qualifications fall into five groups, in this order: Core SRE Skills; Cloud & Infrastructure (Must Have); Observability & Monitoring (Must Have); Development Skills; and FinOps & Cost Management.

Key responsibilities fall into six groups, in this order: Observability & Monitoring; Incident Management; Automation & Infrastructure; FinOps & Cost Optimization; Cross-Team Collaboration; and Full Stack Support.

The Requirements and Responsibilities sections list the individual items in the same group order.

Requirements

5+ years of experience in Site Reliability Engineering or DevOps roles
Strong understanding of SRE principles (SLIs, SLOs, error budgets, toil reduction)
Experience with incident management and post-mortem processes
Proven ability to design for reliability, scalability, and performance
Deep AWS experience: ECS Fargate, Lambda, EventBridge, Route 53, S3, Secrets Manager, CloudWatch
Terraform expertise for infrastructure as code
Container management with ECS Fargate
Linux shell scripting and automation
Experience with GitLab CI/CD pipelines
Understanding of networking, security, and cloud architecture patterns
Hands-on experience with Dynatrace for APM and infrastructure monitoring
Proficiency with AWS CloudWatch and X-Ray for distributed tracing
Experience with LogRocket or similar session replay tools
Power BI or similar BI tools for operational reporting and dashboards
Log aggregation and analysis (CloudWatch Logs, CloudWatch Logs Insights)
Strong programming skills in Java or Node.js
REST API and/or SOAP service development experience
Understanding of microservices and serverless architecture patterns
Experience with NoSQL databases (DynamoDB)
Ability to read, understand, and debug application code
Knowledge of modern design patterns
Experience with AWS cost optimization and FinOps practices
Ability to analyze cloud spending and identify optimization opportunities
Understanding of AWS pricing models and cost allocation strategies
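
For readers less familiar with the SRE vocabulary in this list, the relationship between an SLI, an SLO, and an error budget can be sketched as below. This is only an illustration, not part of the role's tooling: the function names and the 99.9% target used in the note afterwards are assumptions chosen for the example, and a request-based SLI is assumed.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: fraction of requests that were 'good'."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("error budget is zero for this window")
    return 1.0 - failed_requests / budget
```

With a hypothetical 99.9% target over 1,000,000 requests, the budget is about 1,000 failed requests, so 250 failures would leave roughly 75% of it unspent.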

Responsibilities

Implement and maintain comprehensive monitoring using Dynatrace, CloudWatch, LogRocket, and X-Ray
Leverage AI/ML capabilities in observability tools to detect anomalies and predict potential issues
Set up intelligent alerts and dashboards for critical Deposits functionality
Create Power BI reports for operational metrics, SLI/SLO tracking, and executive visibility
Monitor critical applications to minimize downtime and ensure availability
Establish golden signals monitoring (latency, traffic, errors, saturation)
Assist in triage during production incidents and outages
Respond to incidents and participate in post-mortem analysis
Document incident timelines, root causes, and remediation actions
Implement preventive measures based on post-mortem findings
Collaborate closely with developers on issue resolution
Establish and improve incident response runbooks
Automate platform and infrastructure provisioning using Terraform
Automate operational tasks to reduce manual toil
Contribute to continuous integration and continuous delivery pipelines
Implement infrastructure as code best practices
Build self-healing systems and automated remediation workflows
Manage AWS cost optimization for the Deposits application stack
Implement cost monitoring and alerting mechanisms
Identify and eliminate waste in cloud resources
Right-size infrastructure based on usage patterns
Create cost allocation reports and dashboards
Collaborate with teams on cost-effective architecture decisions
Work with Ops, Sustain, and Development teams during incidents and issues
Partner with Full Stack developers on reliability requirements
Collaborate with Security teams on compliance and governance
Engage with Product teams to understand business requirements
Support the team's transition to a Full Stack development model
Understand application code and architecture across all layers
Contribute to code reviews with a focus on reliability and performance
Assist developers with debugging production issues
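
The four golden signals named in the responsibilities above (latency, traffic, errors, saturation) can be illustrated with a small sketch. Everything here is an assumption made for the example: the `RequestSample` type, the `golden_signals` function, the choice of p95 latency, counting only 5xx responses as errors, and passing CPU utilization in as the saturation measure. In practice these values would come from tools such as Dynatrace or CloudWatch rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    latency_ms: float
    status_code: int


def golden_signals(
    samples: List[RequestSample],
    window_seconds: float,
    cpu_utilization: float,
) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    if not samples:
        raise ValueError("samples must not be empty")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    # Latency: nearest-rank 95th percentile, using integer ceil(0.95 * n).
    latencies = sorted(s.latency_ms for s in samples)
    rank = (95 * len(latencies) + 99) // 100
    p95 = latencies[rank - 1]

    # Errors: server-side failures only (5xx), as a fraction of all requests.
    errors = sum(1 for s in samples if s.status_code >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "saturation": cpu_utilization,  # supplied by the caller, not derived here
    }
```

Saturation is passed in rather than computed because it depends on the resource being watched (CPU, memory, connection pools), which request samples alone do not reveal.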