Site Reliability Engineer (SRE)

Talentus Global•San Jose, CA

Remote

About The Position

At Talentus Global, we are looking for you! We are a U.S. company with a strong presence in LATAM and in 20+ countries around the world. Our key near-shore BPO services include smart-sourcing, dedicated or cluster teams, managed IT services, software outsourcing, and top ERP & CRM solutions, driven by our practices across many industries, including Higher Education. We are currently looking for a Site Reliability Engineer (SRE) to become a valuable addition to our dynamic team!

Requirements

4 to 6 years of experience in Site Reliability Engineering, DevOps, or related roles.
Strong understanding of system reliability, scalability, and performance engineering.
Experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, or cloud-native tools).
Familiarity with cloud platforms such as Azure, AWS, or GCP.
Experience with scripting or programming languages (Python, Go, Bash).
Knowledge of CI/CD pipelines and DevOps practices.
Experience with containerization and orchestration tools (Docker, Kubernetes).
Strong troubleshooting and incident management skills.
Understanding of networking, distributed systems, and system architecture.
Experience working in Agile/Scrum environments.
Advanced English proficiency (C1) required.
Must have experience working for U.S. clients.

Responsibilities

Ensure high availability, reliability, and performance of applications and infrastructure.
Define and monitor SLIs, SLOs, and SLAs to maintain service reliability.
Implement automation to reduce manual operations and improve system efficiency.
Monitor systems, detect anomalies, and respond to incidents in a timely manner.
Lead incident management, root cause analysis (RCA), and post-mortem processes.
Collaborate with development and DevOps teams to improve system resilience and scalability.
Manage observability tools (monitoring, logging, tracing) to gain system insights.
Optimize system performance, capacity planning, and cost efficiency.
Implement reliability best practices, including redundancy, failover, and disaster recovery.
Continuously improve system reliability through proactive engineering initiatives.