Site Reliability Engineer (SRE)

Talentus Global, San Jose, CA
Remote

About The Position

At Talentus Global, we are looking for you! We are a U.S. company with a strong presence in LATAM and in 20+ countries around the world. Our key nearshore BPO services include smart sourcing, dedicated or cluster teams, managed IT services, software outsourcing, and top ERP and CRM solutions, driven by our practices across many industries, including Higher Education. We are currently looking for a Site Reliability Engineer (SRE) to become a valuable addition to our dynamic team!

Requirements

  • 4 to 6 years of experience in Site Reliability Engineering, DevOps, or related roles.
  • Strong understanding of system reliability, scalability, and performance engineering.
  • Experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, or cloud-native tools).
  • Familiarity with cloud platforms such as Azure, AWS, or GCP.
  • Experience with scripting or programming languages (Python, Go, Bash).
  • Knowledge of CI/CD pipelines and DevOps practices.
  • Experience with containerization and orchestration tools (Docker, Kubernetes).
  • Strong troubleshooting and incident management skills.
  • Understanding of networking, distributed systems, and system architecture.
  • Experience working in Agile/Scrum environments.
  • Advanced English proficiency (C1) required.
  • Must have experience working with U.S. clients.

Responsibilities

  • Ensure high availability, reliability, and performance of applications and infrastructure.
  • Define and monitor SLIs, SLOs, and SLAs to maintain service reliability.
  • Implement automation to reduce manual operations and improve system efficiency.
  • Monitor systems, detect anomalies, and respond to incidents in a timely manner.
  • Lead incident management, root cause analysis (RCA), and post-mortem processes.
  • Collaborate with development and DevOps teams to improve system resilience and scalability.
  • Manage observability tools (monitoring, logging, tracing) to gain system insights.
  • Optimize system performance, capacity planning, and cost efficiency.
  • Implement reliability best practices, including redundancy, failover, and disaster recovery.
  • Continuously improve system reliability through proactive engineering initiatives.

Benefits

  • Contractor model
  • Remote model
  • Salary in USD
  • Paid vacation
  • Day off for birthdays
  • Benefits for courses and/or certifications
  • Opportunity to work with top-tier U.S. clients.
  • Entrepreneurial, multicultural team culture.