About The Position

Toyota Financial Services is building out a new Site Reliability Engineering (SRE) team for application domains, and we are seeking senior SRE engineers to ensure the reliability, performance, and availability of the applications within each domain. As a senior SRE engineer for applications, you will work with development engineers, product owners, SRE Infrastructure, production engineers, and Technology Operations Center personnel, with a primary focus on improving observability, automation, overall system health, reliability, and uptime.

Requirements

  • Experience with DevOps tools like GitHub, Harness & Dynatrace.
  • Experience building self-healing systems and automated remediation workflows.
  • Experience in Site Reliability Engineering, DevOps, or a related field.
  • Demonstrated problem-solving ability and command of key SRE/DevOps concepts and tools, with a proven track record of achieving high system reliability and performance.
  • Strong experience with Terraform for AWS IaC.
  • Proficiency in scripting and automation with Python, and familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
  • Deep understanding of cloud platforms (e.g., AWS, GCP, Azure) and container orchestration technologies (Kubernetes/EKS).
  • Effective communication skills, with the ability to convey complex technical concepts to diverse audiences.

Nice To Haves

  • AWS certifications (DevOps Engineer, Solutions Architect, etc.).
  • Familiarity with GitOps, secrets management, and infrastructure monitoring best practices.

Responsibilities

  • Design, code, and maintain automation to streamline operations, reduce manual tasks, and improve system efficiency to enable a robust application environment.
  • Work with observability engineers to enable actionable insights into application and infrastructure health and performance.
  • Foster a collaborative team culture and support professional development.
  • Ensure scalable, repeatable code deployments with CI/CD pipelines using GitHub and Harness, and repeatable deployments with infrastructure as code (IaC) using Terraform.
  • Build automation and operational runbooks primarily using Python scripting.
  • Manage container orchestration platforms and related cloud-native services.
  • Drive reliability improvements through Service Level Objectives (SLOs), error budgets, and Service Level Agreements (SLAs) aligned with business goals.
  • Design & implement observability improvements using Dynatrace & CloudWatch.
  • Lead major incident response, coordinate with stakeholders on resolution, and drive problem management to prevent recurrence.
  • Conduct blameless post-incident reviews and drive continuous improvement.
  • Collaborate cross-functionally to embed SRE principles into application design and operation so that applications meet their reliability goals.
  • Participate in architectural reviews, providing input on reliability and scalability.

Benefits

  • A work environment built on teamwork, flexibility, and respect.
  • Professional growth and development programs to help advance your career, as well as tuition reimbursement.
  • Team Member Vehicle Purchase Discount.
  • Toyota Team Member Lease Vehicle Program (if applicable).
  • Comprehensive health care and wellness plans for your entire family.
  • Toyota 401(k) Savings Plan featuring a company match, as well as an annual retirement contribution from Toyota regardless of whether you contribute.
  • Paid holidays and paid time off.
  • Referral services related to prenatal services, adoption, childcare, schools, and more.
  • Relocation assistance (if applicable).