About The Position

As an established global company, we offer the benefit of choice. You can choose what your Ford future will look like: will your story span the globe, or keep you close to home? Will your career be a deep dive into what you love, or a series of new teams and new skills? Will you be a leader, a changemaker, a technical expert, a culture builder, or all of the above?
No matter what you choose, we offer a work life that works for you. Our benefits are listed in the Benefits section below.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or Mathematics, or equivalent work experience.
  • 3+ years of experience as an SRE, DevOps Engineer, Software Engineer or similar role.
  • Strong experience with Python development; familiarity with Terraform provider development is desired.
  • Proficient with monitoring and observability tools.
  • Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.
  • Solid programming skills in Python, with a good understanding of software development best practices.
  • Ability to debug, optimize code, and automate routine tasks.
  • Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.

Responsibilities

  • Provide helpful and actionable feedback and review for code or production changes.
  • Participate in on-call rotation.
  • Provide design feedback and help raise the design skills of others.
  • Implement and manage SRE monitoring applications using AI, Python, and observability data.
  • Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
  • Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand.
  • Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
  • Develop and maintain AI-enhanced automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
  • Troubleshoot and resolve issues in our dev, test, and production environments.
  • Participate in postmortem analysis and create preventative measures for future incidents.
  • Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies.
  • Participate in security audits and vulnerability assessments.
  • Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand.
  • Analyze trends and make recommendations for resource allocation.
  • Implement and monitor performance metrics to proactively identify and resolve issues.
  • Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster.
  • Participate in regular disaster recovery exercises.

Benefits

  • Immediate medical, dental, and prescription drug coverage
  • Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up child care and more
  • Vehicle discount program for employees and family members, and management leases
  • Tuition assistance
  • Established and active employee resource groups
  • Paid time off for individual and team community service
  • A generous schedule of paid holidays, including the week between Christmas and New Year's Day
  • Paid time off and the option to purchase additional vacation time