Software Engineer - E5 (Kubernetes)

Whatfix, San Jose, CA

About The Position

Position Overview: We are looking for a highly skilled and experienced Software Engineer (E5) to join our Site Reliability Engineering team and take end-to-end ownership of large, business-critical features. You'll design, build, ship, and operate reliable, scalable services; break complex work into actionable tasks for yourself and other engineers; set the technical bar through thoughtful design and rigorous reviews; and mentor teammates while partnering with product, platform, and customer-facing groups to keep our systems fast, observable, and always-on.

Scope & Impact: This role is critical to enhancing the reliability, availability, and overall resilience of Whatfix's software products. You will own these non-functional areas and build automated mechanisms to close gaps in them. These mechanisms should scale to the point where other engineering teams can build their own pipelines to ensure the reliability of the services they own. You should be able to build a framework that democratizes the approach to enhancing the observability, recoverability, and self-healing capabilities of products in the Whatfix ecosystem. The framework should also give other engineering teams visibility into the performance of their microservices.

Requirements

  • Candidate should have experience in the following technologies
  • Strong experience in Java.
  • Working experience with Kubernetes, Helm, and ArgoCD.
  • Ability to work with Java- and Python-based applications and identify gaps that could result in failures.
  • Familiarity with CI/CD pipelines and infrastructure as code (IaC) practices.
  • Strong problem-solving and troubleshooting abilities.
  • Excellent communication and collaboration skills.
  • Ability to mentor and guide cross-functional teams.

Nice To Haves

  • Familiarity with log aggregation tools (e.g., ELK Stack).
  • Knowledge of Chaos Engineering principles.

Responsibilities

  • Design and ship scalable platform code that bakes in reliability, fault tolerance, and self-healing for all Whatfix products.
  • Own, design, and develop frameworks, processes, and architecture that enhance the availability and reliability of the system, eliminating or significantly reducing manual effort (e.g., through self-healing and auto-scaling systems and platformization).
  • Serve as a first responder for critical software issues within the team's domain.
  • Prioritize and take ownership of unowned or complex tasks that enable the team to move faster.
  • Ensure that customer issues are not just fixed but that effective long-term solutions are implemented to prevent recurrence.
  • Own task breakdown from stories/features, ensuring each task is feasible within five days.
  • Write detailed design documents for the features being worked on.
  • Implement well-tested and documented code based on engineering standards and best practices.
  • Own and support the team's features to ensure high availability and compliance.
  • Review designs and code written by peers as well as other teams from perspectives of testability, maintainability, reliability, security and cost.
  • Work with other teams to improve the developer experience by enhancing developer tools, and suggest and implement AI workflows in the areas of observability, availability, and reliability.
  • Demonstrate expertise in one or more technical areas and contribute to the overall technical direction of the team.
  • Increase the observability of software systems.
  • Manage infrastructure in an automated manner (using automated CI/CD pipelines and IaC frameworks).
  • Identify gaps in monitoring and observability and fix them in a sustainable, scalable, and automated manner.
  • Bring a proven track record of defining SLAs for systems, continuously tracking them, and improving them.
  • Resilience engineering practices: drive blameless post-incident RCAs and convert findings into code, tests, and platform improvements.
  • Work with other teams to enhance the observability and recoverability (for example, through self-healing) of those teams' features.
  • Conduct training sessions or workshops on observability and reliability practices.
  • Provide guidance on best practices for monitoring, alerting, and logging.

Benefits

  • Uncapped incentives
  • Equity plan
  • Mac shop, work with the newest technologies
  • Unlimited PTO policy
  • Paid maternity/paternity leave
  • Monthly cell phone stipend
  • Paid daily UberEats lunches
  • Medical, Dental, and Vision coverage (Whatfix pays 80% of the premium for individuals and their families; for the HSA, Whatfix contributes $1,000 for individuals and $2,000 for a family)
  • Team and company outings
  • Learning and Development benefits

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Publishing Industries
  • Education Level: No Education Listed
  • Number of Employees: 501-1,000
