Engineering System Reliability (ESR) Engineer

Samsung
Taylor, TX
Onsite

About The Position

About Samsung Austin Semiconductor

Samsung is a world leader in advanced semiconductor technology, founded on the belief that the pursuit of excellence creates a better world. At SAS, we are Innovating Today to Power the Devices of Tomorrow. Come innovate with us!

Position Summary

The Engineering System Reliability (ESR) engineer is a key member of the Site Reliability Engineering (SRE) organization, responsible for the 24x7 operational health of the EES family of engineering systems and related platforms. In this role you will provide continuous monitoring, rapid incident response, and root-cause analysis for high-availability services, while working closely with SRE, Developer, and SIOC teams to drive automation, capacity planning, and seamless system migrations.

Your primary focus will be on maintaining and evolving monitoring frameworks (e.g., Ontune, UIM, Splunk, Prometheus, Grafana) and developing CI/CD pipelines that enable reliable, repeatable deployments across environments. Additional duties include end-of-life architecture planning and other long-term system stability tasks, covering any action necessary for high availability of critical MES fab operation systems. You will also design and maintain backend tooling and scripts that automate health-sensor data collection, performance dashboards, and database operations for Oracle and SQL Server.

Leveraging strong troubleshooting expertise, you will lead post-mortem activities, implement corrective countermeasures, and ensure that all technical documentation is clear, comprehensive, and written in fluent English. The role demands a collaborative mindset, balancing independent problem-solving with teamwork, to support Samsung's manufacturing ecosystem around the clock.

Requirements

  • BS degree in Computer Science/Engineering or a related major with 5-7 years' experience in a software development or DevOps role.
  • Must have 2+ years' experience with monitoring tools such as Ontune, UIM, Splunk, ITSI, Prometheus, or Grafana.
  • Knowledge of programming languages such as shell scripting, C# .NET, Java, or web development frameworks; backend coding experience preferred.
  • Strong troubleshooting skills (SPS certification preferred), with experience diagnosing complex issues through root cause analysis and countermeasure implementation.
  • Extensive database knowledge with 3-5+ years' experience developing and operating Oracle and SQL Server databases and related technologies.
  • Knowledge/experience with both Windows and Linux environments.
  • Ability to document technical details in depth; strong English language skills are expected.
  • Comfortable working both independently and in a team environment, with an emphasis on 24x7 support of the manufacturing environment.

Responsibilities

  • Provide 24/7 operational support and continuous monitoring.
  • Respond to alerts and incidents in a timely manner, troubleshoot and resolve issues, and follow up with SRE postmortem activities for root cause analysis.
  • Implement and maintain monitoring tools and dashboards to track system health and performance, and develop DevOps pipelines for CPM automation via CI/CD.
  • Evaluate current system capacity and plan for future growth, collaborate with teams to ensure scalability in both software and hardware, execute regular maintenance tasks, and coordinate the preparation and execution of system migrations.
  • Develop scripts for process monitoring and operational automation, using various scripting languages as needed.
  • Continuously improve system health sensors and deployment and maintenance techniques, and introduce CI/CD methods where applicable.
  • Ultimately deliver exceptional monitoring, rapid recovery, change point assistance, architecture management, and migration coordination support.

Benefits

  • Medical, dental, and vision insurance
  • Life insurance and 401(k) matching with immediate vesting
  • Onsite café(s) and workout facilities
  • Paid maternity and paternity leave
  • Paid time off (PTO) + 2 personal holidays and 10 regular holidays
  • Wellness incentives and MORE
  • Eligible full-time employees (salaried or hourly) may also receive MBO bonuses based on company, division, and individual performance.