Sr. Technical Engineer

Employee Magnets, Grapevine, TX

About The Position

Sr. Technical Engineer Summary

This role focuses on driving technical excellence by architecting, developing, and continuously improving advanced repair processes for our high-performance AI server infrastructure. This position requires deep hardware expertise, a methodical approach to troubleshooting, and the ability to innovate scalable repair solutions. As a key technical owner, you will lead efforts in designing processes, developing diagnostic tools, conducting root cause analysis, and influencing hardware designs to improve serviceability. Your mission is to establish a center of excellence for AI server repair through engineering rigor, advanced analysis, and reliable processes.

Key Responsibilities

Process Development & Implementation (Primary Focus)

  • Process Design: Architect, document, and execute end-to-end workflows for diagnosing and repairing AI servers and components. Develop detailed Standard Operating Procedures (SOPs), diagnostic flowcharts, and job-specific work instructions.
  • Tool Development: Design and implement advanced diagnostic software tools, scripts, and physical fixtures to improve accuracy, efficiency, and repeatability of troubleshooting and repair activities.
  • Advanced Validation: Define, test, and validate comprehensive test plans for components and full systems to meet high performance and reliability standards.
  • Process Control: Establish control points within workflows to monitor repair quality and gather repair and failure data for analysis and continuous improvement.

Failure Analysis & Advanced Engineering Support (Primary Focus)

  • Problem-Solving Expertise: Serve as the technical escalation point for resolving the most complex hardware issues. Triage, troubleshoot, and drive resolution for rare or unknown failure types.
  • Root Cause Isolation: Conduct deep Root Cause Analysis (RCA), involving schematic interpretation, board-level diagnostics, and meticulous troubleshooting to identify primary causes of failure.
  • Collaboration with Core Engineering Teams: Partner with Product Design, R&D, and Hardware Engineering teams to provide actionable feedback on failure trends and design weaknesses. Collaborate to influence future products with a focus on improved serviceability and reliability.

Technical Advancement and Guidance

  • Development & Training: Create technical resources, training materials, and detailed documentation to propagate advanced diagnostic techniques and repair processes.
  • Knowledge Leadership: Serve as the primary source of technical expertise for the repair center, providing guidance and empowering technicians and engineers with advanced troubleshooting methodologies and engineering insights.
  • Prototyping & Innovation: Drive innovation through iterative prototyping and development of robust repair workflows to improve efficiency and system reliability.

Analytics and Continuous Improvement

  • Data Analysis: Regularly analyze repair data to identify systemic failure trends, optimize existing processes, and track performance metrics such as test yields, repair turn-around times, and cost.
  • Process Optimization: Initiate and lead engineering-driven process improvement projects informed by data analysis to ensure consistent, high-quality repairs.
  • Feedback Loop with Manufacturing & Design: Support continuous improvement by providing actionable insights to manufacturing, design, and quality teams based on repair data and failure modes.

Requirements

  • Education: Bachelor's degree in Electrical Engineering, Computer Engineering, Manufacturing Engineering, or a closely related field.
  • Experience: 3+ years of experience in a technical role, such as Test Engineering, Manufacturing Engineering, Hardware Sustaining, or Repair Engineering with a focus on server systems or data center hardware.
  • Proven expertise in developing detailed SOPs, technical workflows, and diagnostic plans for complex electronics.
  • Strong, hands-on experience with hardware diagnostics, schematic analysis, and troubleshooting methodologies, particularly for server systems.
  • Proficiency in scripting and automation tools (e.g., Python, Bash) to streamline testing and data collection processes.
  • Expertise in server architecture and components, including GPUs, high-speed interconnects (InfiniBand/Ethernet), CPUs, and power distribution systems.

Nice To Haves

  • Master's degree in Electrical or Computer Engineering.
  • Experience with Design for Serviceability (DFS) or Design for Manufacturability (DFM).
  • Familiarity with Lean Manufacturing or Six Sigma methodologies.
  • Hands-on experience with advanced repair techniques, such as BGA rework and microsoldering.
  • Experience performing statistical analysis and developing Engineering Change Requests (ECR).
