Systems Engineer L3 - HPC/HMD

Power3 Solutions and Partnering Companies
Annapolis Junction, MD

About The Position

We are looking for an experienced High-Performance Computing (HPC) Systems Engineer to support complex system design, integration, monitoring, and diagnostics by applying deep understanding of both physical and logical system architectures.

Requirements

  • Demonstrated experience in one or more of the following technical domains, with a strong willingness and aptitude to expand expertise as required: System Architecture and Design, Power Systems, Printed Circuit Board (PCB) Design, Cooling Infrastructure, Signal Integrity, and System Reliability.
  • Proven experience in system health monitoring, diagnostics, and operational support for complex hardware and integrated systems.
  • Ability to read, interpret, and analyze detailed hardware documentation, including specifications, data sheets, schematics, and design artifacts.
  • Strong analytical and troubleshooting mindset, with a natural curiosity and willingness to ask probing questions to identify root causes and systemic issues.
  • Excellent communication skills, including the ability to engage vendors and internal stakeholders to extract detailed technical information related to system design, performance, and anomalies.
  • Experience with, or demonstrated ability to quickly learn, system monitoring and diagnostics tools, including but not limited to Chiplink, Splunk, lspci, and iostat.
  • Self-motivated and capable of completing complex technical tasks with minimal supervision while managing priorities effectively.
  • Ability to clearly convey technical findings, risks, and recommendations to both technical and non-technical audiences through written documentation and oral briefings.
  • Proven ability to work effectively within cross-functional teams, including engineering, diagnostics, operations, and vendor partners.
  • Comfortable learning and adapting to new tools, technologies, and processes as mission and system needs evolve.
  • An active TS/SCI clearance with polygraph is required; the most recent polygraph must have been completed within the last 5 years.

Responsibilities

  • Maintain a comprehensive understanding of the system’s end-to-end physical and logical architecture to effectively apply hardware modeling and diagnostics (HMD) monitoring tools.
  • Leverage HMD monitoring tools to identify, narrow, and triage system issues, directing detailed problems to the appropriate diagnosticians or vendors for resolution.
  • Develop deep expertise in the HMD product and monitoring architecture to identify gaps, inefficiencies, and opportunities to enhance diagnostic effectiveness.
  • Collaborate with developers, analysts, and monitoring tool owners to propose, design, and implement improvements to monitoring solutions, increasing system reliability and operational visibility.
  • Analyze system logs, metrics, and telemetry—primarily using Splunk—to determine root causes, understand system behavior, and identify anomalous conditions.
  • Interpret hardware and system performance data, including graphs and trends, to diagnose system behavior and inform troubleshooting activities.
  • Guide the development of Splunk dashboards, health indicators, and diagnostic scripts to monitor critical data flows, system performance, and failure signatures.
  • Review and evaluate relevant technical documentation; ask clarifying questions and build expertise in hardware design to support accurate and timely system diagnosis.
  • Provide recommendations for testing strategies and develop documentation of issue signatures to enable and accelerate diagnostics development.
  • Collaborate closely with diagnosis teams and external vendors to troubleshoot complex hardware and system-level issues.
  • Track issues through resolution using JIRA, validate fixes, and confirm that corrective actions resolve the underlying problems.

Benefits

  • 100% company-paid health, dental, and vision premiums
  • Automatic company contribution to a Health Savings Account (HSA), up to $3,900 for families
  • Up to 7 weeks of Paid Time Off (PTO)
  • Automatic 401k Investment
  • 11 paid Federal Holidays
  • BlueCross BlueShield Health Insurance
  • Tuition/Training Reimbursement
  • Access to club-level Ravens season tickets
  • Company-paid golf events, covering both your time and course fees