Reliability & Monitoring Engineer II

Nextpower•Nashville, TN

1d•Onsite

About The Position

The Reliability & Monitoring Engineer II is responsible for fleet-level monitoring, incident analysis, and reliability insights for Nextpower-supported utility-scale solar tracker systems. This role provides real-time system visibility, post-event analysis, and actionable intelligence that support rapid recovery and long-term asset reliability, particularly following severe weather and other high-impact events. The position goes beyond basic monitoring execution: the engineer is expected to own complex investigations, help shape monitoring logic and workflows, and act as a technical leader within the Remote Monitoring Center (RMC).

The ideal candidate brings strong experience in robotics, software, APIs, and/or SRE/operations in complex distributed or cyber-physical systems and applies those skills to a new domain (solar and trackers). Operating within a portfolio-based support model, the Reliability & Monitoring Engineer translates monitoring data into clear technical insights that improve system uptime, inform customer communication, and strengthen long-term asset performance.

This is a desk-based role within the Nextpower organization, focused on proactive monitoring, analytical investigation, and continuous operational improvement, working closely with the U.S. Technical Services organization and the Manager, Remote Monitoring & Asset Resilience (U.S.). The role operates within a structured coverage model in the Remote Monitoring Center, with engineers working staggered shifts to maintain daytime and early evening monitoring coverage and ensure effective handoffs between team members.

Key objectives include delivering high-quality fleet monitoring, leading incident analysis and root cause investigation, supporting technical services and customer communication, and driving reliability insights, automation, and operational improvement.

Requirements

Bachelor’s degree in Engineering, Computer Science, Mechatronics/Robotics, Electrical Engineering, or a related technical field; equivalent relevant experience will be considered.
4+ years of experience in reliability engineering, SRE/operations, robotics/automation, fleet monitoring, or operations centers dealing with complex distributed or cyber-physical systems.
Strong experience with monitoring, automation, or control of complex systems, such as robotics, manufacturing automation, OT/ICS, data centers, or cloud services.
Demonstrated experience performing root cause analysis using operational and monitoring data (metrics, logs, time-series, event histories), including structured post-incident reviews.
Strong analytical skills with high attention to detail and a structured, data-driven problem-solving approach.
Clear technical writing skills and the ability to communicate findings to both technical and non-technical audiences, including customers and senior stakeholders.
Proficiency with web-based monitoring platforms, observability stacks, or fleet analytics tools.
Ability to interpret time-series data, alarms, and event logs to diagnose performance and reliability issues across a fleet of assets.
Strong comfort using analytical tools such as Python, SQL, or Excel for data analysis, visualization, and reporting.
Experience working with APIs and data integration (e.g., REST APIs, webhooks, log/metrics pipelines) to move data between systems or automate routine monitoring tasks.
Understanding of weather-driven operational risk, or demonstrated ability to reason about external risk factors impacting system performance.
Strong written and verbal communication skills, with the ability to craft concise incident summaries, RCA documents, and status updates.
Proven ability to work cross-functionally with Technical Services, Engineering, Product, and Operations teams, often across time zones.
Customer- and stakeholder-focused mindset, ensuring information is accurate, timely, and tailored to audience needs.
Ability to influence and drive adoption of improved monitoring practices and standards, even without formal people management responsibilities.
Strong organizational skills with the ability to prioritize and manage multiple events and monitoring tasks concurrently in an incident-driven environment.
Reliability and consistency in following established SOPs, workflows, and documentation standards, while also identifying where they should evolve.
Adaptability to evolving operational needs, portfolio growth, and changes in monitoring tools or processes.
Comfort operating in a fast-paced environment, including occasional support during off-hours events as required by coverage models.
Demonstrated ownership mindset: takes initiative to identify problems, propose solutions, and follow through to implementation and measurement.

Nice To Haves

Prior experience in solar, energy, or grid operations is a plus but not required; must be comfortable learning a new physical domain (PV, trackers, inverters, weather impacts).
Familiarity with NX Navigator or similar systems is highly desirable but can be learned.
Experience with robotics, control systems, or automation (e.g., embedded systems, motion control, industrial protocols) is a strong plus.
Familiarity with dashboarding and analytical tools such as Power BI and Databricks, particularly for building or interacting with reliability and performance dashboards.

Responsibilities

Monitor utility-scale solar tracker fleets using web-based monitoring platforms, including NX Navigator, to maintain real-time awareness of system status.
Identify abnormal system states, communication failures, and offline assets across assigned customer portfolios, and drive deeper analysis of patterns across multiple sites.
Support remote operational actions during high-wind and severe weather events, including coordination of tracker stow and recovery activities under the direction of the Manager, Remote Monitoring & Asset Resilience.
Maintain clear situational awareness across active customer sites, including key alarms, stow states, communication health, and emerging risk signals.
Log and track monitoring observations, ensuring key events are captured in internal systems and aligned with established RMC workflows and SOPs.
Provide input into coverage models, alert tuning, and monitoring standards to improve RMC effectiveness and reduce alert fatigue.
Perform structured Root Cause Analysis (RCA) for system alarms, outages, and post-weather events using operational data, logs, SCADA-like signals, and environmental inputs.
Correlate tracker behavior, monitoring signals, and weather data to determine probable failure mechanisms and reliability risks.
Produce clear, technically sound incident summaries and RCA documentation for customers, Technical Services, and internal stakeholders.
Support warranty-aligned documentation and evidence collection, ensuring events are captured in a way that supports potential warranty claims and risk assessments.
Participate in and, where appropriate, lead post-event reviews, providing data-driven input on incident timelines, system behavior, and key contributing factors.
Use experience with software systems, APIs, or robotics/automation to propose more robust detection mechanisms, health checks, or automated validation routines.
Provide monitoring-based technical analysis to support customer issues managed by the Technical Services team and other customer-facing functions.
Translate complex system behavior into clear, actionable insights that enable Technical Services to prioritize and execute field or remote actions.
Ensure that incident records, timelines, and findings meet internal service expectations and quality standards for accuracy, completeness, and clarity.
Support preparation of materials for customer calls, reports, and follow-ups by supplying data extracts, charts, and concise technical summaries derived from monitoring platforms.
Act as a trusted technical partner to Technical Services, helping refine what “good” analysis and documentation look like for high-priority incidents.
Identify recurring issues, performance degradation patterns, and systemic reliability risks across the monitored fleet, using both manual analysis and analytical tooling.
Recommend improvements to monitoring thresholds, alerting logic, and response workflows, helping to reduce false alarms and improve signal-to-noise ratio.
Use experience with APIs, scripting, and automation (e.g., Python, REST APIs, data pipelines) to suggest or prototype improvements that reduce manual data pulls, standardize common analyses, or improve visibility into key reliability indicators.
Support refinement of monitoring tools, dashboards, and operational playbooks in partnership with the Manager, Remote Monitoring & Asset Resilience and cross-functional stakeholders.
Participate in pilots or trials of new monitoring features, analytics capabilities, or alert configurations, providing structured feedback on effectiveness and usability.
Take ownership of at least one improvement area (e.g., a class of alarms, a dashboard, a subset of sites, or a specific reliability theme) and drive it from problem definition through to measurable impact.
Partner with Engineering, Product, Operations, and Technical Services teams to share monitoring-based field intelligence and support long-term reliability improvements.
Contribute to the creation and maintenance of SOPs, monitoring playbooks, training materials, and internal knowledge bases used by the Remote Monitoring Center.
Document findings, workflows, and lessons learned in a clear and reusable format to support team scaling and onboarding.
Support knowledge sharing and best-practice development within the monitoring and reliability team, including informal coaching or mentoring of other engineers on tools, workflows, and analysis methods.
Bring a software/robotics/system design perspective into conversations with Product and Engineering, helping to translate field/monitoring signals into concrete product or control-system changes.