Maintenance Engineer, Reliability and Backup Operations

Twin State Technical Services LTD • Davenport, IA

About The Position

The Maintenance Engineer, Reliability and Backup Operations is responsible for the ongoing health, reliability, and operational readiness of customer environments. This role focuses on preventative maintenance and continuous improvement by reducing the recurring causes of avoidable incidents, with a strong emphasis on backup success, recoverability, and stability of critical systems. This position is proactive by design. The Maintenance Engineer identifies patterns across customer environments, drives corrective actions that prevent repeat issues, and helps standardize maintenance routines so systems remain stable, recoverable, and aligned with supported configurations.

Requirements

2+ years of experience in an MSP, systems administration, or technical operations environment.
Experience performing preventative maintenance and remediation across Windows-based environments and common business infrastructure.
Strong troubleshooting skills and ability to identify root-cause patterns from recurring operational issues.
Strong written communication skills, including clear documentation in a ticketing system.
Ability to manage multiple customer environments with consistent operational discipline.

Nice To Haves

Experience improving backup reliability and performing restore validations in customer environments.
Experience driving operational standardization across multiple environments (runbooks, baselines, repeatable maintenance routines).
Familiarity with MSP tool stacks (ticketing, documentation platforms, backup systems, RMM).
Basic scripting and automation skills (PowerShell and/or Python) to support maintenance efficiency and standardization.
Certifications such as Network+, Security+, or Microsoft Fundamentals, or equivalent experience.

Responsibilities

Preventative Maintenance and Reliability
Execute preventative maintenance across customer environments to reduce avoidable incidents and repeat tickets.
Identify patterns in recurring operational issues and implement corrective actions to reduce repeat volume over time.
Maintain a cadence of reliability reviews (daily, weekly, and monthly) aligned to customer criticality and agreed service levels.
Improve system readiness by resolving conditions that commonly lead to instability, degraded performance, or preventable outages.
Backup Health and Recoverability
Own the day-to-day health of customer backups, including remediating failures and restoring backup reliability quickly.
Validate recoverability through periodic restore testing and verification activities using approved procedures.
Identify systemic backup issues (authentication failures, agent health, connectivity constraints, retention or storage constraints, configuration drift) and drive them to resolution.
Maintain operational readiness for recovery by ensuring recovery steps are documented, current, and repeatable.
Maintenance Remediation and Operational Readiness
Resolve maintenance conditions that create recurring incidents or increase operational risk (stability blockers, misconfigurations, degraded services, software conflicts, lifecycle issues).
Support OS and application readiness efforts by addressing compatibility blockers, upgrade prerequisites, and configuration drift as appropriate.
Identify and remediate silent failures such as services that stop intermittently, scheduled tasks that fail, recurring update failures, or reliability warnings that indicate future outages.
Coordinate maintenance windows and remediation activities through the Managed Services Team Lead when customer coordination is required.
Standardization and Continuous Improvement
Help standardize maintenance routines and operational baselines across customer environments.
Reduce repeat problems by creating and improving runbooks and repeatable procedures for common issue categories.
Track recurring maintenance themes and recommend improvements to standards, onboarding checklists, and operational procedures.
Collaborate with engineering on systemic fixes when issues are tool-driven, architectural, or exceed defined maintenance scope.
Coordination and Escalation
Escalate to engineering when issues exceed defined scope, require advanced remediation, or indicate significant customer impact.
Escalate to the Managed Services Team Lead for customer communication, dispatch coordination, and onsite actions when needed.
Provide high-quality handoffs that state what was observed, what was validated, what actions were taken, and what next steps are recommended.
Support incident recovery work as assigned, focusing on stabilization and returning systems to a healthy baseline.
Documentation and Process Adherence
Document work performed in the ticketing system, including evidence of remediation and follow-up recommendations.
Maintain and improve maintenance runbooks and repeatable procedures for common maintenance conditions.
Follow change management and customer communication procedures during remediation activities.
Ensure maintenance work is consistent, repeatable, and aligned with agreed service standards.