About The Position

We are seeking an experienced and highly motivated Sys/Cloud Admin/Incident Response Engineer to support enterprise monitoring operations, incident detection, response activities, and operational situational awareness for a mission-critical platform within the Department of Veterans Affairs (VA) environment. In this role, you will provide hands-on administration and operational support to help ensure monitoring and incident management processes effectively sustain system reliability, operational continuity, and rapid restoration of services across a large-scale, 24x7 enterprise healthcare platform. You will work closely with the Monitoring & Incident Management Manager, Program Manager, Technical Directors, DevSecOps & SRE teams, and VA stakeholders to identify, escalate, communicate, and help resolve incidents in alignment with strict service-level expectations and operational standards.

Requirements

  • Bachelor’s degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field; equivalent relevant experience may be considered.
  • 3+ years of experience in systems administration, cloud operations, site reliability, network operations, incident response, or enterprise production support roles.
  • Hands-on experience supporting Windows and/or Linux server environments, cloud-hosted infrastructure, and enterprise application platforms.
  • Experience with monitoring, logging, and observability tools used to detect, investigate, and troubleshoot service disruptions.
  • Working knowledge of incident management processes, ticketing workflows, escalation practices, and service restoration procedures in ITIL-aligned environments.
  • Ability to analyze logs, alerts, and system behavior to support troubleshooting and rapid issue resolution.
  • Strong written and verbal communication skills, with the ability to document incidents and coordinate effectively across technical and non-technical stakeholders.
  • Ability to work in a 24x7, SLA-driven environment and participate in operational response activities under time-sensitive conditions.
  • Eligibility to obtain and maintain a Public Trust clearance.

Nice To Haves

  • Experience supporting VA or other Federal Government environments, including familiarity with operational reporting, service management, and compliance expectations.
  • Experience with cloud and platform technologies such as AWS, Azure, Kubernetes, container platforms, virtualization, or hybrid infrastructure.
  • Familiarity with enterprise monitoring and observability platforms such as Splunk, Dynatrace, CloudWatch, Azure Monitor, Grafana, or similar tools.
  • Experience using scripting or automation tools such as PowerShell, Python, Bash, or infrastructure automation frameworks to streamline operational tasks.
  • Exposure to DevSecOps, Site Reliability Engineering (SRE), SAFe Agile, or modern incident response and post-incident review practices.
  • Relevant certifications such as AWS Certified SysOps Administrator, Azure Administrator Associate, CompTIA Security+, ITIL Foundation, Splunk, or similar credentials.

Responsibilities

  • Administer, monitor, and support cloud and platform services, virtual infrastructure, and hosted applications to maintain system health, availability, and performance.
  • Configure, tune, and maintain monitoring, logging, and alerting solutions to improve visibility across infrastructure, applications, and service dependencies.
  • Validate alert accuracy, reduce noise, and help ensure operational issues are detected proactively through effective observability practices.
  • Perform routine system administration tasks such as environment checks, service restarts, access support, patch coordination, and operational maintenance activities.
  • Monitor incident queues and system alerts, perform initial triage, document impact, and execute defined escalation procedures for incidents affecting mission-critical services.
  • Participate in major incident response activities, including troubleshooting, log review, coordination with engineering teams, and support for service restoration efforts.
  • Follow incident response playbooks, severity models, and communication protocols to support timely resolution and accurate status reporting.
  • Document incident timelines, actions taken, recovery steps, and supporting evidence to enable post-incident review and continuous improvement.
  • Support coordination during operational events by working across infrastructure, application, DevSecOps, SRE, and service management teams.
  • Provide clear, timely updates on incident status, service impact, troubleshooting progress, and recovery actions to internal stakeholders.
  • Escalate issues appropriately based on impact, urgency, and established operational procedures.
  • Maintain accurate operational records in ticketing, incident, and knowledge management systems.
  • Partner with engineers and platform teams to improve dashboards, alerts, runbooks, and operational procedures supporting reliable service delivery.
  • Identify recurring operational issues, alert gaps, and system weaknesses, and recommend practical improvements to reduce incident frequency and response time.
  • Support automation efforts for routine operational tasks, alert correlation, remediation workflows, and incident response activities where applicable.
  • Contribute to post-incident reviews, root cause analysis activities, and implementation of corrective or preventive actions.
  • Help maintain operational reporting on incidents, system health, availability, and response metrics to support service-level objectives and operational reviews.
  • Ensure incident records, escalation paths, standard operating procedures, and response documentation remain current and usable.
  • Support compliance with operational policies, security requirements, and change management practices in cloud and enterprise environments.
  • Participate in on-call or after-hours operational support, as required, in a 24x7 mission-driven environment.