Senior Tier-4 Model Serving Support Lead

ECS Tech Inc
Fairfax, VA
Onsite

About The Position

The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.

The Senior Tier-4 Model Serving Support Lead serves as the authoritative escalation owner for AI and machine learning model-serving pipelines, production endpoints, and model zoo operations across WDP Core Integration's full multi-enclave environment. This role bridges platform engineering, cybersecurity, and cross-service mission partners to sustain uninterrupted AI model-serving performance in direct support of DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.

Requirements

  • Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI) eligibility.
  • 10 or more years of progressive experience in AI/ML platform operations, enterprise incident management, or senior IT support roles, with demonstrated responsibility for Tier-4 or equivalent escalation ownership in classified or federal government multi-enclave cloud environments.
  • Hands-on experience applying enterprise observability and container orchestration tooling, including Kubernetes, GitLab CI, Elastic Stack, Prometheus, and Grafana, to diagnose AI/ML serving failures, analyze pipeline telemetry, and coordinate stabilization activities across Unclassified, Secret, and Top Secret network environments.
  • Demonstrated experience coordinating with DoW-authorized DevSecOps platform environments such as Platform One or Cloud One, including participation in cross-enclave release readiness activities, rollback validation, and post-deployment stability verification for AI/ML model-serving workloads.
  • CompTIA A+ certification or equivalent, demonstrating validated foundational knowledge of IT systems, hardware, software, and operational support practices.
  • Strong problem-solving and decision-making capabilities, with a proven ability to weigh the relative costs and benefits of potential actions and identify the most appropriate solution.
  • Highly developed interpersonal and oral/written communication skills, with the ability to effectively and professionally interact with a diverse set of stakeholders (from peers to end-users to executive management).

Responsibilities

  • Owns Tier-4 escalation coordination for artificial intelligence and machine learning model-serving pipelines, production endpoints, and model zoo operations within War Data Platform (WDP) Core Integration environments supporting Department of War missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
  • Directs escalation workflows by activating incident bridges, coordinating engineering response actions, validating operational impact, and aligning escalation playbooks with service-level agreement requirements.
  • Applies Kubernetes, GitLab Continuous Integration, VMware environments, Elastic Stack, Prometheus metrics, Grafana dashboards, and enterprise observability tooling to diagnose serving failures, analyze telemetry, and guide stabilization activities across unclassified and higher-domain enclaves.
  • Leads coordination with Platform One, Cloud One, multi-national engineering teams, and cross-service mission partners to maintain operational readiness for serving pipelines, cross-domain transfer workflows, API endpoints, and model-runtime components.
  • Conducts structured post-incident analysis by collecting operational evidence, reconstructing failure sequences, validating remediation steps, and documenting mission-assurance considerations for future release cycles.
  • Produces mission-critical deliverables including escalation playbooks, incident-response documentation, service-level alignment reports, operational risk assessments, and restoration summaries.
  • Strengthens program value by reinforcing deployment consistency, advancing mission assurance posture, and sustaining operational continuity across all enclaves.
  • Supports enterprise release operations by coordinating readiness checks, validating rollback pathways, and maintaining authoritative Tier-4 support artifacts required for uninterrupted artificial intelligence model-serving performance.
  • Performs other duties as assigned.