Technical Operations Lead

Karsun Solutions, LLC
Herndon, VA
Remote

About The Position

This individual will lead technical operations for a cloud-native (AWS) data and AI platform supporting a federal program, owning reliability, observability, incident response, platform engineering, and data-product operationalization.

Requirements

  • 10+ years of directly relevant IT work experience.
  • 7+ years of technical operations / platform / SRE experience supporting data-intensive systems; 3+ years in AWS production environments.
  • Deep understanding of data products and product ownership: data lineage, stewardship, SLAs, and consumer contracts.
  • Proven experience operating data platforms: Databricks, Airflow, S3, Kafka/Kinesis.
  • Strong SRE practice knowledge: SLI/SLO design, incident response, runbooks, chaos/failure-mode testing.
  • Hands-on with observability tooling (Prometheus, Datadog, OpenTelemetry) and log/tracing systems.
  • Familiar with IaC (Terraform or CloudFormation), CI/CD (GitHub Actions/Jenkins/ArgoCD), container orchestration (EKS/Kubernetes), and scripting (Python, Bash).
  • Solid security and compliance experience for federal environments (RBAC, encryption, secrets management).
  • Excellent written and verbal communication; ability to produce clear runbooks and RCA reports and to brief leadership.

Nice To Haves

  • AWS Certified Solutions Architect – Associate (desirable).
  • Prior experience with ML lifecycle/MLOps tooling (SageMaker, Databricks) and feature stores.
  • Experience migrating teams from DevOps to SRE and driving organizational change.
  • Experience with cost optimization and governance of large AWS data/ML workloads.
  • Familiarity with federal program processes, change control, and procurement cycles.
  • Active federal clearance or ability to obtain one.

Responsibilities

  • Serve as primary technical owner for platform availability, reliability, and operational runbook development for data pipelines, feature stores, model serving, and supporting infrastructure.
  • Work closely with the SRE Lead to design and operationalize SRE practices (SLIs/SLOs/SLAs, error budgets, toil reduction) to transition teams from DevOps to SRE.
  • In collaboration with the SRE Lead, build and maintain monitoring, alerting, and observability across data and AI stacks (ETL/ELT, data lakes/warehouses, model training & serving), including metrics, distributed tracing, and centralized logging.
  • Lead incident management: on-call rotations, incident response, RCA, remediation tracking, and continuous improvement.
  • In collaboration with the SRE Lead, automate operational workflows (deployments, scaling, recovery) using IaC (Terraform/CloudFormation) and CI/CD pipelines; reduce manual operational toil.
  • Define and enforce runbooks, backup/restore, RTO/RPO, and disaster recovery for data and ML systems.
  • Partner with data product owners, ML engineers, security, and compliance to ensure production readiness, access controls, and federal compliance requirements.
  • Manage capacity planning, cost optimization, and performance tuning of AWS resources for data and ML workloads.
  • Mentor and lead an ops/SRE team; set technical priorities and coordinate cross-functional platform changes.
  • Maintain vendor and third-party integrations and coordinate upgrades/patching under federal change-control processes.
  • Track and report reliability metrics and operational maturity improvements to stakeholders.

Benefits

  • Commitment to Non-Discrimination