About The Position

We are seeking a senior MLOps Architect to design and scale a modern ML and Generative AI platform across AWS. This role will own the architecture for traditional ML and LLM/Generative AI pipelines, ensuring production reliability, governance, cost optimization (FinOps), and enterprise-grade security. The ideal candidate has deep expertise in AWS, SageMaker, Databricks, Atlan (data catalog/governance), and modern MLOps tooling, and understands how to operationalize LLMs, RAG systems, and foundation models within a governed, scalable MLOps stack. This is a strategic, hands-on architecture role responsible for integrating GenAI capabilities into an enterprise ML platform.

Requirements

  • 6+ years of experience in ML engineering, data engineering, or MLOps roles.
  • Proven experience architecting ML platforms in AWS.
  • Strong hands-on experience with SageMaker (training, pipelines, deployment).
  • Experience operationalizing LLM or Generative AI systems in production.
  • Experience building RAG pipelines and integrating vector databases.
  • Experience working with Databricks in production.
  • Experience implementing data governance and catalog systems (e.g., Atlan).
  • Strong understanding of CI/CD principles for ML and GenAI.
  • Experience with containerization (Docker) and orchestration (Kubernetes/EKS).
  • Deep knowledge of infrastructure-as-code (Terraform, CloudFormation).
  • Strong understanding of observability and monitoring for ML systems.
  • Experience implementing cloud cost optimization strategies (FinOps).
  • Strong Python proficiency.
  • Experience with foundation model fine-tuning and parameter-efficient methods.
  • Experience implementing model registries and experiment tracking tools.
  • Experience designing feature stores and embedding stores.
  • Familiarity with AI risk management, bias mitigation, and safety controls.
  • Experience supporting regulated or data-sensitive environments.
  • Platform-level architectural thinking.
  • Deep understanding of how to integrate GenAI into enterprise ML ecosystems.
  • Ability to balance scalability, governance, security, performance, and cost.
  • Strong technical leadership and cross-functional collaboration skills.
  • Hands-on ability to move from architecture design to implementation.

Responsibilities

  • Design and implement scalable ML and LLM infrastructure on AWS (SageMaker, EKS, S3, IAM, Lambda, Step Functions, CloudWatch).
  • Architect end-to-end ML and Generative AI lifecycle workflows: data ingestion and preprocessing; feature engineering and embedding generation; model training and fine-tuning (traditional ML and foundation models); model evaluation and validation; deployment (real-time, batch, streaming); and monitoring and retraining.
  • Integrate LLM pipelines (prompt workflows, RAG architectures, fine-tuning flows) into the enterprise MLOps stack.
  • Define standards for CI/CD/CT pipelines across ML and GenAI workloads.
  • Architect Retrieval-Augmented Generation (RAG) pipelines, including embedding generation workflows, vector database integration, document ingestion and chunking strategies, and retrieval evaluation and monitoring.
  • Design and deploy LLM-based services using managed services (e.g., SageMaker endpoints, Bedrock-style APIs) and containerized custom inference services.
  • Establish prompt versioning, evaluation frameworks, and experiment tracking for LLM systems.
  • Implement guardrails for hallucination control, safety monitoring, bias detection, and usage logging.
  • Define architecture for LLM fine-tuning workflows (including data curation, evaluation, and cost controls).
  • Implement scalable orchestration of LLM pipelines using workflow engines and event-driven patterns.
  • Architect scalable inference patterns for traditional ML models, LLM APIs, and RAG systems.
  • Implement model monitoring frameworks for performance degradation, drift detection, LLM output quality, and latency and token usage metrics.
  • Define SLAs/SLOs for ML and GenAI systems.
  • Design safe deployment strategies (blue/green, canary, shadow testing).
  • Establish logging, observability, and traceability standards for GenAI systems.
  • Implement cost tracking for training workloads, GPU utilization, inference endpoints, token consumption (LLM APIs), and vector database storage.
  • Optimize LLM workloads for cost-performance tradeoffs (model size, batching, caching strategies).
  • Design autoscaling and compute optimization strategies for GPU and CPU-based inference.
  • Partner with finance and engineering teams to forecast ML/GenAI infrastructure spend.
  • Define enterprise standards for experiment tracking, model registry, prompt registry, artifact management, and embedding versioning.
  • Provide architectural guidance to data science, AI, and engineering teams.
  • Evaluate and recommend tooling across the ML/GenAI stack (MLflow, feature stores, vector databases, orchestration tools).
  • Drive documentation and reusable patterns for ML and GenAI development.

Benefits

  • Competitive Base Salary Range of $117,800 – $189,000
  • Annual Incentive Compensation Eligibility – Up to 10% annually
  • Health Insurance: Comprehensive medical, dental, and employer-paid vision plans through UnitedHealthcare (UHC), with various coverage levels available to meet the needs of our employees and their families. Additional perks through UHC include: Sweat Equity, free subscription to the Calm App, UHC rewards, Real Appeal, and Quit For Life.
  • Flexible Spending Account: Set aside pre-tax dollars from your paycheck to pay for qualified out-of-pocket medical, dental, vision, pharmacy or dependent care expenses.
  • Lifestyle Spending Account: Employer sponsored post-tax benefits that allow reimbursement for expenses related to physical, mental and financial well-being.
  • 100% Company-Paid Insurance: Kapitus fully covers the cost of basic short-term and long-term disability insurance, as well as vision insurance, so employees have this protection at no personal expense.
  • Voluntary Insurance: Supplemental life insurance and enhanced short- and long-term disability coverage are available through Mutual of Omaha, providing additional security for our employees. Colonial accident and hospitalization insurance is also available, offering further protection against unforeseen events.
  • Paid Maternity and Parental Leave: Beyond state-mandated leave policies, Kapitus provides company-paid maternity and parental leave, supporting our employees during important family milestones.
  • Commuter Benefits: We offer pre-tax benefits on parking and commuter expenses to cover travel to and from work.
  • LifeBalance Program: Enhance your lifestyle with our LifeBalance membership, which offers discounts on outdoor activities, the arts, health, and fitness. Additional benefits include pet and car insurance discounts, financial services such as LegalShield, and relaxation and stress management tools.
  • Plum Benefits Discount Program: Access exclusive discounts on shows, travel, car rentals, and more, enriching your personal and family life.
  • Tuition Reimbursement: Pursue further education with up to $5,000 annually in tuition reimbursement, plus opportunities to attend relevant conferences and career development events. This benefit is managed through our LSA plan, Kapitus Academy.
  • Travel Reimbursement: We reimburse all work-related travel, supporting your involvement in career and personal development activities.
  • Paid Time Off and Sick Time.
  • Retirement Benefits: Our 401(k) plan is managed through Fidelity. To support your long-term financial goals, the company provides a 25% match on your contributions, up to 6% of your annual salary.