ML Infrastructure/Platform Engineer

Anthelion Capital•New York, NY

2d•$140,000 - $200,000•Hybrid

About The Position

Anthelion is a next-generation investment firm building a proprietary AI and data platform that powers our investment lifecycle from underwriting to portfolio management. The platform integrates structured and unstructured data, advanced analytics, and automated workflows to drive superior risk-adjusted returns in private credit and structured finance. We are engineers and investors working together to redefine how institutional investment decisions are made: faster, smarter, and more transparently.

We are looking for an ML Infrastructure/Platform Engineer to work on the foundational systems that power our data science and AI platform. The role spans the infrastructure layer beneath our ML and AI workflows: data pipelines, orchestration, compute provisioning, model serving, and observability. You will also play a key role in operationalizing our agentic AI platform, ensuring agents are hosted, monitored, and integrated into production-grade systems.

Requirements

3+ years of experience in data engineering, MLOps, or ML infrastructure roles — with a clear track record of building and maintaining production data and ML pipelines.
Strong proficiency in Python and SQL, with hands-on experience building ETL/ELT pipelines and data transformation workflows.
Experience with workflow orchestration tools (Prefect, Airflow, Dagster, or similar) in production environments.
Solid understanding of containerization and cloud infrastructure — Docker, Kubernetes, and at least one major cloud provider (Azure preferred).
Hands-on experience deploying and operating containerized services in cloud environments, including configuring networking, load balancing, and service-to-service connectivity.
Experience with model serving and deployment patterns (batch inference, real-time APIs, feature stores).
Familiarity with monitoring and observability tooling for pipelines and deployed models (data drift detection, health metrics, alerting).
Strong documentation habits and the ability to communicate technical architecture clearly to diverse stakeholders.

Nice To Haves

Experience with Azure services: Container Apps, Azure Container Instances (ACI), Azure Container Registry (ACR), Blob Storage, Key Vault, Managed Identities, VNets.
Familiarity with Prefect (especially cloud-managed work pools, result backends, and human-in-the-loop (HITL) patterns).
Experience with dbt, Snowflake, or similar data transformation and warehousing tools.
Exposure to LLM serving infrastructure and agentic workflow frameworks and protocols (e.g., LangChain or Model Context Protocol (MCP)).
Experience standing up and maintaining third-party AI/ML platform tools (e.g., Langfuse, MLflow, or similar observability and evaluation platforms).
Experience managing internal Python package distribution (private PyPI, Artifactory, or similar).
Familiarity with Git-based release management, branch protection, and CI/CD for data science repos.

Responsibilities

Design, build, and maintain production data pipelines that ingest, transform, and deliver structured and unstructured data to downstream ML workflows.
Own and extend our Prefect-based orchestration layer, including flow scheduling, error handling, retry logic, and human-in-the-loop (HITL) suspend/resume patterns.
Build and maintain feature stores, data contracts, and promotion workflows that ensure data quality and traceability from raw ingestion through model consumption.
Collaborate with data scientists to operationalize experimental workflows into reliable, repeatable pipelines.
Build and maintain scalable infrastructure for model training, retraining, and inference (batch and real-time), including GPU compute provisioning and container orchestration.
Implement and manage model serving infrastructure — including containerized endpoints, API gateways, and self-serve deployment frameworks for the data science team.
Deploy and manage monitoring systems that track model health, data drift, prediction consumption, and pipeline reliability.
Ensure all deployed systems are highly available, resilient, and well-documented with clear data lineage and runbooks.
Support the buildout and operationalization of agentic AI workflows, including agent hosting, lifecycle management, and integration with Model Context Protocol (MCP) servers.
Build shared tooling and infrastructure that enables data scientists to develop, test, and deploy agents with minimal friction.
Design and implement evaluation frameworks and quality standards for AI agents, including automated benchmarking, regression testing, and production-readiness criteria.
Ensure observability and reliability across agent execution environments, including logging, tracing, and performance monitoring.
Deploy, configure, and maintain shared AI platform services (e.g., observability tools, memory layers, evaluation platforms) as containerized workloads on Azure — including end-to-end ownership of networking, access, and connectivity between services.
Manage cloud infrastructure (Azure) including container registries, managed identities, Key Vault secrets, storage backends, and virtual network configurations.
Maintain CI/CD pipelines, branch protection policies, and release management workflows across data science repositories.
Continuously evaluate and adopt tools and technologies that improve platform reliability, developer experience, and team velocity.