AI Engineering Lead

DEUNA • San Francisco, CA

About The Position

Athia is DEUNA's AI-powered payment intelligence platform, moving from early ML experimentation to the critical infrastructure behind billions of dollars in annual transaction volume. We are looking for a hands-on Engineering Lead who can own the full technical stack: from model development and data pipelines to production payment orchestration, cloud/on-prem deployments, and real-time observability. This is not a coordination role. You will build, ship, and own. You will be the technical authority that bridges AI/ML systems with our core payments stack, leading both the platform engineering and the modeling lifecycle end-to-end.

Requirements

Go (Golang) — production-grade services
Python — ML pipelines, model serving, tooling
RESTful APIs and gRPC
Distributed systems & event-driven architecture
CI/CD, Docker, Kubernetes
Cloud platforms (AWS or GCP)
Hybrid / on-prem deployment patterns
PyTorch or TensorFlow — training & fine-tuning
scikit-learn, XGBoost, or tabular ML
MLflow, Weights & Biases, or equivalent
Feature engineering & feature stores
Model monitoring & drift detection
A/B testing and shadow deployment
Low-latency inference architectures
React and Next.js
TypeScript
Component design systems
API integration patterns
Prometheus, Grafana, or Datadog
Structured logging & distributed tracing
SQL and analytical query patterns
Data pipeline tooling (Airflow, dbt, etc.)
6+ years in software engineering with strong backend foundations.
2+ years in a Tech Lead or Staff Engineer role owning a production platform end-to-end.
Demonstrated experience shipping ML/AI systems to production — not just research or notebooks.
Background in payments, fintech, or high-transaction environments strongly preferred.
Experience with on-premises deployment or hybrid infrastructure for enterprise clients is a plus.
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.

Responsibilities

Design, train, and fine-tune ML models for payment optimization use cases — including authorization rate improvement, dynamic routing, cost minimization, and fraud signal detection.
Select and apply the right frameworks (PyTorch, TensorFlow, scikit-learn) per model type and latency budget.
Own the model lifecycle: experimentation → offline evaluation → shadow deployment → A/B testing → production promotion.
Monitor and remediate model drift, data distribution shifts, and performance degradation proactively.
Define evaluation metrics that map directly to business KPIs (approval rate lift, GMV impact, provider cost).
Architect and build optimized data pipelines to collect, clean, and preprocess high-volume transaction data for model training and inference.
Design feature stores and real-time feature serving layers that keep inference latency within payments SLA requirements (<100 ms).
Establish data quality standards, schema validation, and lineage tracking across the ML data stack.
Partner with the Data Engineering team to ensure training data reflects the full distribution of providers, regions, and merchant types in our network.
Integrate ML model outputs into DEUNA's live payment routing and orchestration layer with zero tolerance for latency regressions or silent errors.
Develop and own the inference service layer in Go and Python, ensuring thread-safe, performant, and fault-tolerant operation under peak transaction load.
Lead the design of hybrid deployment architectures spanning cloud-native (AWS/GCP) and on-premises client environments, including secure bi-directional data synchronization.
Build and maintain RESTful and gRPC APIs that expose Athia capabilities to the broader DEUNA platform and external partners.
Own the full observability stack for Athia: real-time dashboards, alerting thresholds, anomaly detection, and post-incident reviews.
Implement model-specific monitoring (prediction distributions, confidence scores, provider error rates) alongside standard infrastructure metrics.
Create a fast feedback loop with the Operations team to detect and remediate routing degradation or GMV impact within SLA.
Define on-call runbooks and escalation paths that are clear, tested, and kept up to date.
Provide architectural guidance to scale Athia to handle 10M+ monthly transactions across concurrent global partner launches.
Lead and mentor engineers through architecture reviews, code reviews, technical planning, and day-to-day execution.
Drive engineering best practices: testing strategy (unit, integration, shadow), CI/CD pipelines, documentation standards, and security compliance.
Translate business and product goals into concrete technical roadmaps with realistic timelines and clear dependency mapping.