About The Position

Infios is seeking a Senior AI Platform Engineer with deep expertise in spec-driven AI SDLC and strong hands-on experience with AWS AI infrastructure (Bedrock, Bedrock Agents, Agent Core). The role involves championing a specification-first approach to AI development, translating product requirements into rigorous AI specs, building LLM-powered and agentic applications using Spring AI, and owning the full lifecycle from prototype through production on AWS. The company values excellent problem solving, clear communication, and engineers who bring discipline and craft to AI product delivery. Infios is a leader in supply chain software, developing the technologies that will shape how supply chains operate in the future.

Requirements

  • Spec-Driven AI SDLC: Deep expertise in the AI software development lifecycle with a specification-first mindset.
  • Experience authoring AI feature specs (acceptance criteria, evaluation metrics, prompt contracts) and driving the full lifecycle from prototyping through evaluation frameworks, A/B testing, deployment of non-deterministic systems, and production monitoring (drift detection, quality scoring, feedback loops).
  • Track record of shipping AI-powered features through multiple product cycles with engineering rigor.
  • AWS AI Infrastructure: Strong hands-on experience with Amazon Bedrock, Bedrock Agents, Agent Core, SageMaker, and Amazon Q.
  • Solid knowledge of core AWS infrastructure including compute (ECS/EKS, Lambda), databases (RDS, DynamoDB, ElastiCache), networking (VPC, ALB, CloudFront), and security (IAM, KMS, Secrets Manager).
  • Experience architecting AI infrastructure pipelines with cost optimization and high availability.
  • LLM Frameworks & Agentic AI: Hands-on experience building production applications with Spring AI.
  • Solid understanding of LLM application patterns (prompt management, RAG, context orchestration, vector stores, evaluation) and agentic workflows (multi-step agents, tool-use orchestration, planning loops).
  • Java, TypeScript & Python: 5+ years of professional software engineering with strong proficiency across all three languages — Java (Spring Boot, Spring Cloud), TypeScript (Node.js, modern frameworks), and Python (AI tooling, evaluation frameworks).
  • Comfortable choosing the right language for each task.
  • Enterprise & Large-Scale Systems: Experience designing and operating distributed systems at scale.
  • Familiarity with event-driven architectures, message brokers (Kafka, SQS/SNS), caching (Redis, ElastiCache), and relational/NoSQL database design.
  • DevOps & Infrastructure: Proficiency in CI/CD pipelines, Infrastructure as Code (Terraform, CloudFormation), containerization (Docker, Kubernetes/EKS), and GitOps workflows.
  • Problem Solving & Communication: Excellent analytical skills and the ability to tackle complex, ambiguous challenges independently.
  • Outstanding written and verbal communication — able to articulate technical concepts to diverse audiences and collaborate effectively across teams.
  • Education: Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field (or equivalent practical experience).

Responsibilities

  • Define AI feature specifications upfront — including acceptance criteria, evaluation metrics, prompt contracts, and expected behaviors — and champion this spec-driven approach across the team.
  • Own end-to-end AI feature delivery across the full AI SDLC: spec definition, prototyping, development, evaluation, deployment, and production monitoring.
  • Build production-grade LLM and agentic AI applications using Spring AI — including RAG pipelines, agent orchestration, tool-use patterns, guardrails, and human-in-the-loop workflows.
  • Architect and operate AWS AI infrastructure (Bedrock, Bedrock Agents, Agent Core, SageMaker) alongside core AWS services (ECS/EKS, Lambda, S3, DynamoDB, RDS, API Gateway).
  • Design and implement scalable microservices and distributed systems in Java, TypeScript, and Python that power the Archer AI platform.
  • Build CI/CD pipelines for AI workloads — including LLM evaluation pipelines and automated regression testing for AI outputs — using Terraform, CloudFormation, Docker, Kubernetes, and GitHub Actions.
  • Drive AI-specific operational practices: observability, drift detection, quality scoring, feedback loops, and incident response for non-deterministic systems.
  • Communicate technical concepts clearly to both technical and non-technical stakeholders; author AI specs, design documents, and architectural decision records.
  • Mentor engineers, conduct thorough code reviews, and champion engineering excellence.

Benefits

  • We meet you where you are on your journey, equipping you with the tools and opportunities to build the future you envision.