What you'll do

1. AI Systems Engineering
Design and implement large-scale, production-grade AI systems that integrate LLMs and Generative AI into real-world applications.
Build frameworks that support Retrieval-Augmented Generation (RAG), agentic workflows, and multi-step reasoning at scale.
Ensure models and agents are production-ready with strong observability, monitoring, and performance optimization.

2. Architecture & Scalability
Architect distributed, fault-tolerant systems capable of supporting high-throughput AI workloads.
Lead the design of modular, extensible, and reusable components to accelerate AI adoption across teams.
Build MVPs quickly, validate assumptions, and iterate toward scalable long-term solutions.

3. Integration & Delivery
Partner with product and platform teams to integrate AI into customer-facing and enterprise-grade applications.
Define and enforce standards for APIs, services, and infrastructure that enable seamless AI adoption.
Balance functional requirements with non-functional goals such as reliability, latency, and security.

4. Leadership & Mentorship
Drive technical strategy for AI initiatives and guide teams in best practices for AI-driven software development.
Mentor engineers across software and AI domains to elevate overall technical expertise.
Contribute to thought leadership in AI engineering through internal frameworks, design patterns, and reusable components.

What you'll bring

12+ years of experience in software engineering (backend, distributed systems, large-scale platforms), with 2+ years applying Generative AI/LLMs in production.
Proven expertise in distributed computing, cloud-native architectures (GCP, Azure, or AWS), and systems that prioritize scalability and fault tolerance.
Strong coding skills in Python (preferred) and at least one system-level language (Java, Go, or C++).
Experience with ML/AI frameworks (PyTorch, TensorFlow, Hugging Face) is a plus, provided it was applied to building systems, not just training models.
Deep knowledge of RAG pipelines, vector databases, and real-time data integration.
Familiarity with resilience engineering: disaster recovery, failover, monitoring, and high availability.
Exposure to multi-modal AI (text, image, video) and optimization techniques (quantization, distillation) is advantageous.
Strong grounding in system design, performance engineering, and design patterns.
Track record of delivering production systems with AI at scale, not just research or prototyping.

Job Type
Full-time
Career Level
Principal
Industry
General Merchandise Retailers
Number of Employees
5,001-10,000 employees