Technical Architect - Machine Learning

Quantiphi

Hybrid

About The Position

While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people and take pride in fostering a culture built on transparency, diversity, integrity, learning and growth. If working in an environment that encourages you to innovate and excel, not just in your professional life but in your personal life too, interests you, you would enjoy your career with Quantiphi!

About Quantiphi: Quantiphi is an award-winning, AI-First digital engineering and consulting company focused on delivering high-impact Services and Solutions that help organizations solve what truly matters. We partner with enterprises to reimagine their businesses through intelligent, scalable, and transformative AI, driving measurable outcomes at the very core of their operations. Since our founding in 2013, Quantiphi has tackled some of the world’s most complex business challenges by combining deep industry expertise, disciplined cloud and data engineering practices, and cutting-edge applied AI research. Our work is rooted in delivering accelerated, quantifiable business value, not just technology for technology’s sake.

Headquartered in Boston, Quantiphi is a global organization with 4,000+ professionals serving clients across key industry verticals, including BFSI, Healthcare & Life Sciences, CPG, MFG, and TME. As an Elite and Premier partner to leading cloud and AI platforms such as NVIDIA, Google Cloud, AWS, and Snowflake, we build and deliver enterprise-grade AI services and solutions that create real-world impact.

We’ve been recognized with:
17x Google Cloud Partner of the Year awards in the last 8 years
3x AWS AI/ML award wins
3x NVIDIA Partner of the Year titles
2x Snowflake Partner of the Year awards

We have also garnered top analyst recognitions from Gartner, ISG, and Everest Group. We offer first-in-class industry solutions across Healthcare, Financial Services, Consumer Goods, Manufacturing, and more, powered by cutting-edge Generative AI and Agentic AI accelerators. We have been certified as a Great Place to Work for the third year in a row (2021, 2022, 2023).

Be part of a trailblazing team that’s shaping the future of AI, ML, and cloud innovation. Your next big opportunity starts here! For more details, visit Quantiphi's website or LinkedIn page.

Requirements

6-8 years of hands-on experience in machine learning and AI engineering, with a proven track record of taking ML systems to production
Demonstrated expertise in building multi-agent systems and agentic workflows, preferably with LangGraph/CrewAI
Expert-level Python proficiency with ML frameworks (TensorFlow, PyTorch, Transformers)
Experience with FastAPI, async programming, and microservices architecture
Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB) and building scalable RAG systems
Experience with LLM application monitoring tools (LangSmith, Weights & Biases, custom telemetry solutions)
Proven ability to architect and implement complex AI systems from scratch in production environments
Production-level experience with at least one major cloud platform (AWS, GCP, or Azure), including compute services (EC2, GCE, Azure VMs), serverless functions (Lambda, Cloud Functions, Azure Functions), container orchestration (EKS, GKE, AKS), and managed AI/ML services (SageMaker, Vertex AI, Azure ML)
Strong skills in Infrastructure as Code (Terraform, CloudFormation), CI/CD pipelines (GitHub Actions, Jenkins), and containerization (Docker, Kubernetes)
Exceptional problem-solving and analytical thinking, with the ability to tackle complex, ambiguous challenges
Strong communication skills to explain complex agentic concepts to both technical and non-technical stakeholders
Proven ability to work independently and drive large-scale projects to completion with minimal supervision
Leadership mindset with experience mentoring team members and driving technical excellence

Nice To Haves

Experience with prompt engineering techniques, fine-tuning SLMs (PEFT, SFT, RLHF), and model optimization
Knowledge of distributed systems, message queues, and event-driven architectures for agent coordination
Familiarity with SDLC best practices, version control (Git), and agile development methodologies
Experience with tool-calling agents, multi-step workflows, and stateful orchestration (e.g. graphs, planners, routers).
Hands-on evals for agents: trajectory / tool-use checks, golden traces, LLM-as-judge with fixed rubrics, regression suites.
Online evals, drift thinking, and clear quality gates before or after deploy (thresholds, alerts, rollback criteria).
Safety and abuse: prompt injection via tools, untrusted retrieval, PII handling in prompts and logs, allowlists and guardrails.
Cost and latency discipline: budgets per run, timeouts, caps on turns and tool calls.
Model lifecycle: routing / gateway patterns, version pinning, fallbacks, and which model for which step.
Memory and state: what is persisted, retention, redaction, and what must never be stored
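
As an illustration of the cost and latency item above (budgets per run, timeouts, caps on turns and tool calls), a minimal Python sketch of a per-run budget might look like the following. All names here (`RunBudget`, `BudgetExceeded`, the default limits) are hypothetical and chosen for illustration only; they are not taken from any Quantiphi system or from a specific agent framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses one of its configured limits."""


@dataclass
class RunBudget:
    # Illustrative default limits (assumptions, tune per use case).
    max_turns: int = 8
    max_tool_calls: int = 20
    max_cost_usd: float = 0.50
    timeout_s: float = 60.0
    # Running totals for the current run.
    turns: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, *, turns: int = 0, tool_calls: int = 0, cost_usd: float = 0.0) -> None:
        """Record usage, then fail fast if any limit has been crossed."""
        self.turns += turns
        self.tool_calls += tool_calls
        self.cost_usd += cost_usd
        self.check()

    def check(self) -> None:
        """Raise BudgetExceeded if the run is over any cap or past its timeout."""
        if self.turns > self.max_turns:
            raise BudgetExceeded("turn cap exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost budget exceeded")
        if time.monotonic() - self.started_at > self.timeout_s:
            raise BudgetExceeded("run timed out")
```

In an agent loop, the orchestrator would call `charge(...)` after each model turn or tool call and treat `BudgetExceeded` as a signal to stop the run and return a partial result or fall back.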

Responsibilities

Architect & Build Agentic Systems: Design and develop end-to-end multi-agent systems from scratch. You will create the foundational agent harnesses, define communication protocols, and build orchestration layers using frameworks like CrewAI, LangGraph, and AutoGen.
Make architectural decisions that ensure hierarchical and collaborative multi-agent structures with well-defined agent roles, responsibilities, and communication protocols, as well as dynamic task decomposition, sophisticated tool integration, planning mechanisms (ReAct), and self-correction loops
Develop state management systems and memory mechanisms for persistent agent interactions
Engineer Advanced Agent Capabilities: Develop custom agent tools and define specialized agent skills that empower agents to perform complex, domain-specific tasks.
Pioneer Context Engineering: Implement advanced context engineering and memory systems to ensure agents maintain state, learn from interactions, and make informed decisions in dynamic environments.
Own the deployment, scaling, and maintenance of robust, low-latency agentic systems on major cloud platforms (GCP, AWS, or Azure).
Implement best-in-class MLOps practices for monitoring, continuous integration/continuous deployment (CI/CD), and system reliability.
Integrate LLMs to serve as the core reasoning engines for autonomous agents. You will apply advanced techniques like RAG and PEFT to optimize performance.
Create and maintain comprehensive tool libraries for agents including API integrations, database queries, and external service connections
Design and implement RAG systems using vector databases (Pinecone, Weaviate, ChromaDB)
Develop custom tools and plugins that enable agents to interact with various enterprise systems and APIs
Ensure tool reliability, error handling, and seamless integration within agentic workflows
Implement comprehensive monitoring and tracing systems for agent behavior, performance, cost optimization, and latency analysis
Design novel evaluation frameworks to assess multi-step agentic task success, reliability, and accuracy
Utilize advanced observability tools (LangSmith, Arize AI, or custom solutions) to trace agent decision-making processes
Establish metrics and KPIs for measuring agentic system performance in production environments