Lead AI Engineer

TIAA • Iselin, NJ

Onsite

About The Position

Nuveen, the investment manager of TIAA, offers a comprehensive range of outcome-focused investment solutions designed to secure the long-term financial goals of institutional and individual investors. Its affiliates offer deep expertise across a broad range of traditional and alternative investments through a wide array of vehicles and customized strategies. Nuveen is a global investment manager that works in partnership with its clients to create outcome-focused solutions that help them reach their financial goals.

The Lead AI Engineer is a key technical leader and major contributor to a high-performing, fast-paced engineering team responsible for designing, building, and deploying enterprise-grade generative AI solutions. This role requires deep expertise in distributed systems, scalable architecture, and cutting-edge AI/ML technologies, with a focus on delivering production-ready applications on AWS using Python.

As a hands-on technical leader, you will be involved in the full software development lifecycle (SDLC), from requirements gathering and architecture design through implementation, deployment, and ongoing optimization. You will design robust, low-latency AI applications, implement DevOps and MLOps best practices, and ensure that all solutions meet enterprise standards for security, governance, and compliance.

Requirements

Bachelor's Degree Required
5+ Years Required; 7+ Years Preferred
5+ years of software engineering experience with demonstrated progression in technical leadership and system design
3+ years of hands-on experience with AI/ML, including at least 1 year focused on Generative AI, LLMs, and production deployment
Expert-level Python programming with deep knowledge of advanced language features, design patterns, performance optimization, and popular frameworks (FastAPI, Flask, Pandas, NumPy).
Full-stack development skills, including backend API development following RESTful design principles, frontend development using React, and database design and optimization (SQL and NoSQL)
Extensive AWS experience with hands-on implementation of compute, storage, networking, security, and AI/ML services.
Production experience with Generative AI technologies: LLM APIs (OpenAI or Anthropic Claude), RAG frameworks and vector databases, prompt engineering and optimization techniques, AI agent frameworks (LangChain and LangGraph), and model fine-tuning and evaluation
Experience building CI/CD pipelines and working with Infrastructure as Code (Terraform, CloudFormation), containerization and orchestration (Docker, Kubernetes/EKS), and monitoring and observability tools
Understanding of distributed systems, microservices architecture, event-driven design, and scalability patterns

Nice To Haves

Experience with MLOps platforms (Domino, SageMaker Pipelines, MLflow)
Experience with additional cloud platforms (Azure, GCP) and multi-cloud architectures
Contributions to open-source AI/ML projects or published research
Knowledge of additional programming languages (C++, Go, Rust)
Experience with real-time streaming and event-driven architecture
Familiarity with advanced AI techniques (multimodal models, vision transformers, diffusion models)
Good communication and technical writing skills
AWS certifications (Solutions Architect, Machine Learning Specialty) preferred

Responsibilities

Design and implement Generative AI solutions using RAG (Retrieval-Augmented Generation) pipelines.
Build end-to-end systems integrating vector databases, embedding models, and LLMs to enable context-aware, knowledge-grounded responses.
Develop robust prompting strategies, templates, and workflows that maximize LLM performance, accuracy, and consistency.
Establish rigorous evaluation frameworks to measure model accuracy, latency, cost, hallucination rates, and task-specific performance metrics; conduct A/B testing and comparative analysis across models and configurations.
Implement comprehensive logging, tracing, and alerting systems to track model behavior, prompt-response patterns, token usage, errors, and drift in production environments.
Build production-grade AI agents using both low-code platforms and custom high-code implementations built with LangChain and LangGraph, optimized for performance and maintainability.
Architect and develop large-scale, cloud-native Python applications using modern frameworks such as FastAPI and Flask, optimized for high performance, low latency, and horizontal scalability.
Design distributed system architectures that leverage AWS services, including Lambda, ECS/EKS, EC2, S3, DynamoDB, RDS/Aurora, ElastiCache, OpenSearch, SQS, SNS, EventBridge, Step Functions, Bedrock, and Textract, as well as the Domino and SageMaker platforms.
Build responsive, intuitive user interfaces using React, TypeScript/JavaScript, and modern frontend frameworks to deliver seamless user experiences for AI-powered applications.
Implement API design best practices, including RESTful principles, OpenAPI/Swagger documentation, versioning strategies, rate limiting, authentication/authorization, and error handling.
Optimize application performance through caching strategies, asynchronous processing, connection pooling, efficient data serialization, and proactive bottleneck identification.
Design for reliability and resilience by implementing retry logic, circuit breakers, graceful degradation, health checks, and disaster recovery mechanisms.
Establish and enforce CI/CD best practices using GitHub Actions, Jenkins, GitLab CI, or AWS CodePipeline to automate build, test, and deployment processes.
Implement Infrastructure as Code (IaC) using Terraform, AWS CloudFormation, or CDK to enable consistent, version-controlled, and reproducible infrastructure provisioning.
Design and manage containerized applications using Docker for packaging and Kubernetes (EKS) or ECS for orchestration, ensuring efficient resource utilization and auto-scaling.
Implement robust testing strategies including unit tests, integration tests, end-to-end tests, performance tests, and AI-specific testing (prompt regression tests, model output validation).
Establish observability and monitoring frameworks using tools such as CloudWatch, Prometheus, Langfuse, or LangSmith to track system health, application performance, model behavior, and business metrics.
Apply security best practices, including IAM, least-privilege access, role-based access control (RBAC), multi-factor authentication, encryption at rest and in transit, secure key management, and data masking/tokenization for sensitive information.
Configure VPCs, security groups, network ACLs, and private endpoints to minimize attack surface.
Implement input validation, output encoding, SQL injection prevention, and secure API authentication (OAuth 2.0, JWT).
Maintain comprehensive documentation of system architectures, data flows, security controls, and operational procedures to support compliance audits and knowledge transfer.