Principal, AI Platform Engineering

Ares Management Corporation • New York, NY

About The Position

We are seeking an exceptional Principal AI Platform Engineer to design and build an enterprise-grade generative AI platform from the ground up. This is a leadership role that combines deep technical expertise in AI systems architecture with the strategic vision to shape how our organization scales AI capabilities across all business domains. You will architect a comprehensive platform spanning model gateways, retrieval services, model registries, prompt libraries, and deployment pipelines—enabling teams across the firm to build, deploy, and operationalize AI applications with confidence, compliance, and security.

Requirements

10+ years of software engineering experience, with 5+ years building large-scale, distributed systems or platform infrastructure
3+ years of hands-on experience with generative AI, LLMs, RAG systems, or AI infrastructure—either in production systems or applied research
Deep expertise in one or more of Python, Go, Rust, or Java; experience building APIs and orchestration systems
Strong understanding of LLM architectures, prompting strategies, fine-tuning, and RAG design patterns
Demonstrated experience with: model serving (vLLM, Ollama, TensorFlow Serving), vector databases, and embedding models
Proficiency in cloud platforms (AWS, GCP, Azure) and containerization/orchestration (Docker, Kubernetes)
Experience designing and building multi-tenant, secure platform systems with strong governance and observability
Demonstrated expertise in security: architecture, secure coding practices, authentication/authorization, encryption, and threat modeling
Experience with compliance frameworks and security certifications: SOC 2, ISO 27001, GDPR, or similar
Track record of leading technical initiatives from architecture through production deployment
Excellent communication skills; ability to explain complex technical and security concepts to executives and cross-functional teams

Nice To Haves

Experience in financial services, private equity, or alternative assets technology environments
Familiarity with LangChain, LlamaIndex, or similar AI orchestration frameworks
Experience with MLOps tools and practices: model versioning, feature stores, experiment tracking
Knowledge of eval frameworks, retrieval evaluation, or AI model benchmarking
Experience with data governance platforms or metadata management systems
Experience building zero-trust architectures or implementing security controls in cloud-native environments
Contributions to open-source AI/ML projects or publications in the AI/ML space
Experience in building developer platforms or internal tools that drive organizational adoption

Responsibilities

Platform Architecture & Design
Design and build a foundational AI platform that enables secure, scalable, and compliant generative AI across the enterprise
Architect multi-LLM gateway capabilities to support diverse model providers, allowing teams to leverage best-of-breed models for different use cases
Establish platform standards and patterns that balance flexibility, safety, governance, and performance

Core Platform Components
Develop multi-LLM gateway: unified interface for accessing multiple LLM providers with load balancing, fallback handling, and cost optimization
Build retrieval-augmented generation (RAG) services: enterprise search, semantic indexing, and document retrieval at scale
Create model registry and governance: centralized catalog of models, versions, fine-tuning metadata, performance metrics, and compliance tracking
Design prompt library and version control: organizational repository for prompts with testing, evaluation, and A/B testing capabilities
Implement Model Context Protocol (MCP) gateway: enable secure integration between AI applications and external tools, APIs, and data sources
Build FinOps infrastructure: cost tracking, optimization, and allocation across models, usage patterns, and business units

Agent-to-Agent (A2A) Workflows
Design orchestration framework for complex, multi-step AI workflows across applications
Enable reliable, scalable execution of chained AI operations with state management and error recovery
Integrate with broader data ecosystem for workflow triggers and data pipelines

Data Gateway Integration
Partner with data platform teams to design AI-native data access patterns
Enable secure, governed access to enterprise data for RAG and model training
Build metadata and lineage tracking for AI-consumed data

Deployment & DevOps
Design sandbox-to-production pipelines: safe, repeatable processes for testing and deploying AI applications
Implement CI/CD for AI models: versioning, testing, promotion, and rollback capabilities
Build observability and monitoring: telemetry, performance metrics, cost tracking, and compliance auditing
Establish disaster recovery and high-availability patterns

Collaboration & Enablement
Work closely with the Data Products team to align platform capabilities with data governance and analytics infrastructure
Partner with AI Enablement teams to provide tools, SDKs, documentation, and best practices that democratize AI development
Lead technical discussions on platform strategy, roadmap, and trade-offs across the organization
Improve the internal developer experience and drive platform adoption

Security Architecture & Implementation
Design and implement comprehensive security architecture aligned with firm cyber and information security guidelines
Build authentication and authorization frameworks: role-based access control (RBAC), attribute-based access control (ABAC), and service-to-service authentication
Implement encryption standards: encryption at rest (AES-256 or equivalent) and in transit (TLS 1.2+) for all sensitive data
Design secure API gateways and service boundaries with rate limiting, request validation, and DDoS protection
Implement secrets management: secure storage and rotation of credentials, API keys, and certificates
Build comprehensive audit logging and monitoring: all access, modifications, and security events logged with immutable audit trails
Partner with Infosec and Security Operations to implement continuous security monitoring and threat detection

Governance, Compliance & Risk Management
Ensure platform compliance with regulatory requirements: SOC 2 Type II, data residency, and audit trails
Implement data governance: classify data sensitivity levels, enforce data handling policies, and ensure appropriate access controls
Build model governance: track model provenance, versioning, training data lineage, and approval workflows for production deployment
Prevent data exfiltration and prompt injection attacks through input validation, output filtering, and rate limiting
Establish responsible AI practices: bias detection, fairness assessment, and explainability requirements
Manage third-party vendor security: assess LLM provider security postures, data processing agreements, and compliance certifications
Create model risk assessment framework: evaluate models for regulatory, market, and operational risks before production deployment
Work with Compliance, Legal, and Risk teams to ensure the platform meets all governance requirements and documentation standards