Principal AI Cloud Infrastructure Engineer

Truist, Charlotte, NC
Onsite

About The Position

The AI Cloud Infrastructure Engineer owns the cloud infrastructure, environment architecture, compute management, networking, and platform operations that enable the Forge to build, deploy, scale, and operate AI and agentic systems in production with enterprise-grade reliability, security, and governance.

This is a hands-on senior infrastructure engineering role. The engineer designs and operates the cloud environments, container platforms, networking layers, identity boundaries, deployment pipelines, and runtime infrastructure that AI and agentic workloads depend on. Azure is the primary cloud, with support for AWS and Google Cloud where specific AI services or workload requirements warrant multi-cloud deployment.

Daily work includes provisioning and managing cloud environments, designing and maintaining container orchestration platforms, building Infrastructure as Code, managing compute and GPU resources for AI workloads, configuring networking and environment isolation, operating CI/CD deployment infrastructure, implementing identity and access controls at the infrastructure layer, instrumenting observability and telemetry, optimizing cost and performance, and ensuring all infrastructure meets Forge security, governance, and operational standards.

This role is the foundation that everything else in the Forge runs on. If the infrastructure is wrong, nothing built on top of it will be reliable, secure, or scalable.

Requirements

  • Bachelor’s degree in an Information Systems-related field, or equivalent education and related training
  • Minimum of five years of experience with leading-edge, complex, state-of-the-art technologies and/or techniques, with additional experience in software development
  • Recognized in the industry for experience and knowledge; this knowledge may have been gained through more intensive experience, such as working at a technology development company
  • Strong business and financial acumen and effective communication skills
  • Ability to establish strong relationships within the technical community
  • Ability to serve as a visionary concerning future technological capabilities and operational scenarios; ability to create new business models and technologies
  • Ability to create, manage and drive change
  • Ability to unify activities within the technology community, coordinating with other businesses and engineering organizations, as needed
  • 5+ years of cloud infrastructure engineering experience with strong hands-on depth in Azure, including compute, networking, identity, storage, and container services.
  • Demonstrated experience designing and operating cloud environments for production enterprise workloads with high availability, security, and governance requirements.
  • Strong experience with container orchestration platforms (Kubernetes, AKS, or equivalent), including cluster management, node pools, networking, security, and workload scheduling.
  • Hands-on experience with Infrastructure as Code using Terraform, Bicep, or equivalent tooling with version control, peer review, and automated validation practices.
  • Experience designing and operating CI/CD deployment pipelines for cloud-native applications and services.
  • Strong understanding of cloud networking including virtual networks, subnets, private endpoints, network security groups, DNS, load balancing, and traffic controls.
  • Experience implementing identity and access management, secrets handling, and least-privilege controls for cloud resources and deployment infrastructure.
  • Experience with infrastructure observability including metrics, logging, alerting, and dashboards for cloud and container platforms.
  • Ability to work across architecture, implementation, security, reliability, and operational concerns rather than isolated provisioning tasks.
  • Strong written and verbal communication skills, especially for architecture documentation, operational runbooks, and cross-functional technical collaboration.

Nice To Haves

  • Experience provisioning and managing GPU compute, AI inference endpoints, or model-serving infrastructure in cloud environments.
  • Experience with multi-cloud environments, including AWS and Google Cloud alongside Azure as the primary platform.
  • Experience with Azure-specific AI and platform services including Azure OpenAI, Azure AI Search, Azure API Management, Azure Monitor, Microsoft Entra ID, and Microsoft Fabric.
  • Experience with container security, image governance, runtime policy enforcement, and supply chain security for containerized workloads.
  • Experience implementing policy-as-code, automated compliance scanning, and infrastructure drift detection for governed enterprise environments.
  • Experience in financial services, cybersecurity, or other highly regulated enterprise environments with strong audit, control, and environment separation requirements.
  • Experience with cost optimization, resource right-sizing, and FinOps practices for cloud AI workloads.
  • Experience supporting AI-specific infrastructure patterns including vector database hosting, model registry infrastructure, evaluation environments, and agent runtime platforms.
  • Experience mentoring engineers, reviewing infrastructure design, and operating as a senior technical contributor with broad platform impact.

Responsibilities

  • Design, provision, and operate cloud environments for AI and agentic workloads across development, testing, staging, and production tiers with clear separation, security boundaries, and promotion controls.
  • Manage Azure as the primary cloud platform, with support for AWS and Google Cloud where specific AI services, model hosting, or workload requirements dictate multi-cloud deployment.
  • Implement and maintain environment isolation patterns that protect the bank, enforce regulatory boundaries, and enable safe experimentation without production risk.
  • Operate cloud subscriptions, resource groups, tagging strategies, cost management, and resource lifecycle governance aligned to Forge operating standards.
  • Design, deploy, and operate container orchestration platforms (AKS, Kubernetes, or equivalent) that host AI applications, agent runtimes, API services, and supporting workloads.
  • Manage compute resources for AI workloads, including GPU provisioning, scaling policies, resource quotas, node pool management, and workload scheduling optimized for AI inference and training patterns.
  • Implement container security patterns including non-root execution, read-only filesystems, capability restrictions, image scanning, registry governance, and runtime policy enforcement.
  • Support containerized deployment of AI models, agent services, evaluation harnesses, and supporting microservices with production-grade reliability and performance.
  • Build and maintain all infrastructure using Infrastructure as Code (Terraform, Bicep, or equivalent) with version control, peer review, automated validation, and drift detection.
  • Design and operate CI/CD deployment infrastructure that supports automated build, test, security scan, and promotion of AI workloads through environment tiers to production.
  • Implement deployment patterns including blue-green, canary, rolling updates, and rollback capabilities for AI services and agent runtimes.
  • Manage pipeline security including secrets injection, credential rotation, service principal governance, and least-privilege deployment identities.
  • Design and maintain network architecture including virtual networks, subnets, private endpoints, service endpoints, network security groups, and traffic controls for AI workloads.
  • Implement network isolation and segmentation that protects AI systems, data flows, and inter-service communication from unauthorized access or lateral movement.
  • Configure and manage API gateways, load balancers, DNS, TLS/SSL, and ingress controllers for AI application and agent service endpoints.
  • Partner with security teams to implement infrastructure-level controls for identity, access, encryption at rest and in transit, key management, and audit logging.
  • Implement and manage identity and access controls for cloud resources, container platforms, deployment pipelines, and AI service endpoints using managed identities, service principals, and role-based access control.
  • Enforce least-privilege access across all infrastructure tiers, with clear separation between development, testing, and production permissions.
  • Support secrets management, certificate lifecycle, and credential rotation for AI services and agent integrations.
  • Instrument infrastructure observability including metrics, logs, traces, alerts, and dashboards for cloud resources, container platforms, networking, and deployment pipelines.
  • Monitor infrastructure health, capacity, performance, cost, and availability for AI workloads with proactive alerting and remediation workflows.
  • Build and maintain operational runbooks, incident response procedures, and escalation paths for infrastructure-related issues affecting AI and agentic systems.
  • Drive continuous improvement in infrastructure reliability, deployment speed, cost efficiency, and operational maturity.
  • Ensure all cloud infrastructure meets Forge security standards, enterprise governance requirements, and regulatory compliance expectations for a regulated financial services environment.
  • Implement policy-as-code and automated compliance checks for infrastructure configurations, deployment pipelines, and runtime environments.
  • Maintain infrastructure documentation, architecture diagrams, configuration evidence, and audit artifacts required for governance and regulatory review.
  • Support deployment gate validation by providing infrastructure readiness evidence for AI and agentic solution releases.

Benefits

  • medical
  • dental
  • vision
  • life insurance
  • disability
  • accidental death and dismemberment
  • tax-preferred savings accounts
  • 401(k) plan
  • at least 10 days of vacation during the first year of employment (prorated based on date of hire and full-time or part-time status)
  • 10 sick days (also prorated)
  • paid holidays
  • defined benefit pension plan (depending on the position and division)
  • restricted stock units (depending on the position and division)
  • deferred compensation plan (depending on the position and division)