Lead Infrastructure Platform Support Engineer

Wells Fargo
Concord, CA
Hybrid

About The Position

Wells Fargo is seeking a Lead Infrastructure Engineer to join our AI Platforms and Model Support Group as part of Digital Technology and Innovations. The Lead Infrastructure Engineer is responsible for designing, building, and operating highly scalable, resilient production infrastructure platforms that support enterprise Generative AI and Predictive AI workloads. This role provides technical leadership across GPU-accelerated environments, OpenShift/Kubernetes platforms, and advanced AI infrastructure patterns, including large, AI-factory-scale GPU compute architectures. The engineer partners closely with platform, application, and vendor teams to ensure secure, performant, and production-grade AI solutions.

Requirements

  • 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience troubleshooting complex end-to-end architectures (including CI/CD pipelines)
  • 5+ years of Linux systems experience
  • 4+ years of experience supporting AI/ML platforms
  • 4+ years of Kubernetes / container platform experience including production support

Nice To Haves

  • Experience with Generative AI and Predictive AI platforms.
  • Hands-on GPU platform operations including scheduling, quota, and performance tuning.
  • Experience with OpenShift in GPU-enabled, multi-tenant environments.
  • Experience designing or operating GPU SuperPods.
  • Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
  • Experience building AI- or agent-driven automation tooling (AIOps).
  • Hands-on experience supporting AI/ML workloads on GCP and Azure, including GPU-backed services and managed AI infrastructure.
  • Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
  • Strong experience monitoring AI signals such as inference latency and GPU utilization.
  • Experience with BCP/DR, resiliency, and highly available architectures.

Responsibilities

  • Lead complex initiatives to develop infrastructure to provide solutions for business applications
  • Participate in various projects intended to continually improve or upgrade the infrastructure
  • Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
  • Review and analyze high impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
  • Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third party vendors
  • Design, code, test, debug and document programs using Agile development practices
  • Make decisions on technical designs and implementation plans, and identify project risks and resource requirements
  • Direct the daily risk and control flow of operations, focusing on policies, procedures and work standards to ensure success
  • Recommend courses of action to maintain cost effectiveness and achieve results
  • Collaborate and consult with peers, colleagues and managers to resolve issues and achieve goals
  • Interact with customers and vendors

Benefits

  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
  • Scholarships for dependent children
  • Adoption reimbursement