About The Position

Wells Fargo is seeking a Lead AI Ops Engineer to own and advance the Commercial Observability Platform. This role provides technical leadership across agentic AI systems, AI‑powered observability, advanced analytics, and enterprise telemetry platforms, enabling proactive monitoring, faster root cause analysis, and improved operational resilience across critical business applications. This position is intended for a senior, hands‑on AI engineer who will serve as a technical role model and bar raiser, setting standards for engineering excellence in AI‑driven observability and operations.

Requirements

  • 5+ years of Specialty Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 3+ years hands‑on experience in platform engineering, SRE, or observability
  • 3+ years hands‑on software engineering experience building production services, APIs, data pipelines, or AI systems
  • 2+ years experience designing or implementing AI, AI Ops, or ML‑driven systems in production environments (LLMs, generative AI, or agentic AI systems)
  • 2+ years experience with Splunk SPL, including writing advanced, multi‑stage queries equivalent to programmatic logic and building complex Splunk dashboards and analytics used for operational decision‑making
  • 5+ years experience in distributed systems, microservices, and cloud‑native architectures
  • 2+ years hands‑on experience with enterprise observability platforms such as Splunk, Splunk Observability, AppDynamics, Grafana, Prometheus, or equivalent tools
  • This position is not eligible for visa sponsorship or visa transfer
  • Ability to work on-site at an approved location

Nice To Haves

  • Proven ability to perform deep root cause analysis using application code and telemetry data
  • Experience designing or implementing multi‑agent or autonomous AI workflows
  • Familiarity with AI frameworks and tooling (for example: LangChain, LangGraph, AutoGen, CrewAI, or equivalent concepts)
  • Experience designing and building custom telemetry ingestion pipelines or Beacon APIs
  • Familiarity with OpenTelemetry and modern instrumentation standards
  • Experience building internal observability, analytics, or AI platforms used by multiple engineering teams
  • Ability to act as a technical bar raiser, influencing engineering standards across AI, analytics, and observability domains

Responsibilities

  • Design, build, and maintain production‑grade AI and agentic systems that reason over observability data including logs, metrics, traces, events, and digital experience signals
  • Develop LLM‑powered workflows to support automated incident analysis, intelligent alerting, operational insights, and root cause analysis (RCA) summaries
  • Architect and implement agentic or multi‑agent AI workflows that decompose complex operational problems, analyze telemetry across multiple tools, and coordinate actionable recommendations
  • Apply AIOps and machine learning techniques such as anomaly detection, correlation, pattern recognition, forecasting, noise reduction, and predictive insights
  • Write and maintain Python‑based AI services, orchestration logic, and data pipelines deployed in production environments
  • Establish best practices for AI system observability, governance, feedback loops, and continuous improvement
  • Lead the design, implementation, and evolution of enterprise observability platforms supporting commercial applications
  • Own and operate observability tools including Splunk Observability, Splunk (logs, metrics, traces), AppDynamics, and Glassbox
  • Define and enforce standards for telemetry collection, including logging, metrics, distributed tracing, and real user monitoring
  • Perform and lead complex root cause analysis by analyzing application code, logs, metrics, traces, infrastructure signals, and user experience data
  • Act as a senior Splunk query developer, designing highly complex SPL queries that function as analytical programs to correlate large volumes of telemetry data
  • Build and optimize advanced Splunk dashboards using multi‑stage SPL pipelines, statistical functions, joins, lookups, and enrichments
  • Develop Splunk analytics that power real‑time operational insights, advanced alerting, historical analysis, and AI model inputs
  • Design and develop Beacon / Telemetry APIs to collect custom application, platform, and business signals
  • Build and maintain telemetry ingestion services that normalize, store, and enrich data for analytics and AI/ML solutions
  • Partner closely with application engineering, SRE, and platform teams to improve reliability, performance, and operational maturity
  • Provide technical leadership and mentoring, serving as a role model for strong AI, analytics, and observability engineering practices
  • Influence engineering standards and contribute to long‑term observability and AI platform strategy

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000 employees
