Lead Specialty Software Engineer - Observability & Agentic AI Platforms

Wells Fargo & Company•Phoenix, AZ

4d•Onsite

About The Position

Wells Fargo is seeking a Lead AI Ops Engineer to own and advance the Commercial Observability Platform. This role provides technical leadership across agentic AI systems, AI‑powered observability, advanced analytics, and enterprise telemetry platforms, enabling proactive monitoring, faster root cause analysis, and improved operational resilience across critical business applications. This position is intended for a senior, hands‑on AI engineer who will serve as a technical role model and bar raiser, setting standards for engineering excellence in AI‑driven observability and operations.

Requirements

5+ years of Specialty Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
3+ years hands‑on experience in platform engineering, SRE, or observability
3+ years hands‑on software engineering experience building production services, APIs, data pipelines, or AI systems
2+ years experience designing or implementing AI, AI Ops, or ML‑driven systems in production environments (LLMs, generative AI, or agentic AI systems)
2+ years experience with Splunk SPL, including writing advanced, multi‑stage queries equivalent to programmatic logic and building complex Splunk dashboards and analytics used for operational decision‑making
5+ years experience in distributed systems, microservices, and cloud‑native architectures
2+ years hands‑on experience with enterprise observability platforms such as Splunk, Splunk Observability, AppDynamics, Grafana, Prometheus, or equivalent tools
This position is not eligible for visa sponsorship or visa transfer
Ability to work on-site at an approved location

Nice To Haves

Proven ability to perform deep root cause analysis using application code and telemetry data
Experience designing or implementing multi‑agent or autonomous AI workflows
Familiarity with AI frameworks and tooling (for example: LangChain, LangGraph, AutoGen, CrewAI, or equivalent concepts)
Experience designing and building custom telemetry ingestion pipelines or Beacon APIs
Familiarity with OpenTelemetry and modern instrumentation standards
Experience building internal observability, analytics, or AI platforms used by multiple engineering teams
Ability to act as a technical bar raiser, influencing engineering standards across AI, analytics, and observability domains

Responsibilities

Design, build, and maintain production‑grade AI and agentic systems that reason over observability data including logs, metrics, traces, events, and digital experience signals
Develop LLM‑powered workflows to support automated incident analysis, intelligent alerting, operational insights, and root cause analysis (RCA) summaries
Architect and implement agentic or multi‑agent AI workflows that decompose complex operational problems, analyze telemetry across multiple tools, and coordinate actionable recommendations
Apply AIOps and machine learning techniques such as anomaly detection, correlation, pattern recognition, forecasting, noise reduction, and predictive insights
Write and maintain Python‑based AI services, orchestration logic, and data pipelines deployed in production environments
Establish best practices for AI system observability, governance, feedback loops, and continuous improvement
Lead the design, implementation, and evolution of enterprise observability platforms supporting commercial applications
Own and operate observability tools including Splunk Observability, Splunk (logs, metrics, traces), AppDynamics, and Glassbox
Define and enforce standards for telemetry collection, including logging, metrics, distributed tracing, and real user monitoring
Perform and lead complex root cause analysis by analyzing application code, logs, metrics, traces, infrastructure signals, and user experience data
Act as a senior Splunk query developer, designing highly complex SPL queries that function as analytical programs to correlate large volumes of telemetry data
Build and optimize advanced Splunk dashboards using multi‑stage SPL pipelines, statistical functions, joins, lookups, and enrichments
Develop Splunk analytics that power real‑time operational insights, advanced alerting, historical analysis, and AI model inputs
Design and develop Beacon / Telemetry APIs to collect custom application, platform, and business signals
Build and maintain telemetry ingestion services that normalize, store, and enrich data for analytics and AI/ML solutions
Partner closely with application engineering, SRE, and platform teams to improve reliability, performance, and operational maturity
Provide technical leadership and mentoring, serving as a role model for strong AI, analytics, and observability engineering practices
Influence engineering standards and contribute to long‑term observability and AI platform strategy