Staff Software Engineer, Forecasting

Zeta Global • San Francisco, CA

About The Position

We are hiring a hands-on Staff Software Engineer to provide technical leadership for our Forecasting and Recommendations platforms, with a strong focus on production-grade AI and agentic systems. This role centers on designing, building, and operating high-throughput, low-latency distributed systems that power forecasting, recommendations, and AI-driven decisioning at scale. You will work deeply in backend systems, infrastructure, and AI application architecture, while remaining accountable for reliability, observability, and operational excellence. We are looking for an engineer who can contribute immediately, has shouldered real production incidents, and brings strong judgment around building stable, observable, and scalable systems, including modern agentic and LLM-powered applications.

Requirements

10+ years of professional software engineering experience building and operating production-grade distributed systems.
A strong track record of hands-on ownership of business-critical services, including measurable improvements in latency, throughput, stability, or cost.
Deep expertise in systems design, including service boundaries, concurrency, data modeling, failure handling, and scalability tradeoffs.
Production experience supporting machine learning–driven systems (forecasting, recommendations, or similar), with emphasis on serving, pipelines, and infrastructure.
Expert-level experience with AWS, including designing, deploying, and operating large-scale cloud-native systems.
Strong hands-on experience with Kubernetes, containerized microservices, and modern CI/CD pipelines.
Experience operating software in both on-prem data center and AWS cloud environments.
Fluency with modern AI-assisted development tools (e.g., Cursor, GitHub Copilot) and comfort working in “vibe coding”–style workflows that favor fast iteration, tight feedback loops, and continuous refactoring.
Proficiency in one or more backend languages commonly used for large-scale systems (e.g., Python, Java, Go, Scala).
Bachelor’s or master’s degree in computer science, mathematics, or a related field, or equivalent practical experience.

Responsibilities

Design, build, and operate systems supporting forecasting, recommendations, and agentic AI workflows in production.
Write production-quality code daily; own services end-to-end from design through on-call and incident resolution.
Architect low-latency, high-throughput SaaS services, including APIs, data pipelines, model inference, and agent orchestration.
Build and maintain production-grade agentic applications, including tool-using agents, workflow orchestration, and guardrails.
Work fluently with foundational LLMs (e.g., GPT, Claude, Gemini Pro), selecting appropriate models and deployment patterns based on latency, cost, and reliability tradeoffs.
Use frameworks and tooling such as LangChain, voice-agent frameworks, and related ecosystems to accelerate development while enforcing production discipline.
Embrace AI-assisted development workflows (e.g., Cursor, GitHub Copilot, vibe coding paradigms) to move quickly without sacrificing quality.
Champion observability and reliability: metrics, logging, tracing, alerting, and post-incident analysis.
Lead and participate in production incident response, retrospectives, and systemic fixes.
Identify architectural risks early and make design decisions that prevent outages and scalability issues.
Reduce complexity across services, infrastructure, and processes to improve stability and team velocity.
Provide technical guidance across teams and participate in architectural reviews beyond your immediate domain.