AI Diagnostics & Observability Engineer

Sage Care Inc
Palo Alto, CA
Posted 9 days ago

About The Position

Own and build the full diagnostic, observability, and root-cause analysis (RCA) infrastructure that makes Sage Care’s AI assistant trustworthy and debuggable, both in real time and post-call. This engineer builds the visibility layer across telephony, transcription, reasoning, SOP traversal, and tool-calling; creates dashboards for both engineers and live human supervisors; and implements automated triage and notification pipelines that surface issues to the right module owners immediately. The role sits at the intersection of LLM orchestration, voice pipelines, transcription, SOP engines, and operations, serving as the connective tissue across the stack. Your work enables rapid RCA, real-time intervention, and continuous improvement of our clinical AI assistants.

Requirements

  • Strong backend engineer experienced with diagnostics, observability, and event-driven tracing.
  • Expert in Python, logging systems, real-time pipelines, and distributed debugging.
  • Deep familiarity with LLM agents, LangGraph or state-machine frameworks, tool-calling architectures, and telemetry or tracing frameworks.
  • Comfortable designing both backend data pipelines and frontend dashboards (React, D3, WebSockets, or equivalent).

Nice To Haves

  • Telephony/Voice: SIP, WebRTC, Twilio, audio streaming pipelines.
  • Clinical operations, call-center workflows, or mission-critical HITL supervision systems.
  • Observability stacks (Grafana, ELK, OpenTelemetry, Sentry).
  • Clustering/ML techniques for failure pattern discovery.

Responsibilities

  • Build automated RCA pipelines to detect and classify failure modes:
      • Hallucinations
      • Misrouted intents
      • Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
      • Unrecoverable SOP loops
      • Broken state transitions
      • Telephony dropouts / DTMF issues
  • Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution.
  • Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth.
  • Automatically compute performance, safety, reliability, and coverage metrics.
  • Build live and post-call dashboards that visualize:
      • Full call timeline
      • SOP/state machine traversal
      • Agent reasoning traces
      • Tool invocation history
      • Divergence from expected behavior
  • Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots.
  • Build triage dashboards for engineering and operations teams to rapidly understand system health.
  • Trace call-level events (dropouts, retries, audio playback issues).
  • Detect DTMF misfires and incorrect action routing.
  • Analyze turn segmentation, word-error-rate drift, boosting performance, and latency.
  • Visualize errors in context (audio, transcript, aligned timecodes).
  • Audit intent classification accuracy and subgraph routing.
  • Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments.
  • Validate tool call correctness (maps, SMS, search, internal SOP tools).
  • Architect a live SOP state-machine tracer with:
      • Real-time transcript overlays
      • Current state + next expected state
      • Deviation alerts
  • Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:
      • Loops
      • Latency spikes
      • Failed tool calls
      • Repeated incorrect decisions
  • Provide human specialists with escalation alerts and context.
  • Build an intervention console for on-call specialists, enabling:
      • “Skip step”
      • “Say apology”
      • “Escalate to human”
      • “Send SMS”
      • “Repeat last message”
      • Override of SOP steps while maintaining auditability and continuity
  • Build clustering systems (via embeddings or metadata) to group systemic failure modes:
      • Intent misroutes under noisy audio
      • Repeated missing tool calls
      • Looped state machine traversal
      • Hallucinated follow-ups or invalid summaries
  • Generate recurring-failure reports to guide engineering improvements.
  • Design and implement an automated triage and notification system that:
      • Detects failure category and severity in real time.
      • Routes incidents to the correct module owners: telephony, transcription, LLM orchestration, SOP/decision-tree team, and platform reliability.
      • Sends structured payloads containing trace graphs, relevant logs, transcript segments, SOP divergence snapshots, and suggested RCA labels.
      • May integrate notifications with PagerDuty, Slack (rich message blocks), Jira auto-ticket creation, and internal incident pipelines.
  • Extend pipelines to automatically generate human-readable failure summaries with:
      • Call-level trace graphs
      • Tool call sequences
      • Transcript context
      • Classified failure types
      • Suggested root causes
  • Store snapshots for operational handoff and debugging.

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed
Number of Employees: 1-10 employees
