About The Position

Keysight’s Applied AI Autonomy Initiative is developing a next-generation agentic orchestration framework that enables AI agents to reason, adapt, and coordinate across complex engineering workflows. Built on LangGraph and reinforcement-inspired feedback mechanisms, the framework transforms prompts and design intents into executable orchestration strategies that evolve autonomously through iterative simulation and validation loops. Our ambition is not merely to replicate human reasoning but to push past human limits, enabling agentic systems to explore design spaces, optimize engineering workflows, and evolve orchestration strategies at a scale and speed no human could achieve.

This role defines the safety, stability, and observability architecture underpinning Keysight’s agentic runtime: the layer that ensures AI-driven orchestration remains interpretable, reversible, and aligned with human intent. You will design the mechanisms that make autonomy trustworthy: guardrails, rollback systems, introspection APIs, and adaptive feedback loops governing every agentic decision and simulator interaction.

Requirements

  • PhD or 5+ years of experience in systems reliability, safety-critical software, or autonomous runtime engineering.
  • Advanced proficiency in Python and C/C++, with experience in hybrid or simulation-based systems.
  • Proven expertise designing fault-tolerant, observable, and recoverable distributed systems.
  • Deep proficiency with agentic orchestration frameworks (LangGraph, LangChain, or equivalents).
  • Strong understanding of intent alignment, policy enforcement, and execution traceability in AI automation.
  • Hands-on experience implementing telemetry, monitoring, and introspection systems in complex runtime architectures.

Nice To Haves

  • Background in mission-critical or regulated runtime systems (e.g., aerospace, industrial control, EDA, or HPC).
  • Experience designing semantic safety validation, policy modeling, and goal disambiguation frameworks.
  • Familiarity with adaptive rollback, dynamic gating, and safety scoring in multi-agent environments.
  • Proficiency with Python/C++ interoperability (PyBind11, gRPC, ZeroMQ).
  • Understanding of deterministic simulation control and real-time anomaly detection in hybrid AI–physics systems.

Responsibilities

  • Architect runtime guardrails and authorization layers ensuring that agent actions remain aligned with operator intent, policy boundaries, and simulation constraints.
  • Implement intent validation, semantic disambiguation, and prompt safety checks before orchestration execution.
  • Define structured safety contracts governing valid operating ranges, escalation paths, and rollback logic.
  • Partner with the Agentic Framework Architect to integrate safety constructs into orchestration semantics and graph-based reasoning flows.
  • Design deterministic rollback and checkpointing mechanisms to restore stable orchestration states after failure and enable automatic recovery paths for misaligned or unsafe agent behavior.
  • Engineer fault-isolation boundaries to contain local agent or simulator errors and prevent systemic instability.
  • Build sandboxed execution environments for validating AI-generated orchestration logic safely.
  • Develop interoperability safety layers between Python and RL technologies to ensure reliable data exchange and robust error containment in simulation-driven loops.
  • Implement comprehensive observability pipelines capturing agent reasoning traces, simulation telemetry, and orchestration health metrics.
  • Create real-time anomaly detection and confidence-scored safety gating to monitor drift, misalignment, or policy violations.
  • Develop introspection APIs and dashboards exposing safety metrics, decision rationales, and performance diagnostics.
  • Collaborate with DevOps and Data Intelligence teams to unify telemetry across heterogeneous runtime components into a coherent monitoring fabric.
  • Establish adaptive feedback systems that adjust orchestration parameters based on observed performance, safety signals, and environmental dynamics.
  • Define self-correcting safety policies enabling agents to learn from past instability and improve compliance autonomously.
  • Integrate safety scoring into promotion gates and validation workflows for runtime certification of agentic logic.
  • Partner with ML and validation engineers to evolve a continuous assurance pipeline that evaluates trust, stability, and interpretability over time.
  • Architect and own the safety, observability, and governance layer of Keysight’s agentic orchestration runtime.
  • Design real-time self-healing and self-correcting mechanisms that detect misalignment, autonomously mitigate instability, and restore safe operational behavior without degrading user experience.
  • Build deterministic rollback, checkpointing, and containment systems for multi-agent and simulation-based environments.
  • Implement multi-layered telemetry, anomaly detection, and runtime introspection pipelines.
  • Integrate observability across LLM, RL, and simulation environments into a unified safety and diagnostics interface.
  • Collaborate cross-functionally to embed transparency, traceability, and adaptive safety into every orchestration cycle.

Benefits

  • Medical, dental and vision
  • Health Savings Account
  • Health Care and Dependent Care Flexible Spending Accounts
  • Life, Accident, Disability insurance
  • Business Travel Accident and Business Travel Health
  • 401(k) Plan
  • Flexible Time Off, Paid Holidays
  • Paid Family Leave
  • Discounts, Perks
  • Tuition Reimbursement
  • Adoption Assistance
  • ESPP (Employee Stock Purchase Plan)
  • Restricted Stock Units