Principal Engineer

Palm Venture Studios • Redwood City, CA

About The Position

We are looking for a visionary Principal Engineer who will bridge the gap between high-level architecture and hands-on execution, specifically focusing on simplifying enterprise integration for AI agents. As a key hire during our current growth phase, you will define the standards for how our platform scales and interacts with other enterprise applications.

Requirements

10+ years of senior engineering experience at a fast-paced, high-growth technology startup that has successfully scaled from early stage through Series A/B funding (or equivalent growth phase).
5+ years of ML experience, including 2+ years focused on LLMs or agentic workflows.
Proficiency in agent orchestration and memory-augmented systems.
Hands-on experience analyzing tracing and logging data.
Experience using feedback loops to continuously improve ML systems.
Experience building agents that invoke tools or use Model Context Protocol (MCP) to access enterprise data sources.
Proficiency in modern technologies (e.g., Python, semantic search, vector DBs, GraphQL, queues, containers, Kubernetes, real-time data processing, Spark, OpenTelemetry, ClickHouse).
Thrives in startup ambiguity while maintaining the discipline of an enterprise-grade engineer.
Acts as a force multiplier who elevates the technical bar for the entire team.
Obsessed with the practical application of AI systems and capable of building enterprise solutions that solve real-world customer problems.

Responsibilities

Design and implement multi-agent systems and orchestration layers.
Build and operate observability stacks (e.g., OpenTelemetry) to monitor agent reasoning paths, tool usage, and performance in real time.
Develop and enforce technical safety mechanisms—such as input/output filtering and behavioral boundaries—to mitigate risks like hallucinations, prompt injections, and bias.
Analyze telemetry and execution traces to create feedback loops for continuous agent improvement and automated evaluation.
Securely connect agents to external services, unstructured data, and enterprise APIs via robust tool-calling schemas.
Implement fallback mechanisms, human-in-the-loop (HITL) checkpoints, and automated recovery for agentic failures.
Implement best practices for MLOps, monitoring, and performance tuning of AI models in live environments.
Automate SDLC processes and CI/CD pipelines, elevate QA standards, and develop incident response protocols to enable high velocity, availability, and reliability of our platform.
