Staff Software Engineer

DataRobot • Boston, MA

About The Position

DataRobot delivers AI that maximizes impact and minimizes business risk. Our platform and applications integrate into core business processes so teams can develop, deliver, and govern AI at scale. DataRobot empowers practitioners to deliver predictive and generative AI, and enables leaders to secure their AI assets. Organizations worldwide rely on DataRobot for AI that makes sense for their business, today and in the future.

As a Staff Software Engineer focused on Application Scalability & Performance, you will lead the design, implementation, and operation of backend systems that power high-throughput AI applications. Your work will ensure our applications deliver high accuracy, minimal latency, and robust scalability (including autoscaling) while remaining reliable, cost-effective, and maintainable. You will collaborate closely with Product, Platform, and AI/ML teams to deliver high-impact results.

Requirements

7+ years of backend engineering experience building scalable, high-performance distributed systems and services.
Strong experience with performance optimization, e.g. profiling, latency tuning, concurrency, and caching strategies.
Deep experience with autoscaling, resource management, load balancing, and throughput/latency SLAs.
Solid programming skills in one or more backend languages (e.g. Python, Java, Go, C++, or equivalent).
Strong understanding of observability and monitoring (metrics, tracing, logging) and of service instrumentation.
Experience designing and architecting scalable AI-backed services and applications, and integrating AI models into production systems with high performance, reliability, and low latency.
Ability to solve ambiguous challenges and influence technical direction across teams, balancing performance, accuracy, and cost.
Experience operating across multiple cloud providers (AWS, GCP, Azure) and/or hybrid environments.

Nice To Haves

Experience with AI/ML model deployment, serving, inference, and production integration.
Experience with Gen AI / serving LLMs, embeddings, etc.
Exposure to on-prem delivery models or regulated environments.
Experience with Docker and building containerized applications.
Open source software development experience or contributions.

Responsibilities

Architect, build, and lead backend services that scale to meet large-workload, high-concurrency, and low-latency requirements.
Design and implement autoscaling strategies (horizontal/vertical), dynamic resource allocation, and load balancing to ensure responsive, cost-efficient service.
Improve end-to-end request pipelines, optimizing for latency, throughput, reliability, and correctness.
Instrument, monitor, and profile systems in production; identify bottlenecks, troubleshoot performance issues, and proactively tune services.
Collaborate with AI/ML teams to ensure model-serving pipelines uphold accuracy, consistency, and performance under load.
Drive best practices in systems reliability, observability, error handling, capacity planning, resilience, and failover.
Mentor and coach other engineers; provide technical leadership and influence across teams.
Contribute to defining architecture, coding standards, performance benchmarks, and technical roadmap items related to scalability and performance.