Staff Engineer, Lustre

DataDirect Networks•San Francisco - Remote, CA

About The Position

This is an opportunity to join a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader in AI and multi-cloud data management at scale, powering many of the world's most demanding AI data centers in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research and manufacturing. Our data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our commitment to innovation, customer success and market leadership, and by a team of passionate professionals who bring their expertise and dedication to every project. This makes the role an exciting and rewarding one for a driven professional looking to make a lasting impact in the world of AI and data storage.

Requirements

10+ years of experience in systems software, distributed systems, storage, Linux kernel or filesystem engineering.
Strong experience in LustreFS development, support or performance engineering with depth in at least one major subsystem.
Strong C programming and Linux systems debugging skills.
Working knowledge of Linux kernel internals, filesystem semantics, networking and performance analysis.
Experience with LNet and/or high-performance transports such as RDMA, InfiniBand, RoCE or TCP-based storage networking.
Ability to debug and resolve issues spanning multiple layers including client, server, network and backend storage.
Strong collaboration skills and the ability to work across functions in a fast-moving engineering environment.

Nice To Haves

Experience in HPC, AI infrastructure or large-scale parallel storage environments.
Exposure to metadata-heavy and throughput-heavy workload characterization and tuning.
Familiarity with ZFS, ldiskfs, NVMe-backed storage and related observability / performance tooling.
Experience creating test plans, reproducer frameworks, runbooks or diagnostic automation.
Comfort using AI tools to accelerate debugging, code reviews, triage, documentation and early-stage design ideation.
Experience mentoring junior engineers or leading focused technical efforts within a subsystem.

Responsibilities

Design, develop and debug LustreFS features, fixes and enhancements across relevant subsystems such as llite, MDS/MDT, OSS/OST, LDLM and LNet.
Investigate customer and scale-related defects, drive root-cause analysis and implement high-quality fixes with strong attention to correctness and maintainability.
Contribute to performance tuning, failure analysis and reliability improvements for large-scale Lustre deployments.
Participate actively in code reviews, design reviews and subsystem discussions, bringing rigor to testing and operational readiness.
Work closely with QE and support to reproduce issues, improve diagnostic data quality and increase coverage for high-risk failure scenarios.
Help document subsystem behavior, debugging approaches, known failure patterns and operational best practices.
Use AI-assisted tools where appropriate to speed up issue triage, summarize logs, improve code understanding and capture reusable lessons learned.