Staff Engineer

DataDirect Networks
San Francisco - Remote, CA
Remote

About The Position

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." - IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments" - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project.

This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

DDN is seeking a Staff Replication Development Engineer to lead the design and development of the replication engine for the Infinia AI Data Platform. This role focuses on building enterprise-grade asynchronous replication capabilities that enable reliable and secure disaster recovery for large-scale data systems.
You will work on developing high-performance replication pipelines, efficient data synchronization mechanisms, and secure data transfer systems. This role requires deep expertise in distributed systems and strong technical leadership to deliver a scalable and resilient replication foundation.

Requirements

  • 8+ years of experience in distributed systems, storage systems, or backend software engineering
  • Strong programming skills in one or more languages: C++, Go, Java, or Rust
  • Experience designing and building data replication systems, data pipelines, or distributed data services
  • Deep understanding of distributed systems concepts (consistency, availability, scalability, fault tolerance)
  • Strong expertise in multi-threading, concurrency, and parallel processing
  • Knowledge of networking protocols and secure communication (TCP/IP, HTTP/HTTPS, TLS)
  • Experience implementing data integrity mechanisms (checksums, validation, consistency checks)
  • Experience designing and building REST APIs and service-based architectures
  • Familiarity with checkpointing, failure recovery, and retry mechanisms in distributed systems
  • Basic understanding of observability concepts (metrics, logging, alerting)
  • Strong debugging, problem-solving, and system design skills

Nice To Haves

  • Experience with asynchronous replication, disaster recovery (DR), or backup systems
  • Familiarity with object storage or large-scale data storage systems
  • Knowledge of delta encoding, change data capture, or incremental data synchronization techniques
  • Experience building high-throughput, low-latency data movement systems
  • Exposure to security practices including mutual TLS, encryption, and authentication
  • Experience working on enterprise-scale data platforms or storage products
  • Familiarity with performance optimization and large-scale system tuning

Responsibilities

  • Design and develop multi-threaded asynchronous replication systems with parallel streaming capabilities
  • Build object-level delta replication with checkpointing and resume functionality
  • Develop replication engines supporting bucket/share-level replication controls
  • Implement secure data transfer mechanisms using TLS 1.3 with mutual authentication
  • Ensure end-to-end data integrity through checksum validation and verification pipelines
  • Design and implement manual failover workflows for disaster recovery scenarios
  • Build and maintain REST APIs for replication configuration, control, and automation
  • Develop metadata tracking and change detection systems to enable efficient replication
  • Implement RPO visibility, alerting, and operational insights for replication status
  • Contribute to monitoring dashboards focused on replication health and performance
  • Ensure systems are designed for high availability, fault tolerance, and scalability
  • Partner with QA teams to drive performance, resiliency, and scale validation
  • Collaborate with backend, security, and platform teams to deliver end-to-end replication workflows
  • Participate in debugging, production issue resolution, and continuous improvement of replication reliability
  • Provide technical leadership, architectural guidance, and mentorship to the engineering team

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed
Number of Employees: 251-500 employees
