About The Position

Innodata (NASDAQ: INOD) is a leading data engineering company. With more than 2,000 customers and operations in 13 cities around the world, we are the AI technology solutions provider of choice for 4 out of 5 of the world’s biggest technology companies, as well as for leading companies across financial services, insurance, technology, law, and medicine. By combining advanced machine learning and artificial intelligence (ML/AI) technologies, a global workforce of subject matter experts, and a high-security infrastructure, we’re helping bring the promise of clean, optimized digital data to every industry. Innodata offers a powerful combination of digital data solutions and easy-to-use, high-quality platforms. Our global workforce includes over 3,000 employees in the United States, Canada, the United Kingdom, the Philippines, India, Sri Lanka, Israel, and Germany. We’re poised for a period of explosive growth over the next few years.

Requirements

  • Advanced proficiency in Python for backend and large-scale data processing
  • Strong experience building and managing big data pipelines in production environments
  • Hands-on expertise with workflow orchestration tools such as Airflow or Google Cloud Composer
  • Proven experience in batch and streaming data processing using Apache Spark and Apache Beam (Dataflow); see the pipeline sketch after this list
  • Experience designing and operating event-driven systems using Pub/Sub
  • Strong understanding of distributed systems architecture and scalability patterns
  • Experience managing globally distributed, low-latency datasets
  • Hands-on experience with NoSQL databases and/or Google Cloud Spanner
  • Strong knowledge of system reliability, fault tolerance, and performance optimization
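As a reference point for the Spark/Beam and Pub/Sub items above, here is a minimal sketch of the kind of streaming pipeline those requirements describe: an Apache Beam job that reads from a Pub/Sub topic, applies fixed windowing, and counts messages per window. The project and topic names are hypothetical placeholders, and a production job would target the DataflowRunner rather than running locally.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        # streaming=True marks the source as unbounded; in production the
        # runner would be DataflowRunner with project/region flags set.
        options = PipelineOptions(streaming=True)

        with beam.Pipeline(options=options) as p:
            (
                p
                # "projects/example-project/topics/events" is a placeholder topic.
                | "Read" >> beam.io.ReadFromPubSub(
                      topic="projects/example-project/topics/events")
                | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
                # Fixed 60-second windows let the count fire on an unbounded stream.
                | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
                | "Count" >> beam.combiners.Count.Globally().without_defaults()
                | "Print" >> beam.Map(print)
            )

    if __name__ == "__main__":
        run()

The same pipeline shape runs in batch by swapping ReadFromPubSub for a bounded source such as ReadFromText.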

Nice To Haves

  • Proficiency in Go, Java, or Scala
  • Experience with Kafka or Flume for streaming ingestion (a minimal consumer sketch follows this list)
  • Deep familiarity with the Google Cloud Platform ecosystem
  • Experience with production monitoring, logging, and observability frameworks
  • Exposure to high-availability, multi-region deployments
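For the Kafka item above, a minimal ingestion sketch using the kafka-python client; the broker address, topic name, and group id are hypothetical placeholders:

    from kafka import KafkaConsumer  # kafka-python client

    # Broker, topic, and consumer group below are placeholders.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="ingestion-demo",
        auto_offset_reset="earliest",  # replay from the start if no committed offset
    )

    for record in consumer:
        # Each record exposes topic, partition, offset, key, and raw value bytes.
        print(record.topic, record.partition, record.offset, record.value)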

Responsibilities

  • Design, build, and optimize scalable data pipelines for batch and real-time processing (an orchestration sketch follows this list)
  • Develop and maintain event-driven architectures for high-throughput systems
  • Ensure data reliability, performance, and low-latency processing across distributed environments
  • Collaborate with data scientists and application teams to enable analytics and AI use cases
  • Implement best practices in performance tuning, monitoring, and cost optimization
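As a sketch of the orchestration side of these responsibilities, here is a minimal Airflow DAG (Airflow 2.4+ syntax) that wires three placeholder tasks into a daily batch pipeline; the DAG id and bash commands are hypothetical stand-ins for real job submissions:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A hypothetical daily extract-transform-load pipeline; the echo commands
    # stand in for real Spark/Beam job submissions.
    with DAG(
        dag_id="daily_batch_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # the "schedule" parameter requires Airflow 2.4+
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        extract >> transform >> load  # linear dependency chain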