BIBA Practice - Cloud Data Lead

HEXAWARE•United States

9h•Onsite

About The Position

We are seeking a Senior Spark Engineer with strong Java expertise to design, develop, and operate high-performance, production-scale data processing pipelines. The role focuses on Apache Spark-based batch and streaming solutions, robust ETL, performance tuning, and close collaboration with data engineering, data science, and platform teams.

Requirements

5+ years of software engineering experience, including at least 3 years building production systems with Apache Spark.
Strong Java development skills (Java 8+); solid understanding of concurrent programming, memory management, and JVM tuning.
Production experience with Spark Core, Spark SQL, and Structured Streaming.
Hands-on experience with Hadoop ecosystem components (HDFS, YARN, Hive) or cloud object storage (S3/GCS/Azure Blob).
Experience integrating with Kafka or other message brokers for real-time ingestion.
Experience with Scala or Python (PySpark) for cross-language integrations.
Java (primary): language proficiency, performance profiling, GC tuning.
Apache Spark: job design, RDD/DataFrame/Dataset APIs, Catalyst optimizer understanding.
Structured Streaming: exactly-once semantics, watermarking, state management.
Data storage: Hive, Parquet/ORC, Avro, schema evolution best practices.
Messaging & ingestion: Apache Kafka (producers/consumers), connectors.
Orchestration & CI/CD: Airflow, Jenkins/GitHub Actions/GitLab CI or equivalent.
Containerization/cluster deployment: YARN; Kubernetes experience for Spark on K8s.
Monitoring & observability: Prometheus/Grafana, ELK/EFK stack, or cloud-native equivalents.
Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent practical experience.

Responsibilities

Own design and development of scalable data pipelines using Apache Spark for batch and streaming workloads.
Implement Spark applications in Java (primary) and integrate with the broader data platform (HDFS/S3, Hive, Kafka, relational and NoSQL stores).
Optimize Spark jobs for performance, memory usage, and resource efficiency; troubleshoot production issues and reduce job failures/latency.
Develop reusable libraries, frameworks, and abstractions to accelerate data engineering work.
Implement data ingestion, transformation, and enrichment patterns, ensuring data quality, schema evolution handling, and idempotence.
Integrate Spark workloads with orchestration and scheduling systems (Airflow/Elasticsearch/NiFi or equivalent).
Build and maintain CI/CD pipelines, automated tests (unit/integration), and deployment practices for data applications.
Collaborate with data scientists to productionize models and feature engineering pipelines.
Drive observability and monitoring for Spark jobs (metrics, logging, alerting).
Mentor and review work of mid/junior engineers; participate in architecture and design reviews.
Ensure security, governance, and compliance requirements are met for data processing.