Design and develop EMR pipelines using AWS services such as SQS queues, EC2 instances, AWS Data Pipeline, S3 buckets, AWS Glue, RDS, and others. Create extract-transform-load (ETL) EMR pipelines based on Hadoop, Hive, the YARN resource manager, NiFi, Spark, and Python frameworks in AWS. Interpret the data mapping document to identify source systems such as SQL Server and develop the required Spark transformations for ingesting the ETL data into the Titan platform. Optimize Spark jobs using PySpark after a complete analysis of multiple parameters and of opportunities to improve the target systems, along with data quality checks. Create Spark DataFrames/RDDs and load the data in different formats, including JSON, Parquet, Avro, CSV, and others. Evaluate new architectures and technologies such as Snowflake, Debezium, and other tools to improve the performance and efficiency of ETL tasks. Complete data requests from CDK product customers and help them debug and resolve any data quality issues in a timely manner. Work closely with CDK customers and provide feedback to the CDK service team to improve the reliability of our products. Work within scaled agile methodologies to increase the quality of the deliverables. Monitor and resolve production L1/L2 issues. 100% telecommuting.
Job Type
Full-time
Career Level
Senior