Data Engineer

Apple
Cupertino, CA
Onsite

About The Position

Design, build, and maintain data pipelines that extract data from various sources, such as databases (PostgreSQL, Cassandra, Iceberg, and Hadoop), APIs, data lakes, cloud storage, and log files, in order to collect and consolidate data from multiple sources into a central data warehouse for reporting, analytics, and business intelligence purposes. Understand data sources, configure data extraction processes, manage data ingestion using PySpark or Python, and automate the pipelines using Airflow to power data sources for analytics platforms like Tableau. Collaborate with machine learning engineers, data scientists, analysts, software engineers, and managers to understand their data requirements and deliver reliable, distributed data pipelines that feed data analytics and data visualization platforms, allowing Apple's stakeholders to easily leverage data in a self-service manner.

Perform data transformation tasks, including data cleaning, normalization, aggregation, and enrichment, to prepare data for analytics and reporting pipelines. Utilize SQL, scripting languages (Python), and ETL (Extract, Transform, Load) tools to manipulate and prepare data for predictive, statistical, and trend analysis. Develop new and creative methodologies, such as self-optimizing data pipelines and a unified pipeline that integrates and harmonizes data streams from various sources in real time, to evaluate test coverage and test pass rates and continually improve Siri by delivering feedback to engineering partners. Optimize existing data pipelines and database queries to improve performance and minimize the latency of Tableau dashboards. Identify and resolve bottlenecks, streamline data transformation processes, and implement indexing strategies to improve data retrieval performance in databases.
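
As a flavor of the ingestion and automation work described above, here is a minimal, hypothetical Airflow (2.4+) sketch of an extract-and-load pipeline; the DAG id, schedule, and task callables are illustrative placeholders, not an actual Apple pipeline:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_from_postgres():
        ...  # placeholder: pull new rows from a source PostgreSQL table

    def load_to_warehouse():
        ...  # placeholder: write consolidated rows to the central warehouse

    with DAG(
        dag_id="example_ingestion_pipeline",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_postgres)
        load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
        extract >> load  # extract must finish before load starts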

Requirements

  • Master's degree or foreign equivalent in Computer Science, Engineering, Mathematics, Statistics, Business Analytics or a related field
  • 2 years of experience in the job offered or related occupation
  • 1 year of experience utilizing Tableau, including data preparation, data modeling, and data visualization
  • Experience monitoring metrics and assessing early signals or trends to identify or predict issues
  • Experience utilizing Java to build data infrastructure
  • Experience with Java libraries for improving data processing pipelines, and experience processing large volumes of data
  • Experience utilizing Microsoft Azure or Google Cloud (formerly Google Cloud Platform, GCP) to build, manage, and analyze data at scale in a cloud environment
  • Experience utilizing Jupyter to prototype and explore data, and performing data manipulation, analysis, and transformation
  • Experience extracting data from Iceberg, PostgreSQL databases, and static Excel (CSV) files
  • Experience optimizing Extract, Transform, Load (ETL) pipelines
  • Experience utilizing Hadoop and Cassandra to store large volumes of structured and unstructured data
  • Experience utilizing MySQL and PostgreSQL to store and query relational data, and experience manipulating tables, tuning performance, and designing databases
  • Experience utilizing Python and Spark to automate data-related workflows and processes
  • Experience transforming data using NumPy and pandas (a pandas sketch follows this list)
  • Experience performing statistical analysis and visualization
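
As a flavor of the NumPy/pandas requirement above, here is a minimal sketch covering the cleaning, normalization, and aggregation steps the role calls for; the file name and column names are hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("events.csv")                     # hypothetical static CSV extract
    df = df.dropna(subset=["user_id"])                 # cleaning: drop rows missing a key
    df["latency_ms"] = df["latency_ms"].clip(lower=0)  # cleaning: floor impossible values
    # normalization: z-score the latency column
    df["latency_z"] = (df["latency_ms"] - df["latency_ms"].mean()) / df["latency_ms"].std()
    # aggregation: one summary row per day
    daily = (
        df.groupby("event_date")
          .agg(events=("user_id", "count"),
               p95_latency=("latency_ms", lambda s: np.percentile(s, 95)))
          .reset_index()
    )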

Responsibilities

  • Design, build, and maintain data pipelines that extract data from various sources such as databases (PostgreSQL, Cassandra, Iceberg, and Hadoop), APIs, data lakes, cloud storage, or log files
  • Collect and consolidate data from multiple sources into a central data warehouse for reporting, analytics, and business intelligence purposes
  • Understand data sources, configure data extraction processes, manage data ingestion using PySpark or Python, and automate the pipelines using Airflow to power data sources for analytics platforms like Tableau (a PySpark sketch follows this list)
  • Collaborate with machine learning engineers, data scientists, analysts, software engineers, and managers to understand their data requirements and deliver reliable, distributed data pipelines that feed data analytics and data visualization platforms, allowing Apple's stakeholders to easily leverage data in a self-service manner
  • Perform data transformation tasks, including data cleaning, normalization, aggregation, and enrichment to prepare data for analytics and reporting pipelines
  • Utilize SQL, scripting languages (Python), and ETL (Extract, Transform, Load) tools to manipulate and prepare data for predictive, statistical, and trend analysis
  • Develop new and creative methodologies, such as self-optimizing data pipelines and a unified pipeline that integrates and harmonizes data streams from various sources in real time, to evaluate test coverage and test pass rates and continually improve Siri by delivering feedback to engineering partners
  • Optimize existing data pipelines and database queries to improve performance and minimize the latency of Tableau dashboards
  • Identify and resolve bottlenecks, streamline data transformation processes, and implement indexing strategies to improve data retrieval performance in databases
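
As referenced in the ingestion bullet above, here is a minimal PySpark sketch of reading from PostgreSQL over JDBC, applying a light daily aggregation, and landing the result in a data lake; the connection details, table, and output path are hypothetical, and a real job would also need the PostgreSQL JDBC driver on the Spark classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ingest_orders").getOrCreate()

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales")  # hypothetical source
        .option("dbtable", "public.orders")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Light transformation before landing: derive a date and aggregate per day
    daily = (
        orders.withColumn("order_date", F.to_date("created_at"))
              .groupBy("order_date")
              .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
    )

    daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://lake/daily_orders/")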

Benefits

  • Comprehensive medical and dental coverage
  • Retirement benefits
  • A range of discounted products and free services
  • Reimbursement for certain educational expenses — including tuition