Data Engineer

J5 Consulting • Cedar View, VA

4d • Onsite

About The Position

The Sponsor’s office is architecting and creating a secure data ecosystem that aligns with the Sponsor’s data strategy and mission capabilities. This includes developing platforms for enterprise search, digital forensics, and data analytics. The role involves leveraging modern, data-centric approaches to maximize the value extracted from existing data.

Requirements

Demonstrated experience with Agile/Scrum development methodologies in a fast-paced, collaborative team environment.
Demonstrated experience working effectively in high-performing, cross-functional teams with multiple concurrent projects.
Demonstrated experience working directly with stakeholders to gather requirements, understand needs, and translate them into technical solutions with minimal oversight.
Demonstrated experience in self-directed work with a strong ownership mentality and commitment to code quality, testing, and documentation.
Demonstrated experience context-switching between projects and systems as priorities demand.
Demonstrated experience building production data pipelines and ETL/ELT workflows at scale.
Demonstrated experience with Apache Spark and PySpark for distributed data processing.
Demonstrated advanced Python programming skills, including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Demonstrated understanding of data security, privacy, governance, and compliance principles.
Demonstrated experience with workflow orchestration tools such as Step Functions and Airflow.
Demonstrated experience with containerization such as Docker or Podman, and deploying data applications in cloud environments.
Demonstrated experience with AWS services (in particular S3, Lambda, and Step Functions).
Demonstrated experience with PostgreSQL and MySQL in production environments, including performance tuning and schema design.
Demonstrated experience with SQL and query optimization for complex analytical workloads.
Demonstrated experience with version control (Git) and CI/CD practices for data pipelines.
Demonstrated experience working with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Demonstrated strong problem-solving and debugging skills for data quality issues, pipeline failures, and performance bottlenecks.
US Citizenship.
Active U.S. Government Top Secret Security Clearance with a Full Scope Polygraph.

Nice To Haves

Demonstrated experience with data lakehouse architectures using Apache Iceberg.
Demonstrated experience configuring, deploying, and integrating data platform components: Apache Ranger (access control and data governance); Trino (distributed SQL query engine); Data catalogs (Unity Catalog OSS, Apache Polaris, etc.); Apache Superset (data visualization and dashboarding).
Demonstrated experience with Bash scripting for automation and data processing tasks.
Demonstrated experience with Infrastructure as Code (Terraform or CloudFormation) for data infrastructure.
Demonstrated experience with tracking data lineage and associated tooling such as OpenLineage.
Demonstrated experience with Java.
Demonstrated experience with data quality frameworks, testing methodologies, and validation strategies.
Demonstrated experience with large-scale data migrations or modernization efforts.
Demonstrated experience integrating AI/ML services and models (translation, OCR, speech-to-text, NLP, language detection, topic modeling), LLMs, and RAG (retrieval-augmented generation) pipelines.
Demonstrated experience with geospatial data processing (H3, PostGIS, or similar).
Demonstrated experience contributing to data engineering documentation, best practices, or design patterns.
Demonstrated experience with NoSQL databases (DynamoDB, etc.).
Demonstrated excellent written and verbal communication skills with both technical and non-technical audiences.

Responsibilities

Building production data pipelines and ETL/ELT workflows at scale.
Utilizing Apache Spark and PySpark for distributed data processing.
Applying advanced Python programming skills, including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Understanding and implementing data security, privacy, governance, and compliance principles.
Using workflow orchestration tools such as Step Functions and Airflow.
Containerizing applications using Docker or Podman and deploying data applications in cloud environments.
Leveraging AWS services, particularly S3, Lambda, and Step Functions.
Working with PostgreSQL and MySQL in production environments, including performance tuning and schema design.
Optimizing SQL queries for complex analytical workloads.
Using version control (Git) and CI/CD practices for data pipelines.
Working with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Troubleshooting data quality issues, pipeline failures, and performance bottlenecks.
Configuring, deploying, and integrating data platform components such as Apache Ranger, Trino, data catalogs (Unity Catalog OSS, Apache Polaris, etc.), and Apache Superset.
Writing Bash scripts for automation and data processing tasks.
Using Infrastructure as Code (Terraform or CloudFormation) for data infrastructure.
Tracking data lineage and utilizing associated tooling like OpenLineage.
Integrating AI/ML services and models (translation, OCR, speech-to-text, NLP, language detection, topic modeling), LLMs, and RAG pipelines.
Processing geospatial data using H3, PostGIS, or similar tools.
Contributing to data engineering documentation, best practices, or design patterns.
Working with NoSQL databases (DynamoDB, etc.).
Communicating effectively with both technical and non-technical audiences.