Data Scientist (TS/SCI with Poly Required)

GCI Careers • McLean, VA

About The Position

GCI embodies excellence, integrity and professionalism. The employees supporting our customers deliver unique, high-value mission solutions while effectively leveraging the technological expertise of our valued workforce to meet critical mission requirements in the areas of Data Analytics and Software Development, Engineering, Targeting and Analysis, Operations, Training, and Cyber Operations. We maximize opportunities for success by building and maintaining trusted and reliable partnerships with our customers and industry. At GCI, we solve the hard problems.

Requirements

Demonstrated experience building production data pipelines and ETL/ELT workflows at scale.
Demonstrated experience with Apache Spark and PySpark for distributed data processing.
Demonstrated advanced Python programming skills, including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Demonstrated understanding of data security, privacy, governance, and compliance principles.
Demonstrated experience with workflow orchestration tools such as Step Functions and Airflow.
Demonstrated experience with containerization such as Docker or Podman, and deploying data applications in cloud environments.
Demonstrated experience with AWS services (in particular S3, Lambda, and Step Functions).
Demonstrated experience with PostgreSQL and MySQL in production environments, including performance tuning and schema design.
Demonstrated experience with SQL and query optimization for complex analytical workloads.
Demonstrated experience with version control (Git) and CI/CD practices for data pipelines.
Demonstrated experience working with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Demonstrated strong problem-solving and debugging skills for data quality issues, pipeline failures, and performance bottlenecks.
U.S. citizenship.
Active TS/SCI clearance with polygraph.

Nice To Haves

Demonstrated experience with data lakehouse architectures using Apache Iceberg.
Demonstrated experience configuring, deploying, and integrating data platform components: Apache Ranger (access control and data governance); Trino (distributed SQL query engine); data catalogs (Unity Catalog OSS, Apache Polaris, etc.); and Apache Superset (data visualization and dashboarding).
Demonstrated experience with Bash scripting for automation and data processing tasks.
Demonstrated experience with Infrastructure as Code (Terraform or CloudFormation) for data infrastructure.
Demonstrated experience tracking data lineage with tooling such as OpenLineage.
Demonstrated experience with Java.
Demonstrated experience with data quality frameworks, testing methodologies, and validation strategies.
Demonstrated experience with large-scale data migrations or platform modernization efforts.
Demonstrated experience integrating AI/ML services and models (translation, OCR, speech-to-text, NLP, language detection, topic modeling), LLMs, and RAG (retrieval-augmented generation) pipelines.
Demonstrated experience with geospatial data processing (H3, PostGIS, or similar).
Demonstrated experience contributing to data engineering documentation, best practices, or design patterns.
Demonstrated experience with NoSQL databases (DynamoDB, etc.).
Excellent written and verbal communication skills with both technical and non-technical audiences.

Responsibilities

Building production data pipelines and ETL/ELT workflows at scale.
Utilizing Apache Spark and PySpark for distributed data processing.
Applying advanced Python programming skills including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Understanding and applying data security, privacy, governance, and compliance principles.
Using workflow orchestration tools such as Step Functions and Airflow.
Deploying data applications in cloud environments using containerization such as Docker or Podman.
Leveraging AWS services (in particular S3, Lambda, and Step Functions).
Working with PostgreSQL and MySQL in production environments, including performance tuning and schema design.
Writing and optimizing SQL queries for complex analytical workloads.
Utilizing version control (Git) and CI/CD practices for data pipelines.
Working with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Solving problems and debugging data quality issues, pipeline failures, and performance bottlenecks.
Configuring, deploying, and integrating data platform components: Apache Ranger (access control and data governance); Trino (distributed SQL query engine); data catalogs (Unity Catalog OSS, Apache Polaris, etc.); and Apache Superset (data visualization and dashboarding).
Automating tasks and processing data using Bash scripting.
Implementing Infrastructure as Code (Terraform or CloudFormation) for data infrastructure.
Tracking data lineage using tooling such as OpenLineage.
Integrating AI/ML services and models (translation, OCR, speech-to-text, NLP, language detection, topic modeling), LLMs, and RAG (retrieval-augmented generation) pipelines.
Processing geospatial data (H3, PostGIS, or similar).
Contributing to data engineering documentation, best practices, or design patterns.
Communicating effectively with both technical and non-technical audiences.