Data Scientist

Ardent Principles, Inc.

Onsite

About The Position

We’re looking for a Data Scientist who can build and operate production-grade data pipelines, deliver scalable ETL/ELT workflows, and solve complex analytical challenges in cloud environments. In this role, you’ll work with PySpark and distributed processing, design optimized SQL solutions, and develop secure, well-engineered data applications using Python and modern orchestration tools.

Ardent Principles offers advanced services in data science, data engineering, software engineering, AI solutions, cybersecurity, staff augmentation, and IT program management. We offer a competitive salary range and a comprehensive, industry-leading benefits package designed to support long-term stability and employee well-being. We provide more than a position: we offer a workplace committed to excellence, integrity, and mission-focused impact. Our mission is to act as a bridge between satisfied clients and fulfilled employees. Your job and well-being are our top priorities, because your satisfaction leads to the success of our clients. Join us as we continue building the future of secure, high-impact solutions.

Requirements

Active TS/SCI with Full Scope Polygraph
Full-time onsite in McLean, VA
Building production data pipelines and ETL/ELT workflows at scale.
Using Apache Spark and PySpark for distributed data processing.
Advanced Python programming skills including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Understanding data security, privacy, governance, and compliance principles.
Workflow orchestration tools such as Step Functions and Airflow.
Containerization such as Docker or Podman, and deploying data applications in cloud environments.
AWS services (in particular S3, Lambda, and Step Functions).
PostgreSQL and MySQL in production environments, including performance tuning and schema design.
SQL and query optimization for complex analytical workloads.
Version control (Git) and CI/CD practices for data pipelines.
Working with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Strong problem-solving and debugging skills for data quality issues, pipeline failures, and performance bottlenecks.

Nice To Haves

Data lakehouse architectures using Apache Iceberg.
Configuring, deploying, and integrating data platform components: Apache Ranger (access control and data governance), Trino (distributed SQL query engine), Data catalogs (Unity Catalog OSS, Apache Polaris, etc.), and Apache Superset (data visualization and dashboarding).
Bash scripting for automation and data processing tasks.
Infrastructure as Code (Terraform or CloudFormation) for data infrastructure.
Tracking data lineage and associated tooling such as OpenLineage.
Using Java.
Data quality frameworks, testing methodologies, and validation strategies.
Background with large-scale data migrations or platform modernization efforts.
Integrating AI/ML services and models (translation, OCR, speech-to-text, NLP, language detection, topic modeling), LLMs, and RAG (retrieval-augmented generation) pipelines.
Geospatial data processing (H3, PostGIS, or similar).
Contributing to data engineering documentation, best practices, or design patterns.
NoSQL databases (DynamoDB, etc.).
Excellent written and verbal communication skills with both technical and non-technical audiences.
Linux operating systems.
Agile/Scrum development methodologies in a fast-paced, collaborative team environment.
Working effectively in high-performing, cross-functional teams with multiple concurrent projects.
Working directly with stakeholders to gather requirements, understand needs, and translate them into technical solutions with minimal oversight.
Self-directed work with a strong ownership mentality and commitment to code quality, testing, and documentation.
Context-switching between projects and systems as priorities demand.

Responsibilities

Build production data pipelines and ETL/ELT workflows at scale.
Use Apache Spark and PySpark for distributed data processing.
Apply advanced Python programming skills, including data manipulation libraries (Pandas, NumPy) and data engineering best practices.
Apply data security, privacy, governance, and compliance principles.
Orchestrate workflows with tools such as Step Functions and Airflow.
Containerize data applications with tools such as Docker or Podman, and deploy them in cloud environments.
Work with AWS services (in particular S3, Lambda, and Step Functions).
Run PostgreSQL and MySQL in production environments, including performance tuning and schema design.
Write and optimize SQL queries for complex analytical workloads.
Use version control (Git) and CI/CD practices for data pipelines.
Work with stakeholders to understand data requirements, assess feasibility, and design appropriate solutions with minimal oversight.
Troubleshoot and debug data quality issues, pipeline failures, and performance bottlenecks.