AI/ML Data Scientist Intern

Command Post Technologies, Inc. • Suffolk, VA

Onsite

About The Position

We are looking for a curious and driven AI/ML Data Scientist Intern to join our team in Suffolk, Virginia. This internship offers a hands-on opportunity for students or early-career professionals with a foundation in Computer Science to gain real-world experience in artificial intelligence, machine learning, and data science. You will work alongside experienced engineers and data professionals to build, fine-tune, and deploy machine learning models, construct retrieval-augmented generation pipelines, and curate high-quality datasets that support organizational objectives.

Requirements

Linux Foundations – Basic understanding of Linux operating systems, including file system navigation, user management, permissions, and command-line operations.
Python Basics – Foundational proficiency in Python programming, including the ability to write scripts, work with libraries, manipulate data structures, and debug code.
Agentic AI – Familiarity with the concepts and architecture behind agentic AI systems, including how autonomous agents plan, reason, and execute multi-step tasks.
Hugging Face – Experience navigating the Hugging Face ecosystem, including the ability to load pre-trained models, tokenizers, and datasets from the Hugging Face Hub.
Dataset Curation – Understanding of how to source, clean, label, and organize datasets for machine learning training and evaluation purposes.
LoRA Fine-Tuning – Knowledge of Low-Rank Adaptation (LoRA) techniques for efficiently fine-tuning large language models with reduced computational overhead.
RAG Pipelines – Understanding of retrieval-augmented generation architecture, including how to connect language models with external knowledge sources to improve response accuracy.
Document Extraction – Familiarity with techniques and tools for extracting structured data from unstructured documents such as PDFs, scanned images, and web pages.
Chunking Strategies – Knowledge of methods for splitting large documents into smaller, semantically coherent segments optimized for embedding and retrieval.
Embedding Models – Understanding of how text embedding models work and how they are used to represent documents as vectors for similarity search and retrieval applications.
Basic Networking – Understanding of core networking concepts including IP addresses, subnetting, the OSI model, and the functional differences between Layer 2 and Layer 3 protocols.
Azure Virtual Desktop Concepts – Familiarity with Azure Virtual Desktop components, including Host Pools, Workspaces, and Application Groups.
HTML, JavaScript, React – Foundational knowledge of front-end web technologies, including the ability to read and understand HTML structure, JavaScript logic, and React component architecture.

Nice To Haves

Vector Databases – Experience working with vector database platforms such as Pinecone, Weaviate, or ChromaDB for storing and querying high-dimensional embeddings.
LangChain or LlamaIndex – Familiarity with orchestration frameworks used to build applications powered by large language models.
Prompt Engineering – Knowledge of techniques for crafting effective prompts to guide large language model behavior and improve output quality.
MLOps and Model Deployment – Experience with tools and workflows for packaging, deploying, and monitoring machine learning models in production environments.
Docker & Containerization – Basic understanding of container concepts and experience running applications in Docker or Kubernetes environments.
Transformer Architectures – Understanding of the transformer model architecture, including self-attention mechanisms and how they power modern language models.
Data Annotation and Labeling – Experience with data annotation workflows and labeling tools used to prepare supervised learning datasets.
Evaluation Metrics for Generative AI – Knowledge of how to assess the quality of generative AI outputs using automatic metrics such as BLEU, ROUGE, or perplexity, as well as human evaluation frameworks.
Cloud Platforms for ML Workloads – Exposure to cloud-based machine learning services on AWS, GCP, or Azure for training, hosting, and scaling models.
Version Control Systems (Git) – Familiarity with Git workflows for managing code, collaborating with teams, and tracking project history.
Microsoft Entra ID – Familiarity with Microsoft’s identity and access management platform for managing user authentication and permissions.
API Calls – Experience making and testing API calls using tools such as Postman, cURL, or similar utilities.
Azure Services – Broader exposure to Azure services beyond the fundamentals, such as Azure Storage, Azure Networking, or Microsoft Entra ID (formerly Azure Active Directory).
Node.js / .NET API – Experience building or consuming APIs using Node.js or .NET.
Azure Serverless Functions – Familiarity with event-driven, serverless computing in Azure for running lightweight backend processes.
Visio or Other Drawing Application – Ability to create data flow diagrams, system architecture visuals, or workflow documentation using Microsoft Visio or comparable tools such as draw.io or Lucidchart.

Responsibilities

Assist in the development and fine-tuning of large language models using techniques such as LoRA to optimize model performance for specific use cases.
Support the design and implementation of retrieval-augmented generation (RAG) pipelines to enhance AI-driven applications with relevant, contextual data.
Curate, clean, and prepare datasets for training and evaluation, ensuring data quality and relevance across projects.
Work with embedding models to convert text and documents into vector representations for search and retrieval systems.
Develop and refine chunking strategies for processing large documents into manageable, semantically meaningful segments.
Extract structured information from unstructured documents using automated document extraction techniques.
Build and experiment with agentic AI workflows that enable autonomous task execution and decision-making.
Contribute to front-end interfaces and internal tools using HTML, JavaScript, and React to support data visualization and model interaction.
Document processes, experiments, and findings for internal knowledge sharing and reproducibility.