Data Scientist 3

Gormat

About The Position

We are seeking a Data Scientist proficient in Python and Jupyter Notebook to support a Natural Language Processing (NLP) project that accurately and automatically tokenizes language data of spoken or written origin. You will develop automated solutions for annotating language data with part-of-speech information and improve existing models by scoring their performance against human-generated annotations for speech and text.

The Level 3 Data Scientist shall possess the following capabilities:

Foundations: mathematical, computational, and statistical.
Data Processing: data management and curation, data description and visualization, workflow and reproducibility.
Modeling, Inference, and Prediction: data modeling and assessment, domain-specific considerations.
Ability to make and communicate principal conclusions from data using elements of mathematics, statistics, computer science, and applications-specific knowledge.
Ability to use analytic modeling, statistical analysis, programming, and/or another appropriate scientific method to develop and implement qualitative and quantitative methods for characterizing, exploring, and assessing large datasets in various states of organization, cleanliness, and structure, accounting for the unique features and limitations inherent in Government data holdings.
Ability to translate practical mission needs and analytic questions related to large datasets into technical requirements and, conversely, to assist others with drawing appropriate conclusions from the analysis of such data.
Ability to effectively communicate complex technical information to non-technical audiences.
Ability to train and develop NLP/NER for LLM solutions within an agentic AI framework (LangGraph).
Ability to perform supervised and unsupervised model training and validation for automated knowledge extraction from unstructured natural language data in multiple languages without a predefined ontology.

Familiarity with customer data sources and data retrieval techniques is necessary for producing preprocessed training data, which requires an understanding of techniques to ensure data quality and readiness for integration into the system. An understanding of enterprise data compliance and policy concerns is necessary to ensure solutions are built for end-user access.

Requirements

Proficiency in Python and Jupyter Notebook.
Natural Language Processing (NLP) experience.
TS/SCI with polygraph clearance.

Responsibilities

Develop automated solutions for annotating language data with part-of-speech information.
Improve existing models by scoring performance against human-generated annotations for speech and text.
Make and communicate principal conclusions from data using elements of mathematics, statistics, computer science, and applications-specific knowledge.
Use analytic modeling, statistical analysis, programming, and/or another appropriate scientific method to develop and implement qualitative and quantitative methods for characterizing, exploring, and assessing large datasets in various states of organization, cleanliness, and structure, accounting for the unique features and limitations inherent in Government data holdings.
Translate practical mission needs and analytic questions related to large datasets into technical requirements.
Assist others with drawing appropriate conclusions from the analysis of such data.
Effectively communicate complex technical information to non-technical audiences.
Train and develop NLP/NER for LLM solutions within an agentic AI framework (LangGraph).
Perform supervised and unsupervised model training and validation for automated knowledge extraction from unstructured natural language data in multiple languages without a predefined ontology.
Produce preprocessed training data, requiring an understanding of techniques to ensure data quality and readiness for integration into the system.
Ensure solutions are built for end-user access and reflect an understanding of enterprise data compliance and policy concerns.