Senior Data Scientist (NLP)

Clarivate•Ann Arbor, MI

40d•$117,000 - $147,000•Remote

About The Position

We are seeking a Senior Data Scientist specializing in Natural Language Processing (NLP) and modern retrieval-augmented generation (RAG) architectures to join our Life Sciences & Healthcare (LS&H) team. This is an amazing opportunity to work on large-scale AI-enabled solutions that modernize and enhance our content delivery systems. You’ll be at the intersection of innovation, architecture, and real-world AI integration. The team consists of several domain and technical experts and reports to the VP of AI, Content. We would love to speak with you if you have deep expertise across text processing pipelines, including indexing, vectorization, prompting, fine-tuning, summarization, and context management, and bring hands-on experience with frameworks like LangChain and LangGraph. Familiarity with architectures such as VRAG and GraphRAG is highly desirable.

Requirements

Bachelor’s degree in Computer Science, Data Science, Computational Linguistics, or a related field
At least 5 years of hands-on experience in data science, focused on natural language processing (NLP)
At least 5 years of experience using Python, with expertise in NLP libraries such as LangChain, LangGraph, or other “Lang”-based toolkits
Proven experience in model development and applying machine learning techniques to real-world problems

Nice To Haves

Expertise in retrieval-based LLM workflows (RAG, VRAG, GraphRAG)
Deep understanding of embedding models, semantic search, and vector stores (e.g., FAISS, Pinecone)
Experience with document loaders and text splitters/document splitting strategies
Familiarity with MLOps practices and production-level deployment of AI pipelines
Experience with cloud platforms (e.g., AWS, Azure, or GCP)
Experience applying Graph Neural Networks (GNNs) to retrieval-augmented generation
Knowledge of LangSmith and vector orchestration platforms
Familiarity with multilingual NLP and cross-lingual embeddings
Exposure to real-time knowledge graphs and stream-based RAG systems
A Master’s or PhD in a technical field (Computer Science, Data Science, etc.)

Responsibilities

Design NLP Workflows: Develop scalable pipelines for text ingestion, cleaning, normalization, and tokenization to support downstream applications.
Implement Indexing and Vectorization Strategies: Architect and maintain robust indexing systems and vector databases for semantic search and retrieval.
Develop Prompting and Fine-Tuning Frameworks: Create reusable prompting strategies and lead fine-tuning initiatives for LLMs tailored to business-specific tasks.
Build LangChain/LangGraph Applications: Construct dynamic knowledge systems and agentic workflows using LangChain and LangGraph.
Integrate Advanced RAG Architectures: Apply VRAG and GraphRAG design patterns to enrich information retrieval and contextual understanding.
Conduct Performance Optimization: Perform benchmark testing and model evaluations to improve accuracy, efficiency, and scalability of NLP systems.
Collaborate Across Teams: Work closely with engineering, product, and research stakeholders to deliver integrated AI-driven features.
Provide Technical Leadership: Mentor junior data scientists, guide best practices, and drive innovation across AI projects.