About The Position

The Leidos Digital Modernization Sector is seeking an Unstructured Data Engineer. This position allows for full-time telework from any U.S.-based location.

Position Summary: We are seeking a highly skilled and innovative Unstructured Data Engineer to lead the design, implementation, and operationalization of unstructured data pipelines supporting Retrieval-Augmented Generation (RAG) and enterprise AI initiatives. This role serves as the technical expert responsible for transforming raw, unstructured content into trusted, governed, AI-ready data products. The ideal candidate has deep experience with RAG architectures, document preprocessing, metadata enrichment, vectorization, and embedding workflows, and understands how to operationalize these capabilities at enterprise scale. Experience with Ohalo Data xRay or similar unstructured data processing platforms is strongly preferred.

Requirements

  • Bachelor’s degree in Computer Science, Data Engineering, AI/ML, or related field and 8+ years of relevant experience.
  • Hands-on experience designing and implementing RAG architectures in production environments.
  • Experience working with unstructured data (PDFs, documents, email, transcripts, images with OCR, etc.).
  • Strong proficiency in Python and experience with NLP/LLM frameworks (e.g., LangChain, LlamaIndex, Hugging Face, OpenAI APIs).
  • Experience with vector databases (e.g., Pinecone, Weaviate, FAISS, OpenSearch, Azure AI Search).
  • Experience implementing document chunking, embedding generation, and similarity search.
  • Understanding of metadata modeling and governance principles.
  • Experience building scalable data pipelines in cloud environments (AWS, Azure, or GCP).
  • Hands-on experience with prompt engineering, evaluation metrics, and context window optimization.
  • Strong understanding of multi-modal data processing and pipeline engineering.
  • Strong knowledge of API integration and microservices architecture.
  • US Citizenship is required.
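For context, the chunking, embedding generation, and similarity search workflow named in the requirements can be sketched as follows. This is a toy illustration, not production code: the hash-based bag-of-words embedding stands in for a real embedding model, and a brute-force cosine scan stands in for a vector database index.

```python
import hashlib
import math

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each token into a fixed-size, L2-normalized vector.
    A real pipeline would call an embedding model here instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scored = sorted(chunks,
                    key=lambda c: -sum(a * b for a, b in zip(q, embed(c))))
    return scored[:top_k]
```

In a production RAG pipeline the same three stages remain, but chunking is tuned to the document type, embeddings come from a model, and search is delegated to a vector store such as those listed above.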

Nice To Haves

  • Experience with Ohalo Data xRay or similar unstructured data discovery and redaction platforms.
  • Experience aligning RAG pipelines with enterprise Data Governance frameworks (e.g., Collibra).
  • Familiarity with data classification, CUI/PII handling, and redaction controls.
  • Experience packaging datasets as governed data products with defined SLAs and stewardship.
  • Experience integrating AI-ready datasets into enterprise tools such as ChatGPT Enterprise, AskSage, or similar AI copilots.
  • Understanding of model evaluation metrics for retrieval quality (precision, recall, MRR, hallucination reduction).
  • Experience working in regulated or government environments.
  • Familiarity with MLOps practices and AI lifecycle management.
  • Experience optimizing infrastructure costs for embedding and vector storage workloads.
  • Awareness of AI/ML lifecycle management practices, including model evaluation, monitoring, versioning, governance, and responsible AI considerations in production environments.
  • Familiarity with Model Context Protocol (MCP) concepts and agentic architectures, including tool orchestration, memory management, and multi-step reasoning workflows.
  • Exposure to Knowledge Graph and graph database technologies (e.g., Neo4j, RDF/SPARQL, property graphs) and their application in semantic search, entity resolution, and AI context enhancement.
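The retrieval-quality metrics mentioned above (precision, recall, MRR) are straightforward to compute once per-query relevance judgments exist. A minimal sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of known-good IDs:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank over (retrieved, relevant) pairs: the average
    of 1/rank of the first relevant hit per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```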

Responsibilities

  • Design, build, and manage end-to-end RAG pipelines for enterprise AI applications.
  • Lead preprocessing of unstructured data, including discovery, classification, cleansing, redaction, and metadata enrichment.
  • Develop and optimize document chunking, embedding, and vectorization strategies for structured and unstructured datasets.
  • Coordinate ingestion of curated datasets into vector databases and AI platforms.
  • Package curated unstructured datasets as governed, reusable data products for enterprise consumption.
  • Define and implement metadata tagging strategies to align with Collibra governance standards.
  • Partner with Data Governance and Data Quality teams to ensure AI-ready data meets enterprise standards for lineage, classification, and compliance.
  • Evaluate and optimize embedding models, retrieval strategies, and indexing performance.
  • Monitor and tune RAG pipeline performance, including latency, retrieval accuracy, and cost efficiency.
  • Implement automation for document ingestion, transformation, and publishing workflows.
  • Support integration with enterprise AI platforms (e.g., ChatGPT Enterprise, AskSage, Moveworks).
  • Conduct cost analysis and capacity planning for vector storage and processing workloads.
  • Provide technical guidance on AI data readiness and unstructured data lifecycle management.
  • Design, implement, and optimize enterprise-grade RAG and prompt engineering frameworks, including context engineering strategies (chunking, metadata enrichment, semantic filtering, dynamic context management) to improve retrieval accuracy, grounding, and response quality.
  • Develop and maintain scalable multi-modal data pipelines that ingest, preprocess, embed, and integrate text, documents, images, audio, and structured data into governed vectorized data products consumable by enterprise AI platforms.
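The "dynamic context management" responsibility above amounts to fitting the highest-value retrieved chunks into a fixed context window. A minimal sketch of one greedy strategy, using whitespace token counts as a stand-in for a real tokenizer:

```python
def pack_context(ranked_chunks: list[str], max_tokens: int = 512) -> str:
    """Greedily fill the context window with the highest-ranked chunks
    that fit the token budget. Whitespace splitting approximates a real
    tokenizer; production code would count model tokens instead."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            continue  # this chunk would overflow; a smaller one may still fit
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```

With a 5-token budget and chunks of cost 3, 4, and 1, the packer keeps the first and third chunks and skips the second, preserving rank order among what fits.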

Benefits

  • Employment benefits include competitive compensation, Health and Wellness programs, Income Protection, Paid Leave and Retirement.