Summer Internship - Data Science / Data Engineering

MKThink • San Francisco, CA

About The Position

We are seeking a Data Science or Data Engineering intern (graduate student preferred, advanced undergraduate considered) to support development of an unstructured data extraction pipeline. The intern will build systems that ingest heterogeneous documents, identify relevant information, map extracted content to a target schema, and improve output accuracy through iterative user feedback.

MKThink is a future-forward design firm grounded in spatial intelligence and dedicated to “build less, solve more.” Our data-informed solutions improve human performance at lower operational, environmental, and capital cost than conventional approaches. Founded in 2000, MKThink practices from the Pacific Edge of San Francisco to the Oceanic Edge of O’ahu. We believe we can help create a better and more sustainable future by creating intelligent spaces that improve quality of life. Our greatest resource is our staff and their ability to contribute fully as teammates and individuals. We bring together thinkers from various disciplines to solve problems at the nexus of architecture, culture, and the environment. Our people have the interdisciplinary skills to contribute to this mission within and across the domains of architecture, strategies, and innovation.

The internship involves building an end-to-end pipeline to extract and structure data from heterogeneous, unstructured documents (e.g., PDFs with high format variance). Work includes document parsing, ML/NLP-based extraction, schema alignment, and confidence scoring. The intern will implement a human-in-the-loop feedback system to iteratively improve accuracy. Targets: at least 90% extraction and mapping accuracy, with convergence in 3 or fewer iterations.

Requirements

Strong Python skills and experience with data pipelines, machine learning, or unstructured data processing

Nice To Haves

Graduate student in Data Science, Computer Science, Engineering, or a related field
Experience working with unstructured data, document intelligence, information extraction, or schema mapping
Comfortable working on applied modeling and data engineering problems with ambiguous inputs and variable document formats
Proactive and highly self-motivated, able to operate independently with minimal guidance and supervision

Responsibilities

Build data pipelines for ingesting and processing unstructured documents, including PDFs with inconsistent structure and content
Develop extraction workflows that combine document parsing, feature engineering, ML/NLP methods, and rule-based logic to identify relevant fields
Design methods to evaluate what content is useful, discard irrelevant content, and align extracted information to a predefined schema
Implement confidence scoring, validation, and error-handling logic to improve extraction accuracy and reliability
Build a human-in-the-loop feedback workflow where users can confirm, reject, or correct extracted fields and trigger reruns that improve the output
