Data Engineer

Child Mind Institute•New York, NY

Hybrid

About The Position

We're dedicated to transforming the lives of children and families struggling with mental health and learning disorders by giving them the help they need. We've become the leading independent nonprofit in children's mental health by providing gold-standard evidence-based care, delivering educational resources to millions of families each year, training educators in underserved communities, and developing tomorrow's breakthrough treatments.

As part of the Center for Data Analytics, Innovation, and Rigor team, you will report to the Rubric Engineering and Measurement Specialist. You will develop infrastructure to support large-scale AI evaluation frameworks. You will design scalable data pipelines for generating and processing synthetic data, implement secure data storage solutions, and create infrastructure for real-time model evaluation and monitoring. You will use common frameworks, platforms, and languages, such as Python, SQL, GitHub, containerization tools (e.g., Docker, Kubernetes), and cloud computing infrastructures (e.g., AWS, Azure), to build robust and scalable data infrastructure that supports our AI research initiatives.

This is an exempt, full-time, hybrid position located in our NYC headquarters office or other relevant location. This position requires a minimum of four (4) days per week in the office, on a schedule determined by your supervisor. The in-office requirement and schedule are subject to change based on the needs of the program and the organization.

Requirements

Master's degree in Neuroscience, Psychology, Engineering, or Computer Science, or an equivalent combination of education and experience, is required.
5+ years of experience in data analysis and data science fundamentals (e.g., algorithms, data structures, data visualization, machine learning), preferably in a clinical or research setting.
5+ years of experience in at least one scientific programming language (e.g., Python, R, MATLAB) and related toolboxes or frameworks (e.g., Tidyverse, SciPy, scikit-learn, Polars, PyTorch) is required.
5+ years of experience working in a Linux environment and using version control systems (e.g., GitHub) and software virtualization platforms (e.g., Docker).
5+ years of practical experience in Extract, Transform, Load (ETL) processes and database management languages (SQL, NoSQL), and familiarity with associated cloud computing services and frameworks (AWS, Azure, Terraform).

Responsibilities

Create and maintain scalable data pipelines for efficient storage and retrieval of multimodal data, with particular emphasis on clinical, natural language, and multi-turn response data.
Create pipelines for data transformation, preprocessing, and management.
Ensure data quality, security, and compliance with privacy regulations for handling sensitive data.
Perform quality assurance of pipelines/processes to maintain integrity throughout the data lifecycle.
Create interactive visualizations and dashboards to communicate data insights and pipeline performance metrics.
Write documentation and relevant text for scientific, clinical, or public dissemination of knowledge.
Perform additional job-related duties as assigned.