Machine Learning Engineer

Children’s Hospital of Philadelphia • Philadelphia, PA

About The Position

The Campbell Laboratory at the Children’s Hospital of Philadelphia is seeking a Machine Learning Engineer to help advance our mission to diagnose rare genetic diseases more quickly and accurately. We develop and train large language models (LLMs) to better understand clinical data from the electronic health record (EHR) and to identify ways to facilitate accurate, equitable diagnoses for every child, especially those from historically marginalized backgrounds.

As a Machine Learning Engineer, you will work closely with data scientists, clinicians, and other researchers to design, implement, and scale state-of-the-art machine learning workflows. You will use our on-premises GPU/SLURM cluster and cloud-based TPU instances (Google Cloud) to train and deploy LLMs using Hugging Face Transformers, PyTorch, and JAX. This role combines robust software engineering practices with advanced machine learning and natural language processing (NLP) techniques, with a focus on reproducibility and high-quality code.

Our innovative and interdisciplinary environment values diversity, fosters professional growth, and drives impactful research that benefits children worldwide. If you are passionate about building robust machine learning systems, enjoy working on high-impact problems, and thrive in a collaborative research environment, we encourage you to apply.

Requirements

At least three (3) years of experience with progressively more complex data science, applied statistics, machine learning, or mathematical modeling projects.
Proven software engineering experience, including structured development methods, testing, and version control.
Hands-on experience with Python and at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).
Familiarity with relational databases (e.g., Snowflake, BigQuery, Oracle SQL, MySQL).
Experience with Linux/Unix environments, shell scripting, and cluster computing systems (e.g., SLURM).
Bachelor's degree required.
Bachelor's degree in Analytics, Data Science, Statistics, Mathematics, Computer Science, or a related field preferred.
Master's or PhD in Analytics, Data Science, Statistics, Mathematics, Computer Science, or a related field preferred.

Nice To Haves

At least four (4) years of experience with progressively more complex data science, applied statistics, machine learning, or mathematical modeling projects.
At least one year of experience with complex data science, applied statistics, machine learning, or mathematical modeling projects.
Natural language processing experience, particularly in the biological and medical domains.
Experience with transformer architectures and associated software (e.g., PyTorch, TensorFlow, JAX).
Experience using distributed computing technologies.
Experience with cloud virtual machine environments.
Experience implementing distributed training on GPUs or TPUs in cloud platforms (e.g., Google Cloud, AWS, Azure).
Experience with prompt engineering, semantic search, or retrieval-augmented generation (RAG) in a research or production environment.
Familiarity with MLOps pipelines (CI/CD, containerization, monitoring, and logging frameworks).
Exposure to healthcare or biomedical data and associated privacy/security regulations (e.g., HIPAA) is a plus.
Experience with advanced NLP techniques or LLMs in a research or production environment.

Responsibilities

Configure and use the on-premises SLURM cluster with GPU resources to ensure efficient and reliable job scheduling for large-scale model training.
Manage and optimize cloud-based infrastructures (e.g., TPU Pods on Google Cloud) for distributed model training and evaluation.
Collaborate with data scientists to implement and fine-tune LLMs (e.g., Transformer architectures in PyTorch, TensorFlow, or JAX) for clinical and biomedical NLP tasks.
Develop efficient training pipelines, including data loading, preprocessing, feature extraction, and model deployment.
Evaluate model performance and optimize hyperparameters, GPU/TPU utilization, and distributed training strategies.
Collaborate cross-functionally with clinicians, data scientists, analysts, and IT teams to support and enhance machine learning operations (MLOps).
Work with relational databases (e.g., Snowflake, BigQuery, Oracle SQL, MySQL) and distributed storage systems to access and manage EHR data.
Partner with data scientists and domain experts to design data pipelines that integrate with existing hospital systems.
Write clean, well-documented, and maintainable code following best practices.
Contribute to shared code repositories using Git, ensuring reproducibility and version control for collaborative projects.
Develop CI/CD workflows to automate model testing, containerization, and deployment to production environments.
Monitor deployed models for performance drift, latency, and reliability, and implement automated alerts and feedback loops to refine model behavior.
Produce clear technical documentation, including system architecture diagrams, training procedures, and user guides for internal stakeholders.
Present engineering best practices, findings, and process updates to clinicians, researchers, and other non-technical audiences as needed.