Lead Data Science Engineer

Dassault Systèmes
New York, NY
Posted 2d ago
$135,000 - $180,000
Hybrid

About The Position

Medidata is powering smarter treatments and healthier people through digital solutions to support clinical trials. Celebrating 25 years of ground-breaking technological innovation across more than 36,000 trials and 11 million patients, Medidata offers industry-leading expertise, analytics-powered insights, and one of the largest clinical trial data sets in the industry. More than 1 million users trust Medidata's seamless, end-to-end platform to improve patient experiences, accelerate clinical breakthroughs, and bring therapies to market faster. Discover more at www.medidata.com.

Medidata is looking for individuals who will help us tackle some of the most complex questions facing the industry today using our proprietary platform and advanced analytics. At Medidata, we never work alone. This role will partner heavily with all of the key stakeholder functions, including product, delivery, data science, engineering, partnerships, and biostatistics.

Successful Medidata AI candidates will be skilled in analytical and quantitative thinking and structured communication, and excited about building the next horizon of Medidata's mission to power smarter treatments and healthier people. You will report to the Director, Data Engineering.

Requirements

  • Bachelor's degree in a technical or scientific field, such as Statistics, Data Science, Computer Science, or similar
  • 7+ years of experience in roles such as Data Scientist or Data Engineer with a strong foundation in Enterprise Data Architecture and Engineering
  • Hands-on experience with tools and concepts such as Airflow, CDC, batch processing, and job scheduling.
  • Hands-on experience with data curation, cleansing, and annotation to support model fine-tuning and evaluation workflows.
  • Experienced in building scalable, cloud-native data pipelines using tools and services like Streamlit, Snowflake and containerization platforms like Docker/Kubernetes.
  • Proficient in Git/GitHub, GitHub Actions for CI/CD, and managing infrastructure as code using Terraform
  • Hands-on experience building high-throughput data pipelines across cloud platforms and MCP server environments.
  • Proficient in implementing RAG architectures, vector databases, and low-latency retrieval layers.
  • Skilled at integrating AI/ML pipelines into production-grade data infrastructure.

Nice To Haves

  • Experience with clinical trial data is not required, but an interest in learning how these data improve medical research is paramount.

Responsibilities

  • Apply advanced skills in data architecture, data science engineering, data modeling, and data quality using modern cloud-native technologies.
  • Develop ETL pipelines and work with vector databases, automation, and CI/CD, using tools such as Python, SQL, and Git.
  • Develop LLM applications using Retrieval-Augmented Generation (RAG) and support fine-tuning for domain-specific tasks.
  • Analyze and manipulate both structured and unstructured data sources, ensuring high data quality and readiness for downstream consumers.
  • Document and communicate technical work clearly to stakeholders at all levels, both technical and non-technical.
  • Collaborate effectively in Agile environments and cross-functional teams to build secure, scalable data pipelines into Snowflake from both on-premises and cloud-based sources.

Benefits

  • Medidata believes that benefits should connect you to the support you need when it matters most and provides best-in-class benefits, including medical, dental, life and disability insurance; 401(k) matching; flexible paid time off; and 10 paid holidays per year.