Data Engineer

City of New York
New York City, NY
Hybrid

About The Position

Our Data Analytics Unit, embedded in the Policy & Community Affairs Division, works at the intersection of data infrastructure and public policy. Much of our work starts with a policy question rather than a specification. The data (and analytics) engineering focus is on building and maintaining the infrastructure that makes good policy analysis possible - everything from inspecting raw file submissions and initial ETL to designing pipelines and creating silver and gold tables that streamline analysis. Our goal is to make sure what we produce is trustworthy: well-documented, reproducible, and reliable over time.

We're a small team of analysts and engineers, so you'll collaborate closely with both. The data itself is rich: billions of trip records, GPS breadcrumb traces for every for-hire vehicle in the city, and detailed session data across all major platforms. The infrastructure we've built - Databricks, Delta Lake, Azure - lets us process this data quickly and consistently at scale, so we can focus on policy impact rather than be bottlenecked by compute.
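
To make the silver-layer step concrete, here is a minimal sketch assuming a Databricks/Delta Lake environment like the one described above. The table and column names (trips_bronze, trips_silver, trip_id, pickup_datetime, trip_miles) are hypothetical illustrations, not the unit's actual schema.

    # Bronze -> silver: standardize types, drop malformed rows, deduplicate.
    # All table and column names below are assumed for illustration only.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Raw trip submissions land in the bronze layer as-is.
    bronze = spark.read.table("trips_bronze")

    silver = (
        bronze
        .withColumn("pickup_datetime", F.to_timestamp("pickup_datetime"))
        .withColumn("pickup_date", F.to_date("pickup_datetime"))
        .withColumn("trip_miles", F.col("trip_miles").cast("double"))
        .filter(F.col("trip_miles") > 0)   # drop clearly malformed trips
        .dropDuplicates(["trip_id"])       # one row per trip
    )

    # Persist as a Delta table, partitioned for fast time-range scans.
    (
        silver.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("pickup_date")
        .saveAsTable("trips_silver")
    )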

Requirements

  • Master's degree from an accredited college or university with a specialization in an appropriate field of physical, biological, environmental or social science.
  • At least three years of responsible full-time research experience in the appropriate field of specialization.
  • US work authorization (no visa sponsorship available).
  • Familiarity with processing large data streams.
  • Ability to write clean, readable Python and SQL.
  • Comfort with ambiguity.
  • Ability to use LLMs productively without copy-pasting code you don't understand.

Nice To Haves

  • Analytical curiosity.
  • Interest in data quality.
  • Understanding of what analysts and scientists are trying to do with data.
  • Habit of using version control, modular code, and documentation.

Responsibilities

  • Building and maintaining the infrastructure that makes good policy analysis possible.
  • Inspecting raw file submissions and performing initial ETL.
  • Designing pipelines.
  • Creating silver and gold tables that streamline analysis (a minimal gold-layer sketch follows this list).
  • Ensuring that produced data is trustworthy, well-documented, reproducible, and reliable over time.
  • Processing large data streams.
  • Resolving data quality issues by reaching out to data providers and explaining upstream problems.
  • Taking vague requests and making progress without waiting for a perfect spec.
  • Writing clean and readable Python and SQL code.
  • Using version control, writing modular code, and maintaining documentation.
  • Presenting to senior staff and external stakeholders.
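
As referenced above, a hedged sketch of a gold-layer rollup built on the hypothetical trips_silver table from the earlier example; the gold table name and daily grain are illustrative assumptions, not the team's actual design.

    # Silver -> gold: an analyst-ready daily rollup, expressed in SQL from
    # Python since the role calls for both. Names are assumed, not real.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        CREATE OR REPLACE TABLE trips_gold_daily AS
        SELECT
            pickup_date,
            COUNT(*)        AS trip_count,
            SUM(trip_miles) AS total_miles
        FROM trips_silver
        GROUP BY pickup_date
    """)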

Benefits

  • Eligible for remote work two days per week.