Data Engineer

Minerva
New York, NY
Onsite

About The Position

Minerva is building the next-gen demand generation platform for consumer brands, providing a centralized view of the consumer and tooling to simplify existing sales, marketing, and customer success infrastructure. Leveraging proprietary consumer data and ontology, Minerva creates an observable, holistic consumer profile over time, using decisioning models to power optimal activation throughout the funnel and driving revenue and commercial efficiency for brands.

Minerva creates high-ROI outcomes for high-equity brands across industries (e.g., Ramp, JuiceBox, NBA, Miami Dolphins, Wander, Luxury Presence, Trust & Will, Base Power, Shef) and has large-scale distribution agreements with partners such as Experian and Clay. The company was founded by successful quantitative researchers from Citadel and Bridgewater, with products and models built by individuals from MIT, Stanford, Berkeley, Cambridge, Uber, Samsung Research, Third Point, and Lazard.

The platform offers enrichment and person search/segmentation tools, first-party & third-party data unification, audience generation, marketing creative analysis, and more, with AI Agents orchestrating most features. Minerva helps businesses better understand their existing customers and find net-new customers by providing ground truth data, modeled attributes/behaviors, and custom-built targeting models.

Minerva is data-first and is seeking a Data Engineer ready to evolve and build from scratch to enhance its data product of 260M+ consumers. This role offers unlimited impact on the business, working closely with a team of engineers.

Requirements

  • 2-4+ years working as a data engineer or software engineer in a data-heavy context
  • Highly proficient at Python and SQL
  • Strong intuition for data engineering principles, especially those around data cleaning/ingestion and data modeling
  • Willingness to work in office in NYC (we provide a relocation package)
  • Flexibility and openness to wearing several hats
  • Eagerness to learn and grow with the company and your coworkers

Nice To Haves

  • Exposure to different data sources (e.g. APIs, S3, SFTP, WWW, etc.)
  • Experience with Lakehouse architectures & comfort handling XX+ TB datasets
  • Experience working with transformation and orchestration tools like dbt, SQLMesh, and Airflow
  • Experience working with both transactional databases (e.g. Postgres, MySQL) and analytical databases (e.g. Snowflake, Redshift), with a bias towards the latter
  • Familiarity with backend & ML/AI engineering is a plus
  • Experience working with AI coding tools, e.g. Cursor, Claude Code, OpenCode
  • Prior work at a startup

Responsibilities

  • Architect and build scalable & robust distributed infrastructure enabling Spark, Lakehouse, ML, and more, across a variety of datasets ranging from website visit feeds and professional & property data to identity graphs
  • Improve our existing orchestration architecture: we’ve outgrown our SQLMesh/DBT transformation infra and are looking to expand our footprint into higher throughput / scalable solutions.
  • Data movement: keeping our Snowflake warehouse & Lakehouse in sync with our Postgres, Elasticsearch and other product backend destinations
  • Build systems to expose our data product to internal and external AI agents (think: MCP, vector DB)
  • Innovate on how we extract value out of 100+ TB datasets in an unsupervised manner
  • Innovate on how we can systematize our pipelines within an ever-growing transformation layer
  • Innovate on how we build the right access patterns to leverage data and meet our modeling and product goals
  • Innovate on how we can directly monetize our data
  • Enable new ways to realize the value of Minerva’s data while ensuring our existing revenue streams stay in good standing
  • Learn more about our modeling processes and assist the Data Science team with feature engineering & ML training infrastructure
  • Productionize a multi-faceted pipeline of web traffic data touching LLMs, ML models, Identity Resolution, and Vector Search
  • Orchestrate large-scale data gathering of publicly available data in a near real-time fashion via web scraping
  • Build a field- and row-level lineage graph across our data platform, then use it to power a reactive propagation system — when upstream data changes, affected downstream transformations are automatically identified and re-executed in real-time.

Benefits

  • 100% medical, dental & vision insurance coverage and more
  • Unlimited PTO
  • Parental leave
  • Relocation support
  • 1-2 team retreats / year