Market Data Engineer

BHFT

Remote

About The Position

The Data Engineering team is responsible for designing, building, and maintaining the Market Data Platform — a lakehouse infrastructure spanning the full path from raw exchange feeds to reliable, petabyte-scale data for research, backtesting, and real-time trading.

Requirements

5+ years building production-grade data systems, with proven expertise architecting and launching data lakes / lakehouses from scratch.
Hands-on experience with Apache Iceberg (or comparable table formats — Delta / Hudi): partitioning, schema evolution, snapshots, compaction, and catalog operations; familiarity with Apache Arrow for zero-copy, columnar in-memory interchange.
Experience with market data and/or network packet capture — decoding pcap, exchange feed protocols (ITCH, FIX/FAST, multicast UDP), order-book reconstruction, and time-series at scale (strong plus; willingness to learn required).
Experience normalizing market data from multiple vendors — e.g. OneTick, Refinitiv/Reuters, Bloomberg, ICE — into a unified schema and symbology (strong plus).
Expert-level Python (incl. Polars and/or PySpark); Rust a strong plus (relevant for high-performance capture/decoding).
Experience with modern orchestration (Airflow) and distributed processing (Apache Spark).
Advanced SQL: complex aggregations, window functions, query optimization, partition pruning.
Solid fundamentals in Linux, containerization (Docker, Kubernetes / EKS), and cloud object storage (AWS S3).
DevOps & observability: CI/CD, infrastructure-as-code (Terraform), GitOps (ArgoCD), and metrics/dashboards/alerting (Grafana, Prometheus).
Strong grasp of structured, unstructured, and binary data, and of storage optimization: partitioning, compression, cost management.
English fluency for documentation and collaboration in an international team.

Responsibilities

Own the full capture path from wire to lake: decode and normalize raw exchange feeds (pcap, multicast UDP / ITCH / FIX) and vendor sources (OneTick, Refinitiv, Bloomberg, ICE) into a unified canonical model with nanosecond timestamps.
Build batch and streaming pipelines (Airflow, Spark, dbt) for tick and reference data.
Own L2/L3 order-book reconstruction with gap handling.
Provide Python and Rust producer SDKs for internal feed handlers.
Own the Iceberg-over-S3 lakehouse: design partitioning, sort orders, and row-group layout for fast scans; manage schema evolution, snapshots, time travel, compaction, and TTL.
Maintain reference data as slowly-changing tables with point-in-time correctness for backtests.
Drive storage cost optimization via compaction, tiering, and snapshot expiry.
Build libraries for schema management, data contracts, validation, and lineage on top of the Iceberg catalog.
Develop shared access services (Spark + Polars) so research, backtesting, and trading share one normalized data layer, including gap detection and pcap-vs-lake reconciliation.
Embed monitoring, alerting, SLAs/SLOs, and CI/CD across capture and pipeline layers on Kubernetes (EKS).
Own data-quality dashboards and incident runbooks for the capture fleet.
Partner with Quant Research, Data Science, Backend, and DevOps to translate requirements into platform capabilities and champion market-data engineering best practices.