Principal Data Engineer

Octus · New York, NY

About The Position

Octus is seeking a Principal Data Engineer to roll up their sleeves and build scalable, production-grade data pipelines and infrastructure. You'll be a hands-on technical leader: writing code daily, solving hard engineering problems, and elevating the team around you by example. Working across Snowflake, Databricks, and AWS, you'll be deeply involved in the day-to-day development of the data platform that powers Octus's products, data, and automation initiatives. The ideal candidate is an expert in Python and SQL who thrives in an execution-focused environment and has deep experience building modern data pipelines and lakehouse solutions.

Requirements

  • Strong foundation in software engineering principles, including SOLID design, modularity, and scalability.
  • Expert proficiency in Databricks, including Delta Lake, Unity Catalog, Delta Live Tables, MLflow, and Databricks Workflows.
  • Deep experience with Snowflake, including data modeling, performance optimization, and integration with upstream/downstream pipeline tooling.
  • Expert proficiency in Python for data pipeline and automation development.
  • Advanced SQL skills with experience optimizing complex queries and data models at scale.
  • Proven experience designing and maintaining cloud-native data pipelines on AWS (e.g., MWAA/Airflow, Lambda, ECS, SQS, Glue, S3, Redshift); a representative sketch follows this list.
  • Experience implementing and managing Terraform or similar IaC frameworks.
  • Strong understanding of lakehouse architecture patterns, data ingestion, transformation, and orchestration, including familiarity with ML/AI pipeline integration patterns.
  • Familiarity with CI/CD pipelines, automated testing, and modern DevOps practices.
  • 8+ years of experience in data engineering or backend development, with a focus on scalable data solutions.
  • Demonstrated experience leading data infrastructure projects end-to-end and mentoring senior engineers.
  • Familiarity with containerization (Docker) and workflow orchestration best practices.
  • Excellent communication, collaboration, and problem-solving skills.
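
To give a flavor of how these requirements show up day to day, here is a minimal, hypothetical sketch of a cloud-native pipeline of the kind that might run on AWS MWAA. It assumes Airflow 2.4+ with the TaskFlow API; the DAG, task, and record names are invented for the example and are not Octus code.

```python
# A minimal, hypothetical Airflow DAG (Airflow 2.4+, TaskFlow API).
# All names here are illustrative placeholders, not actual Octus code.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    tags=["illustrative"],
)
def raw_ingest_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder for pulling records from an upstream API or feed.
        return [{"id": 1, "value": 42.0}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Placeholder transformation: drop malformed rows.
        return [r for r in records if "id" in r and "value" in r]

    @task
    def load(records: list[dict]) -> None:
        # Placeholder for a write to S3, Snowflake, or Delta Lake.
        print(f"Loaded {len(records)} records")

    load(transform(extract()))


raw_ingest_pipeline()
```

The extract/transform/load split shown here is the shape of the work; in practice each task would call the Snowflake, Databricks, or AWS services named above.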

Nice To Haves

  • Experience with streaming data technologies (Kafka, Kinesis, Flink); see the streaming sketch after this list.
  • Exposure to ML/AI pipeline patterns (feature stores, experiment tracking, model serving) and MLOps tooling, particularly in a cross-functional team environment.
  • Experience integrating data quality and observability tools.
  • Experience with Databricks as a data sharing and collaboration platform (Delta Sharing, Marketplace).
  • Familiarity with Claude Code or similar AI-powered developer tools for accelerating pipeline development and code workflows.
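
As an illustration of the streaming work mentioned above (a hypothetical sketch, not Octus code): a Spark Structured Streaming job reading from Kafka and landing records in a Delta table might look like the following, assuming a Databricks or Delta-enabled Spark runtime. The broker, topic, and path names are placeholders.

```python
# A minimal, hypothetical Kafka-to-Delta streaming job.
# Broker, topic, and path names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("illustrative-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "raw-events")                 # placeholder topic
    .load()
    # Kafka keys/values arrive as bytes; cast to strings for downstream use.
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/raw-events")  # placeholder
    .outputMode("append")
    .start("/tmp/delta/raw_events")  # placeholder table path
)

query.awaitTermination()
```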

Responsibilities

  • Build and maintain end-to-end data pipelines — from raw ingestion through transformation and delivery — across diverse data sources (APIs, web data, internal feeds, etc.).
  • Develop scalable, production-grade pipelines hands-on within Databricks, including Delta Lake table management, Workflows, and cluster optimization.
  • Build and maintain data models, schemas, and transformation logic in Snowflake, optimizing for performance and reliability.
  • Develop and manage Databricks environments including Unity Catalog, Delta Live Tables, and integration patterns that support both internal data consumers and external sharing use cases.
  • Build and manage orchestration workflows using AWS services (MWAA/Airflow, Lambda, ECS, SQS, MSK) and Databricks-native orchestration where appropriate.
  • Implement and maintain infrastructure as code (IaC) using Terraform, ensuring reproducibility and compliance with cloud standards.
  • Establish and enforce best practices in data modeling, schema design, and ETL/ELT processes for high-volume structured and semi-structured data across Snowflake and Databricks.
  • Ensure data quality, lineage, and observability through automated testing, monitoring, and alerting across all pipeline layers (see the quality-check sketch after this list).
  • Collaborate closely with technology leadership to align data platform development with business strategy and product goals.
  • Stay at the forefront of industry trends in data engineering, lakehouse architecture, and cloud-native data platforms.
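
To illustrate the data-quality responsibility above, here is a minimal, hypothetical quality gate of the kind a pipeline layer might run before publishing a table. The column names, thresholds, and failure behavior are invented for the example; a real pipeline would route failures to monitoring and alerting rather than printing.

```python
# A minimal, hypothetical data-quality gate. Column names and the sample
# batch are illustrative, not Octus's actual rules or data.
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows: list[dict], column: str) -> CheckResult:
    # Fail if any row is missing a value in the given column.
    nulls = sum(1 for r in rows if r.get(column) is None)
    return CheckResult(f"not_null:{column}", nulls == 0,
                       f"{nulls} null value(s) in '{column}'")


def check_unique(rows: list[dict], column: str) -> CheckResult:
    # Fail if the given column contains duplicate values.
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return CheckResult(f"unique:{column}", dupes == 0,
                       f"{dupes} duplicate value(s) in '{column}'")


if __name__ == "__main__":
    batch = [{"id": 1, "value": 10.0}, {"id": 2, "value": None}]
    results = [check_not_null(batch, "value"), check_unique(batch, "id")]
    failed = [r for r in results if not r.passed]
    for r in failed:
        print(f"FAILED {r.name}: {r.detail}")
    # In a real pipeline this would raise or page instead of exiting.
    if failed:
        raise SystemExit(1)
```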

Benefits

  • Competitive health benefits
  • Matched 401(k) and pension plans
  • PTO
  • Generous parental leave
  • Gym subsidies
  • Educational reimbursements for career development
  • Recognition programs
  • Pet-friendly offices (US only)

What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Education Level: Not specified
  • Number of Employees: 101-250
