Staff Data Engineer

LVT • Seattle, WA

About The Position

LVT is redefining how businesses operate in the physical world, moving beyond traditional security solutions to deliver AI-driven, actionable intelligence that makes sites smarter, safer, and more secure. Since pioneering our first mobile, solar-powered units, our commitment to scrappy, hands-on innovation has made us an established leader and one of the fastest-growing companies in intelligent site technology. We are building the next generation of solutions, from our physical units in the field to a powerful Agentic AI platform, that allows our customers to gain unprecedented visibility and control over safety, compliance, and operations. This is your chance to join a cutting-edge team that isn't just watching the world change, but actively building the technology that is changing it.
We’re a team that’s focused on growth and innovation, and we’re proud that our crew, products, and leadership are being recognized for it.
A Top-Tier Growth Company: Named one of the Financial Times’ Fastest Growing Companies 2025 and #10 on the Inc. 5000 Rocky Mountain Regional list for 2025.
Innovative Leadership: Our CEO, Ryan Porter, was named an EY Entrepreneur of the Year 2025, and our CTO, Steve Lindsey, was inducted into the Silicon Slopes CTO Hall of Fame in 2024.
Product & Software Excellence: We were named one of The Software Report’s Top 100 Software Companies of 2023 and are a winner of the Security Today Govies Award for 2025.
LVT's AI systems are only as good as the data behind them. As we move toward Physical AI, the binding constraint shifts from model architecture to the data flywheel. We are seeking a Staff Data Engineer to own that flywheel end to end, including logs, sensor telemetry, labels and annotations, and evaluation and benchmark sets. Every AI team trains and evaluates from a single stack that transforms data from the raw source through standardized, versioned, governed datasets.
This is a senior individual-contributor and technical-leadership role; formal people management is not required. You will partner closely with AI/ML research, the ML platform / MLOps function (where you own the data side of the contract that defines what a model consumes and emits), and annotation, edge, and infrastructure teams. You should be equally comfortable discussing dataset schema design, storage and partitioning trade-offs for multimodal data, versioning and migration strategy, and the governance controls that keep sensitive video and sensor data safe.

Requirements

8+ years building and operating large-scale data pipelines and data-lake or lakehouse systems in production: ingestion, ETL/ELT, partitioning and storage-format decisions, and the reader/writer libraries consumers rely on.
Has built data pipelines for model training and evaluation, including labeled data and evaluation/benchmark sets, with a working understanding of how data quality and versioning move model results.
Strong experience with medallion-style layered data architectures and modern table/lake formats (e.g. Iceberg, Delta, Parquet, or comparable), including schema evolution and dataset versioning.
Experience with large multimodal data (video, image, sensor/telemetry) and the storage and access patterns that make it queryable at scale (denesting, repartitioning, binary-inline vs. reference storage).
Hands-on experience with the data side of ML frameworks (PyTorch/Lightning dataloaders and Spark), plus strong Python knowledge.
Practical experience enforcing data governance in pipelines (classification, access control, lineage and provenance, retention), particularly for privacy-sensitive data.
A track record of setting data-engineering direction and leveling up engineers (technical leadership; formal management not required).
Bachelor's or Master's in Computer Science, Engineering, or a related field, or equivalent practical experience.

Nice To Haves

Streaming or near-real-time ingestion from edge/IoT sources into a data lake (e.g. Kafka, Lambda, EMR, or similar).
Append-without-rewrite and hash-indexed dataset techniques on open table formats, and dataset/feature-versioning systems.
Generative-AI data work: fine-tuning and evaluation dataset curation for LLMs/VLMs.
Exposing datasets to AI agents through MCP-style query interfaces, with semantic schema and plain-language documentation for retrieval.
Computer-vision / video annotation tooling and workflows (e.g. Encord, Labelbox, or similar).

Responsibilities

Own the end-to-end loop that converts raw edge telemetry and video into labeled training data and frozen evaluation sets, and feeds model outputs back into the next round.
Build and own the pipelines that register raw source data, standardize it into a single well-defined schema, and join and aggregate it into curated datasets so every team trains, validates, and benchmarks from one consistent store through one reader, rather than copying and reformatting data per use case.
Own how labels and semantic annotations are appended to datasets without rewriting source data, then versioned, quality-checked, and served, partnering with annotation and data-operations teams on label production and verification while you own the dataset, storage, and serving side.
Own the frozen, versioned validation and benchmark datasets that make model comparisons valid over time (stable enough that an accuracy delta reflects the model, not a shifting dataset), including the review and scrubbing discipline required before any set is shared externally.
Own schema and content versioning so producers can evolve datasets without breaking consumers: opt-in versions, append-without-rewrite for new fields, and the reader/writer indirection that lets data migrate underneath clients on a controlled rollout instead of forced lockstep migrations.
Own the read/write libraries and integrations researchers depend on (PyTorch/Lightning dataloaders, a simple record-level CRUDL API, and Spark/analytics access and self-service) so AI teams stay focused on model development.
Make governance machine-enforced in the flywheel rather than documented after the fact: classification of clips, frames, labels, and embeddings; scrubbing and anonymization in load jobs; and lineage and provenance for every dataset version, annotation campaign, and training input.
Set the data-engineering standards for the flywheel (schema conventions, dataset contracts, quality gates) and mentor IC work toward them, growing the function as the team forms.