ML Engineer, Network Intelligence

Colt Technology Services

34d•Hybrid

About The Position

Colt Technology Services is a global digital infrastructure company. They are building a Boulder-based AI team as the engineering center of their AI Practice. This team is responsible for how AI is adopted, governed, and scaled across Colt. The AI Practice operates across several parallel workstreams, including use case delivery, AI WAN, and private AI. The Boulder team is where they build, test, and validate before they scale.

AI WAN is Colt's flagship AI product: a software-defined network built to carry AI traffic reliably and at scale. On top of that, they use AI itself to optimize how the network operates: predicting congestion, automating configuration, and ultimately enabling closed-loop autonomous network management. This role is grounded in real network infrastructure, and the Boulder AI Hub is where the intelligence layer gets built.

As ML Engineer, Network Intelligence, you will own the data and modeling layer that turns Colt’s network telemetry into production ML systems. You will work directly with network telemetry data in GCP BigQuery, cleaning and structuring real-world network data and building the data pipeline and feature engineering layer. On that foundation, you will develop classical ML models for anomaly detection, predictive maintenance, traffic forecasting, and capacity planning, and deploy those models into production. Longer term, you will be central to the AI WAN closed-loop architecture, defining what network state the model consumes and what control plane actions it is safe to initiate.

You do not need to be a deep learning researcher or a network engineer, but you need enough network domain knowledge to make sense of the data, and strong ML fundamentals to build models that are trustworthy in a production environment. MLOps tooling experience is a plus but not a day-one requirement. You will have the opportunity to grow into ownership of the model lifecycle layer as the team matures.

Network AI is one of the highest-value applications of ML in enterprise technology, and one of the least staffed with people who understand both sides. This role offers direct access to one of the world’s largest fiber networks, a small team where your work has outsized impact, and a product (AI WAN) that is Colt’s most significant long-term revenue opportunity.

Requirements

Classical ML: Time series analysis, anomaly detection, supervised/unsupervised learning; scikit-learn, XGBoost, PyTorch or equivalent; model evaluation and production deployment experience
Data Engineering: SQL and BigQuery; data pipeline construction; feature engineering from raw telemetry; experience with real-world network data
Cloud & Tooling: GCP (BigQuery, Vertex AI, Cloud Storage); Python; MLOps lifecycle tooling (MLflow, Weights & Biases, Vertex AI Pipelines or equivalent) is a growth expectation: prior experience is a plus, and ownership is where you are headed within six months
Mindset: Comfortable working with ambiguous, incomplete data; understands that network operation requires high trust thresholds before autonomous action; can translate between network engineering and ML concepts

Nice To Haves

Experience with Cisco platforms, NSO, Itential, or similar network orchestration tools
Streaming telemetry (Kafka, Pub/Sub)
OpenTelemetry
Familiarity with network operations or network telemetry data is a plus
SDN experience is a significant advantage for the AI WAN closed-loop work but is not required for Phase 1 delivery

Responsibilities

Build and operationalize ML models for anomaly detection on network time series data
Own root cause analysis (RCA) model development, identifying contributing factors and failure chains in network events, in addition to detecting that something is wrong
Develop predictive maintenance models to forecast hardware failures and network degradation before customer impact
Build traffic forecasting and capacity planning models to support proactive network management
Design model evaluation frameworks appropriate for network operations - precision/recall tradeoffs, false positive costs, operational trust-building
Assess, clean, and structure network telemetry data in GCP BigQuery - the foundational step before any ML is possible
Build data pipelines that transform raw network telemetry into ML-ready features
Work with Colt's NaaS and network operations teams to understand data semantics, quality gaps, and labeling challenges
Define the data access and enrichment roadmap for network AI use cases
Own the full lifecycle of network ML models: experiment tracking, model versioning, retraining pipelines, and production drift monitoring
Define retraining triggers and model health thresholds appropriate for network operations, where a degraded model can have real service impact
Partner with the AI Platform Engineer, who owns the underlying infrastructure; you own the ML layer on top. The boundary is model serving and everything above it (yours) versus Kubernetes, GPU infrastructure, and everything below (theirs)
Work with Cisco and Colt's NaaS team to understand what network state data are available and what control plane APIs exist for programmatic network actions
Define the closed-loop architecture: what inputs feed the model, what decisions it can make autonomously, what requires human confirmation
Build the initial recommendation layer (human-in-the-loop) before progressing to autonomous closed-loop actions
Design guardrails, rollback mechanisms, and confidence thresholds appropriate for production network control
Partner with the Staff AI Engineer to connect ML model outputs to agent orchestration and recommendation systems
Work with the AI Platform Engineer on the handoff boundary: they own Kubernetes, GPU infrastructure, and model serving setup; you own what runs on top, including experiment tracking, retraining pipelines, and production model health
Engage directly with NaaS team and network operations stakeholders to ground use cases in real operational problems