Machine Learning Engineer (GoLang)

Comcast • Washington, DC

About The Position

The Multimodal Analysis Framework (MAF) is an end-to-end platform designed to process diverse content sources (video, images, audio, and documents) and generate rich, structured metadata. The platform unifies multiple ML/AI models to extract curated insights at scale, tailored to specific business needs. MAF supports both on-demand workloads (batch uploads, ad-hoc analysis) and real-time streaming workflows, enabling continuous metadata generation for live content streams. Customers can define their metadata requirements (such as entity extraction, scene segmentation, object detection, transcription, summarization, or multimodal correlation), and the framework orchestrates the appropriate models and toolchains to deliver high-quality outputs. Through flexible APIs and UI-based workflows, customers and internal teams can visualize metadata, trigger enrichment, monitor processing, and integrate results into downstream applications. The platform emphasizes modularity, scalability, and extensibility to support new ML models, LLM-based agents, and cross-modal inference as use cases evolve.

We are looking for a mid-level Backend Engineer to join our Machine Learning Platform team. This role focuses on building scalable backend systems that power ML workloads, including video, image, and document processing, and enable LLM-driven applications through agents and MCP servers. You will work primarily in Golang, deploy and operate services on Kubernetes, manage infrastructure with Terraform, and build on AWS. A core part of the role is designing platform capabilities that allow LLMs to safely and reliably interact with tools, data, and services via agent frameworks and MCP servers.

Requirements

3–6 years of professional software engineering experience.
Strong backend engineering experience with Golang.
Experience building and operating APIs (REST and/or gRPC) in production.
Hands-on experience with Kubernetes in production environments.
Experience using Terraform for infrastructure provisioning and deployment.
Solid working knowledge of AWS cloud services and core architectural concepts.
Experience building or supporting ML processing pipelines (video, image, or document).
Practical experience using LLMs in production systems.
Experience developing agents and/or MCP servers, or equivalent tool-integration platforms.

Nice To Haves

Experience with Milvus or other vector databases in production.
Familiarity with GPU-backed workloads and ML inference optimization.
Experience with messaging/streaming systems (Kafka, SQS, SNS, etc.).
Knowledge of secure system design for AI platforms (IAM, secrets management, least-privilege access).
Experience working on internal developer platforms or ML infrastructure teams.

Responsibilities

Backend Engineering (Golang)
Design, build, and maintain high-performance backend services in Golang for ML and AI platform use cases.
Develop REST and gRPC APIs for inference, processing pipelines, orchestration, and platform services.
Implement asynchronous and distributed processing patterns (workers, queues, event-driven systems).
Ensure backend services meet production standards for scalability, reliability, and security.

ML Platform & Processing Pipelines
Build and operate backend systems supporting:
  Video processing (frame extraction, metadata generation, embeddings, indexing).
  Image processing (OCR, classification, detection, embedding generation).
  Document processing (parsing, layout analysis, chunking, OCR, retrieval pipelines).
Integrate ML inference services into backend workflows with attention to latency, throughput, and cost.
Work closely with ML engineers and data scientists to productionize models and pipelines.

LLMs, Agents, and MCP Servers
Build LLM-enabled backend services using structured prompting, tool/function calling, and retrieval-augmented generation (RAG).
Design and implement agentic workflows (multi-step reasoning, tool orchestration, retries, guardrails).
Develop and operate MCP servers that expose internal platform capabilities (search, retrieval, processing, data access) to LLM-based applications.
Enforce security, access control, and observability for agent and MCP interactions.

Vector Search & Retrieval
Design and maintain vector-based retrieval systems using Milvus.
Implement embedding ingestion, indexing, and query pipelines at scale.
Optimize retrieval quality, latency, and relevance for downstream LLM applications.

Cloud, Kubernetes & Infrastructure
Deploy and operate backend and ML services on Kubernetes (scaling, rollouts, resource management).
Use Terraform for infrastructure provisioning and continuous delivery of cloud resources.
Build and operate primarily on AWS, leveraging services such as:
  Compute, networking, and IAM
  Object storage
  Managed Kubernetes
  Logging and monitoring services

Reliability, Quality & Operations
Implement observability using logs, metrics, and traces; define SLOs and alerts.
Write automated tests (unit, integration) and contribute to CI/CD pipelines.
Participate in on-call rotations and incident response; drive post-incident improvements.