Software Development Engineer

Apple•New York, NY

8h•Onsite

About The Position

Design, develop, and maintain large-scale distributed systems supporting content ingestion and delivery for digital media products. Build automation and observability frameworks to ensure reliability, scalability, and performance of production services. Collaborate with software engineering teams to optimize application architecture, including traffic routing, load balancing, caching, and other industry-standard practices. Automate deployment pipelines, design telemetry for real-time visibility into service health, and perform root-cause analysis to enhance reliability and operational efficiency. Partner with product and infrastructure teams to define service-level objectives (SLOs), drive incident response improvements, and evolve the reliability roadmap for mission-critical systems.

Requirements

Using programming languages such as Python, Java, Golang, or C++ to design and implement backend systems and automation tools supporting large-scale content ingestion pipelines.
Utilizing Kubernetes and containerization technologies, including Docker, to design, deploy, and operate microservice-based distributed systems in production environments.
Configuring network gateways, ingress controllers, and load balancers (such as NGINX or Envoy) to manage service traffic and optimize application entry points.
Using continuous integration and delivery (CI/CD) tools, such as Jenkins or GitHub Actions, to automate software builds, testing, and multi-environment deployments.
Designing and implementing telemetry and observability systems, using Prometheus and PromQL to collect and query metrics and Grafana to build dashboards that visualize performance trends and service reliability.
Designing and implementing distributed data processing systems using cloud platforms such as AWS or Google Cloud Platform (GCP).
Using structured and unstructured data stores, such as PostgreSQL and MongoDB, to support ingestion, transformation, and retrieval of high-volume datasets.
Implementing infrastructure as code (IaC) using Terraform or Pulumi to provision and manage cloud infrastructure resources.
Conducting root-cause analysis and post-incident reviews to identify reliability risks and drive service-level improvements.
Applying site reliability engineering principles, including capacity planning, error budgets, and system health monitoring, to maintain service uptime and performance.