Senior VoIP Operations & Reliability Engineer (Carrier-Class Voice Platform)

Planet Networks • Fredon Township, NJ

About The Position

Our software team is building a next-generation carrier-class voice platform. They are strong programmers, but they are not experienced operators, and there is a world of difference between code that works and infrastructure that stays up under real carrier load. We need a seasoned operator to close that gap and work hand in hand with the development team.

You are the person who has actually run this kind of system in production. You know the failure modes that do not show up in a code review, the things that break at 2am, and what it really takes to keep customers from ever noticing. Your job is to bring that operational reality into the platform from the inside: pairing with the programmers as they build, making sure the design can be operated, and then owning the platform in production with zero customer downtime.

There is an architectural side to this. You will sit in design reviews and push the team toward decisions that are operable, resilient, and testable, not just elegant in code. But the core of the role is operational: you are the experienced hand who keeps every system up, who owns every failure scenario end to end, and who instills operational discipline in a team that has not had to live with a pager before. You should be equally comfortable pairing with a developer to make a service observable and failure-aware, and at 3am driving an incident to resolution. We need that judgment, with years of real VoIP operations behind it.

In the meantime, this is not a future-only role. We already run a live Kamailio and Asterisk production system carrying real customer traffic today, and your first and most immediate mandate is to help harden it: shore up its reliability, close its failure gaps, and keep it solid while the next-generation platform is being built. Day-to-day production stability of the current system comes first.

Requirements

Years of senior, hands-on experience operating and reliability-engineering production VoIP systems at carrier scale.
Deep, protocol-level command of SIP: dialogs, transactions, registration, NAT scenarios, SDP negotiation, forking, and the failure modes that surface only under load.
Expert-level Kamailio and/or OpenSIPS: routing logic, dispatcher and load balancing, registrar and usrloc, dialog and topology modules.
Expert-level Asterisk: PJSIP stack, dialplan, ARI/AMI, bridging and media handling, and its role as an application and media server behind a SIP proxy.
Media plane fluency: RTP, SRTP, RTSP, RTCP, codecs (G.711, G.729, Opus), transcoding, jitter, and the link between QoS marking (DSCP) and call quality.
A demonstrated track record of designing for and operating reliability, scalability, and fault tolerance in carrier-class environments (five-nines thinking, failure-domain isolation, blast-radius control).
Hands-on reliability engineering practice: SLOs and error budgets, incident command, postmortems, runbooks, and DR testing.

Nice To Haves

Performance and failure testing tooling: SIPp for load and call modeling, fault injection and chaos tooling, and SIP troubleshooting with sngrep and Wireshark.
Observability depth with Homer/HEP, plus metrics and alerting stacks (for example Prometheus, Grafana, or equivalent).
Strong Linux operations and automation skills (Python, Lua, shell), and comfort with infrastructure-as-code and CI/CD pipelines.
RADIUS/Diameter integration for AAA, and experience with provisioning and subscriber management.
Fraud and security operations: detecting and stopping toll fraud, SIP scanning, and registration attacks.
Experience interconnecting with multiple upstream carriers and managing the routing and failover complexity that brings.
FreeSWITCH or other media servers as a complement to Asterisk.

Responsibilities

Harden the current production system (immediate priority)
Take ownership of the reliability of our live Kamailio and Asterisk production system from day one, while the next-generation platform is still in development.
Assess the current system end to end and find its weak points: single points of failure, brittle failover, missing redundancy, capacity headroom, and the failure scenarios it does not yet handle gracefully.
Close those gaps incrementally and safely, without disrupting live customer traffic: add redundancy and failover, tighten configuration, and remove fragility.
Add the observability the current system is missing so problems are caught before customers feel them, and stand up alerting, dashboards, and SIP capture against the live fleet.
Stabilize day-to-day operations: triage and resolve recurring issues, document the system as it actually runs, and write the runbooks that do not exist yet.

Work hand in hand with the development team
Pair with the programmers throughout development as the operational voice in the room: review designs, challenge assumptions, and find the failure modes that code reviews miss.
Make operability a build-time requirement, not an afterthought: push for the logging, metrics, health checks, graceful shutdown, retry behavior, and failure handling that the team needs to add for the platform to survive production.
Transfer operational knowledge to the team: help developers understand how their code behaves under load and failure, and raise the whole group's instinct for production reality.
Map the full failure surface of the platform (node failure, data-center loss, upstream carrier outage, registration storms, partial network partitions, resource exhaustion) and make sure every scenario has a defined, tested behavior.
Design and run a rigorous test program: functional, load, stress, soak, and failover testing, with realistic call models (concurrent calls, BHCA, registration churn).
Build fault-injection and chaos testing into the pipeline so failure handling is proven, not assumed.
Validate the high-availability and scalability design under real conditions: active-active and active-passive topologies, geographic redundancy, graceful degradation, automated failover with measured recovery times, and capacity limits.

Keep it up (day-to-day reliability engineering)
Own platform uptime as a daily responsibility, not a quarterly goal. Customers should experience no downtime.
Build and own the observability stack: SIP capture (HEP/Homer), CDR and quality pipelines, metrics, dashboards, and alerting that catches problems before customers do.
Define SLOs and SLIs for signaling, media, and registration, and hold the platform to them.
Run incident response: detect, triage, mitigate, and resolve, then drive blameless postmortems and make sure the same failure cannot recur.
Write and maintain runbooks, and lead disaster-recovery and failover drills so the team can execute under pressure.
Participate in (and help design) a sustainable on-call rotation.
Tune and operate the production fleet: Asterisk, Kamailio, OpenSIPS, and the supporting network layer, under live carrier traffic.