Lead Engineer, Network Operations Center

Perficient
Charlotte, NC
Onsite

About The Position

We are seeking a Lead Engineer, Network Operations Center (NOC) to join a large Managed Services program supporting enterprise clients. In this role, you will serve as a hands-on technical leader responsible for real-time (“eyes-on-glass”) monitoring, improving monitoring and alerting quality, accelerating incident triage and restoration, and strengthening operational readiness across critical platforms. You will use industry-leading observability tools, such as Dynatrace, to build actionable dashboards and high-fidelity alerts. You will also partner closely with Site Reliability Engineering (SRE), Observability, Automation, and application/infrastructure teams to prevent incidents, reduce alert noise, and improve service resilience across AWS and Windows/Linux production environments, following ITSM best practices.

This is an opportunity to deliver measurable improvements in system stability and user experience through advanced observability and incident response engineering. If you are passionate about production operations, reliability, and building pragmatic solutions at scale, we encourage you to apply and help raise the bar for resilience and operational efficiency.

Perficient is always looking for the best and brightest talent, and we need you! We’re a quickly growing, global digital consulting leader, and we’re transforming the world’s largest enterprises and biggest brands. You’ll work with the latest technologies, expand your skills, and become a part of our global community of talented, diverse, and knowledgeable colleagues.

Requirements

  • 5+ years of progressive experience in Production Services, SRE/Operations, NOC/Command Center, or related reliability/operations engineering roles.
  • Working knowledge of ITSM principles (incident, problem, and change management) and experience operating within an enterprise incident management process and tooling.
  • Strong hands-on experience building and operating Dynatrace dashboards, alerts, and diagnostics to support eyes-on-glass monitoring and rapid troubleshooting in production.
  • Strong troubleshooting skills in Windows and Linux server environments (services, performance, logs, networking fundamentals).
  • Operational experience supporting workloads in AWS (e.g., EC2, ALB/NLB, RDS, CloudWatch/integrations, IAM basics) and understanding cloud failure modes.
  • Comfort operating in a fast-paced, high-severity incident environment—able to prioritize, stay calm, and communicate clearly under pressure.
  • Experience creating and maintaining runbooks/playbooks and using them during real incidents; comfortable leading technical triage under time pressure.
  • Strong understanding of key technology components and architectural principles across cloud, databases, networking, systems, and applications.
  • Demonstrated troubleshooting and systems thinking skills—able to isolate failures, validate hypotheses, and drive to resolution.
  • Excellent communication and interpersonal skills, with the ability to collaborate effectively across engineering and leadership teams.
  • Demonstrated ability to leverage AI tools to enhance productivity, streamline workflows, and support data-informed task execution.
  • A solid understanding of AI capabilities and limitations, including ethical considerations.
  • Ability to influence without authority and drive changes that improve reliability and operational outcomes.
  • Analytical mindset with the ability to translate operational data into actionable insights and prioritized improvements.
  • Demonstrated success collaborating with globally distributed teams in complex enterprise environments.
  • Strong client-facing or consulting background, with experience driving outcomes in customer-facing engagements.

Nice To Haves

  • Familiarity with AI-enhanced platforms is a plus.
  • Financial services or FinTech experience is a plus.

Responsibilities

  • Serve as a senior technical escalation point during incidents: lead deep-dive triage, coordinate technical containment, and drive restoration activities with domain teams.
  • Improve signal-to-noise by tuning Dynatrace alerts, defining actionable thresholds, and implementing routing/deduplication so responders receive the right alerts at the right time.
  • Author and maintain operational runbooks and incident playbooks in partnership with service owners; ensure they are accurate, testable, and used in practice.
  • Build and enhance Dynatrace dashboards, eyes-on-glass views, and diagnostics (logs/metrics/traces) to shorten time-to-detect and time-to-diagnose for critical services.
  • Troubleshoot production issues across AWS and Windows/Linux environments (compute, networking, storage, OS/application services) and engage the right domain teams with evidence-based hypotheses.
  • Contribute to and/or lead technical root cause analysis for significant or repeat incidents; ensure learnings translate into durable fixes and prevention actions.
  • Analyze incident and alert trends to surface systemic risks, recurring failure modes, and prioritized reliability improvements.
  • Provide clear, timely technical updates during incidents and post-incident reviews; communicate impact, progress, risks, and next steps.
  • Support operational readiness reviews for new services and major changes (monitoring coverage, SLOs/SLIs, runbooks, rollback plans).
  • Mentor engineers and analysts on troubleshooting approaches, observability practices, and incident response fundamentals.

Benefits

  • Information regarding the benefits available for this position is in our benefits overview.