NOC Engineer – Tier 2 (Managed Services / MSP)

Architecture in Motion Inc.

Remote

About The Position

We are seeking a highly skilled and proactive Tier 2 NOC Engineer to support and operate a multi-tenant Managed Services environment across diverse customer verticals. This role is responsible for incident response, advanced troubleshooting, performance optimization, and proactive monitoring across infrastructure, systems, and applications. You will work in a fast-paced NOC environment leveraging tools such as Zabbix, Site24x7, Ansible, and modern observability practices to ensure high availability, reliability, and performance of customer environments.

Requirements

3+ years of experience in a NOC, MSP, or IT operations environment.
Experience working in a 24/7 NOC or MSP environment is required.
Strong experience with monitoring tools such as Zabbix (or similar).
Solid expertise in Windows Server (Active Directory, DNS, DHCP) and Linux administration (CLI, networking, logs).
Experience with database troubleshooting and networking fundamentals (TCP/IP, routing, firewalls).
Familiarity with cloud platforms (Azure preferred) and virtualization technologies.
Strong troubleshooting and analytical-thinking skills.
Ability to handle high-pressure incidents and manage multiple concurrent tickets.
Understanding of ITIL practices including Incident, Problem, and Change Management.
Hands-on experience with at least one of the following: Python, PowerShell, or Bash.

Nice To Haves

Experience integrating tools via APIs is a strong plus.
Experience with container platforms (Docker, Kubernetes), CI/CD pipelines, and Infrastructure as Code (Terraform).
Exposure to security tools or SOC environments and log management platforms such as Elastic or OpenSearch.

Responsibilities

Act as the Tier 2 escalation point for incidents from Tier 1 NOC.
Perform deep-dive troubleshooting across Windows Servers (AD, DNS, GPO, IIS), Linux systems, and databases (MySQL, PostgreSQL, MSSQL).
Own incidents end-to-end from identification through resolution and root cause analysis (RCA).
Participate in major incident response (P1/P2) and post-incident reviews.
Configure and optimize monitoring using tools such as Zabbix, including templates, triggers, thresholds, and discovery rules.
Reduce alert fatigue by tuning thresholds, eliminating noise, and improving signal quality.
Implement proactive monitoring strategies including capacity planning, trend analysis, and predictive alerting.
Manage and troubleshoot Windows Server environments, Linux distributions (Ubuntu, RHEL, CentOS), virtualization platforms (VMware, Hyper-V, KVM), and cloud environments (Azure, AWS, GCP).
Perform system patching, updates, performance tuning, and service hardening.
Monitor and troubleshoot MySQL, PostgreSQL, and MSSQL databases.
Diagnose and resolve database performance issues such as slow queries, lock contention, and missing or inefficient indexes.
Support backup and restore validation, as well as replication and availability issues.
Build and maintain automation scripts using Bash, PowerShell, or Python.
Improve operational efficiency through runbooks, self-healing scripts, and API integrations.
Collaborate with engineering teams on Infrastructure as Code (Terraform, Ansible).
Support multiple customers across industries including finance, healthcare, SaaS, and manufacturing.
Ensure SLA compliance through proper ticket handling and prioritization in tools such as Jira or ServiceNow.
Maintain runbooks, SOPs, and troubleshooting guides.
Contribute to the knowledge base and internal training initiatives.
Mentor Tier 1 engineers.