Incident Response Analyst II

Astreya•San Jose, CA

56d•Onsite

About The Position

The IRC (Incident Response Center) is the first layer of defense, responsible for quick detection and incident response using various monitoring and automation tools, including thorough investigation, classification, and triage of alerts. The IRC Analyst is responsible for delivering operations within the IRC across all client data center sites globally. IRC analysts are expected to respond to all alarms/alerts set in the data center environment, including those from Data Center Infrastructure Management (DCIM), Server Automation Operations System (SAOS), CCTV, Access Control Systems (ACS), and Building Management Systems (BMS), and to provide resolver groups with a deep understanding of, and intelligence on, the criticality and impact of incidents.

Requirements

2+ years of experience in a NOC, command center, or similar 24/7 operations environment
Ability to quickly triage and prioritize multiple incidents based on risk
Knowledge of systems including IP Networks, DC Environment, and Server Health
Strong written and verbal communication skills
Works well under pressure and within deadlines
Excellent collaboration abilities
Strong analytical and problem-solving skills
Ability to work independently and as part of a team
Familiarity with data protection laws such as GDPR
This is an on-site role at client facilities
Must be willing to work variable shifts, including nights, weekends, and holidays

Nice To Haves

Degree in Information Technology
Networking knowledge (IP, DNS, load balancing)
Experience with Grafana, ticketing systems, and DC infrastructure
Certifications such as CompTIA Server+ or Schneider Electric DCCA
Experience with Lenel, Genetec, or Avigilon systems is a plus
Proficiency with programming/scripting tools

Responsibilities

Analysts are responsible for the full lifecycle of incident management, from detection through to resolution and root cause analysis (RCA). This includes acting as incident commanders, maintaining SLAs, documenting actions, and providing insights to support continuous improvement efforts across teams and systems.
Investigate, report, and respond to alerts, and participate in incident response (war rooms, remote bridges).
Respond to incidents and critical situations in a calm, problem-solving manner, and conduct in-depth investigation of alerts.
Be the first line of defense using monitoring and automation tools to conduct investigation, classification, and triage, all within prescribed SLAs.
Provide deep understanding and intelligence of incident criticality and impact to resolver groups.
Ensure detailed records of alarm handling activities, including actions taken and resolutions in ticketing tools; file incident reports.
Act as incident commander during major incidents.
Understand internal/external communication methods and stakeholder responsibilities.
Support program managers and facilitate project deliverables that improve operational and engineering initiatives.
Conduct root cause analysis (RCA) to identify recurring problems.
Use in-depth questioning and analysis to determine the underlying cause of incidents or problems (Who, What, Where, When, Why).
Perform duties in compliance with SOPs, MOPs, Runbooks, and Playbooks.
Continuously monitor alarm dashboards and systems.
Investigate and respond to alarms related to Network, Data Center Environment, Server Health, Facility Security, and Safety.
Identify and acknowledge incidents associated with alarms.
Assess incidents to determine their criticality and operational impact.
Engage resolver groups and escalate to higher tiers or management following established paths.
Maintain communication with teams, stakeholders, and incident responders.
Follow documented procedures to resolve incidents promptly and effectively.
Ensure accurate records of alarm handling and resolution activities in ticketing tools.
Comply with SOPs, MOPs, Runbooks, and Playbooks.
Monitor Everbridge Visual Command Center (VCC), International SOS emails, and open-source tools for real-time incidents affecting ByteDance assets and travelers.
Monitor tools or queries for specific stakeholder requests.
Report on violence, severe weather, or threats to life, property, and assets.
Coordinate emergency responses, including with law enforcement if required.
Verify incident information accuracy through secondary sources.
Generate heatmaps to highlight affected areas during significant events.
Collaborate with security and operational teams for a coordinated response.
Implement incident containment and mitigation strategies.
Document incident details, response actions, and lessons learned.
Follow SOPs, MOPs, Runbooks, and Playbooks.
Monitor Closed-Circuit Television (CCTV) and Access Control Systems (ACS).
Track alarms for safety events including electrical issues, fire hazards, equipment failures, and water leaks.
Review camera footage for quality and area coverage.
Investigate and report access control incidents.
Report findings to the Security and Safety Engineering teams.
Follow SOPs, MOPs, Runbooks, and Playbooks.
Monitor cloud infrastructure in real time using tools such as AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring (formerly Stackdriver).
Triage and escalate alerts related to cloud-based services and resources (e.g., compute, storage, networking).
Coordinate with Cloud Engineers and DevOps teams during cross-environment incidents to ensure rapid resolution and clear communications.
Identify and classify cloud service anomalies, including misconfigurations, degraded services, and unauthorized access attempts.
Document root cause analysis (RCA) and corrective actions for cloud incidents, feeding findings back into playbooks and runbooks.

Benefits

Medical provided through Cigna (PPO, HSA, EPO options); Kaiser (HMO option only) available for California employees only
Dental provided through Cigna (DPPO & DHMO options)
Nationwide Vision provided through VSP
Flexible Spending Account for Health & Dependent Care
Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera
Corporate Wellness Program
Employee Assistance Program
Wellness Days
401k Plan
Basic Life, Accidental Life, and Supplemental Life Insurance
Short Term & Long Term Disability
Critical Illness, Critical Hospital, and Voluntary Accident Insurance
Tuition Reimbursement (available 6 months after start date, capped)
Paid Time Off (accrued and prorated, maximum of 120 hours annually)
Paid Holidays
Any other statutory leaves, paid time, or other fringe benefits required under state and federal law