Frontline Engineer

TCGplayer
Remote

About The Position

At eBay, we’re more than a global ecommerce leader: we’re changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We’re committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts. Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work, every day. We’re in this together, sustaining the future of our customers, our company, and our planet. Join a team of passionate thinkers, innovators, and dreamers, and help us connect people and build communities to create economic opportunity for all.

About the team and the role:

TCGplayer, now a part of eBay, connects a global community of millions of buyers with tens of thousands of retailers in a $25B global collectible card game and collectible hobbyist space. We pride ourselves on a culture of inclusion that fosters camaraderie, embraces diversity, and exudes passion.

The Frontline Engineering team at TCGplayer (part of eBay) plays a pivotal role in ensuring the reliability, availability, and seamless performance of our platform, which serves millions of buyers and sellers globally within the $25B collectible hobbyist space. As the first line of defense for incident response and problem management, you’ll have a direct impact on customer trust and satisfaction. Reporting to our Incident and Problem Management leader, you’ll collaborate closely with Cloud Operations, Site Reliability Engineering (SRE), Engineering teams, and Product stakeholders. Our team fosters an inclusive and collaborative culture, encouraging open communication, continuous learning, and professional growth.

This position is fully remote with a preference for candidates working within Eastern Standard Time (EST) or Central Standard Time (CST) hours. Participation in an on-call rotation and occasional off-hours support for incidents is required.

Requirements

  • A Bachelor’s degree in a technical field or equivalent experience (5+ years) in system administration, infrastructure engineering, or related roles; relevant certifications are a plus.
  • Direct experience as an incident commander, including managing live incident calls, coordinating triage efforts, and driving communications during high-pressure situations.
  • Strong communication skills with the ability to clearly articulate technical details and strategies to both technical and non-technical stakeholders.
  • Excellent problem-solving capabilities, able to stay composed and decisive under pressure during high-impact incidents.
  • Hands-on operational experience with AWS in a production environment, specifically executing runbooks, restarting EC2 instances, checking alarms, and pulling logs from CloudWatch.
  • Proficiency with Kubernetes, including troubleshooting containerized workloads, understanding pod health, managing deployments, and interacting directly with Kubernetes clusters.
  • Experience with scripting (Python, PowerShell, or Bash) to automate operational tasks or assist in incident resolution workflows.

Responsibilities

  • Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents to minimize downtime and business disruption.
  • Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services (EC2, CloudWatch, IAM) and Kubernetes clusters (pods, deployments, scaling).
  • Collaborate closely with engineering teams post-incident, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.
  • Drive operational excellence by measuring and analyzing critical metrics (e.g., MTTR, SLA adherence) to identify improvement opportunities and implement impactful solutions.
  • Continuously refine and update operational runbooks and procedures, ensuring alignment with evolving technologies and business needs.
  • Proactively contribute to long-term strategic initiatives to improve incident management practices.

Benefits

  • The total compensation package for this position may include a target bonus and restricted stock units (as applicable), in addition to a full range of medical, financial, and/or other benefits, including 401(k) eligibility and paid time off benefits such as PTO and parental leave.
  • Details of participation in these benefit plans will be provided if an employee receives an offer of employment.