Position: Senior Cloud / SRE Engineer

Risk Solutions, Raleigh, NC

About The Position

We are looking for a highly skilled SRE and Software Engineer responsible for continuous monitoring of live systems, proactive resolution of customer incidents and downtime, and on-call support, with a focus on reducing response times for incidents and software bugs.

The responsibilities fall into four areas: live system monitoring and proactive support; production support and observability; documentation and reporting; and platforms and automation.

Work in a way that works for you

We promote a healthy work/life balance across the organisation. We offer an appealing working prospect for our people. With numerous wellbeing initiatives, shared parental leave, study assistance and sabbaticals, we will help you meet your immediate responsibilities and your long-term goals. We support flexible working hours, so you can flex the times when you work in the day to fit everything in and work when you are most productive.

Working for you

We know that your wellbeing and happiness are key to a long and successful career. We offer country-specific benefits; those for this role are listed under Benefits.

U.S. National Base Pay Range: $104,900 - $174,700. Geographic differentials may apply in some locations to better reflect local market rates. This job is eligible for an annual incentive bonus.

About the Business

LexisNexis Legal & Professional® provides legal, regulatory, and business information and analytics that help customers increase their productivity, improve decision-making, achieve better outcomes, and advance the rule of law around the world. As a digital pioneer, the company was the first to bring legal and business information online with its Lexis® and Nexis® services.

At LexisNexis Reed Tech, our mission is to enable the advancement of humanity by delivering better outcomes to the innovation community. Our workflow and analytic solutions enable the innovation ecosystem to be more effective and efficient at bringing meaningful innovation to our world. We enable innovators to accomplish more by helping them make informed decisions, be more productive, comply with regulations and ultimately achieve superior results. LexisNexis Reed Tech is a part of LexisNexis Legal & Professional.

We are committed to providing a fair and accessible hiring process. If you have a disability or other need that requires accommodation or adjustment, please let us know by completing our Applicant Request Support Form or by contacting 1-855-833-5120.

Criminals may pose as recruiters asking for money or personal information. We never request money or banking details from job applicants. Please read our Candidate Privacy Policy.

We are an equal opportunity employer: qualified applicants are considered for and treated during employment without regard to race, color, creed, religion, sex, national origin, citizenship status, disability status, protected veteran status, age, marital status, sexual orientation, gender identity, genetic information, or any other characteristic protected by law. USA Job Seekers: EEO Know Your Rights.

Requirements

  • BS Engineering/Computer Science or equivalent experience required; advanced degree preferred
  • Experience in software development, with a strong background in supporting and maintaining live products
  • Experience in site reliability engineering, live production support, or a related role
  • Proven experience managing and supporting live systems in a production environment.
  • Proficiency in cloud platforms (e.g., AWS, Azure).
  • Experience with monitoring tools (e.g., Datadog, Splunk).
  • Strong scripting skills (e.g., Python, JS) and familiarity with automation tools.
  • Solid understanding of Site Reliability Engineering (SRE) principles and practices.
  • Strong understanding and experience with incident management, monitoring tools, IT service management frameworks and automation processes.
  • Previous experience in customer-facing roles or managing customer support escalations
  • Excellent technical problem-solving and troubleshooting abilities.
  • Strong communication and interpersonal skills, with the ability to collaborate across teams.
  • Tools and user management experience.
  • Experience defining and managing live support metrics and using them for continuous process improvement
  • Ability to manage multiple priorities and work effectively in a fast-paced environment.
  • Passion for continuous learning and staying up-to-date with industry trends and best practices.

Responsibilities

  • Design, implement, and maintain monitoring tools and processes to ensure continuous tracking of system performance, availability, and security.
  • Proactively identify potential issues through trend analysis and monitoring data, and take corrective actions before they impact customers.
  • Oversee the monitoring and proactive management of product performance, availability, and reliability.
  • Manage SLAs, performance benchmarks, alert thresholds, and response protocols for live systems.
  • Provide on-call and escalation management.
  • Define and implement the on-call escalation and resolution process for product incidents and issues.
  • Automate routine maintenance tasks to reduce manual intervention and improve system reliability.
  • Oversee the smooth operation and availability of live systems, ensuring minimal downtime and prompt resolution of incidents.
  • Lead the incident management process, including identification, troubleshooting, resolution, and post-incident analysis.
  • Collaborate with product, development, infrastructure, quality engineering and customer success teams to ensure seamless deployment and support of new features and updates.
  • Maintain detailed documentation of system configurations, processes, and procedures.
  • Provide regular reports on system performance, incident metrics, and customer issue resolution to senior management.
  • Ensure compliance with industry standards and best practices for reliability and security.
  • Work within or manage a cross-functional team in support of migrating applications to standard platforms.
  • Give direction and consultancy to others when implementing new Paved Road features or Platforms.
  • Analyze and make recommendations to improve the SDLC and CI/CD processes.
  • Create actionable reports on the operational health and lifecycle of platform and product components.

Benefits

  • Comprehensive, multi-carrier program for medical, dental and vision benefits
  • 401(k) with match
  • Employee Share Purchase Plan
  • Wellness platform with incentives
  • Headspace app subscription
  • Employee Assistance and Time-off Programs
  • Short- and Long-Term Disability
  • Life and Accidental Death Insurance
  • Critical Illness
  • Hospital Indemnity
  • Family Benefits, including bonding and family care leaves, adoption and surrogacy benefits
  • Health Savings, Health Care, Dependent Care and Commuter Spending Accounts
  • Up to two days of paid leave each to participate in Employee Resource Groups and to volunteer with your charity of choice