Observability Engineer (US)

TD, Mount Laurel, NJ
5d, Onsite

About The Position

The Observability Engineer provides technical leadership to improve the design and operation of systems in alignment to reliability engineering best practices and overall Technology and Bank strategies, applying the practices of computer science and software engineering to the design and development of large, complex systems.

  • Expert Observability Engineer role with comprehensive expertise in leading-edge theories, engineering practices, extensive coding and scripting
  • Advanced and highly specialized knowledge of applications, systems, networks, innovation models, design activities, best practices, business / organization, Bank standards, and may fulfill a governance role
  • Engineering specialist assigned to work autonomously on high profile, complex and/or high-risk technology initiatives with significant impact to the organization
  • Provides technical leadership / consulting / direction to multiple businesses and product teams, growing capability across the organization
  • Resolves unique and complex problems that have a broad impact on the business
  • Authoritative expert on site reliability issues within area of specialization
  • Understands the journey of an enterprise transformation where there is a hybrid cloud/non-cloud operating model
  • Drives end/end accountability of products and services across the enterprise through collaboration and transparency
  • Primarily works at the product umbrella, segment, LOB or Product Family level

Requirements

  • University degree in Computer Science or related technical field involving systems engineering or equivalent practical experience.
  • 10+ years of engineering experience (e.g., software or platform).

Nice To Haves

  • Strong SRE foundations
  • Expertise in reliability engineering concepts including SLI, SLO, error budget, incident and problem management, resiliency patterns, etc.
  • Experience with observability platforms including Dynatrace, Splunk, Datadog, etc.
  • Experience with automated monitoring and alerting: building monitoring frameworks, dashboards, alert tuning.
  • Advanced understanding of distributed and cloud native systems engineering including scalability, high availability, and performance optimization
  • Strong hands-on experience with Azure and infrastructure as code using Terraform
  • Experience with CI/CD pipelines and automated testing frameworks
  • Proficiency in at least one modern language (Python or Java) with strong scripting skills
  • Strong technical leadership and cross functional influencing experience with ability to communicate abstract technical concepts in business meaningful terms.
  • Technical strategy and roadmap development experience
  • 3+ years of design, build, and deployment of cloud native solutions on public and private cloud, including AWS and Azure.
  • 3+ years of design, review, and coding experience integrating various application requirements (non-functional, security, integration, performance, quality, operations etc.).
  • Infrastructure as code experience with Terraform and AWS CloudFormation.
  • 3+ years of demonstrated experience with virtual networks, load balancing, VPNs, VPCs, etc.
  • 3+ years of experience with Docker and Kubernetes.
  • 3+ years of monitoring and logging with CloudWatch, Datadog, Grafana, etc.
  • Cloud certifications (e.g., AWS, Azure) are highly desired.

Responsibilities

  • Provides technical leadership to improve the design and operation of systems in alignment to reliability engineering best practices and overall Technology and Bank strategies, applying the practices of computer science and software engineering to the design and development of large, complex systems
  • Drives and influences integrated DevOps solutions across business, product, platform, infrastructure, development, support/DevOps teams that improve the design and operation of systems, making them scalable, reliable, and efficient while ensuring performance and high availability of products/services
  • Ensures availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of products/service(s) including enterprise systems that may serve multiple services and applications/segments
  • Influences and partners with key technology and product team members in the design and development of solutions that promote automation and the elimination of toil; identifies optimal ways to make systems more scalable, more reliable, and more efficient, and is able to implement the required changes
  • Defines and prioritizes problems to solve with applications / products / services and respective systems, and drives the resolution / remediation with technology teams across design, implementation, and support
  • Develops deep relationships with Product Owners, Tech Leads and Ops to build transparency and help foster end to end accountability of products and services
  • Works in close partnership with technology teams to support TD's business objectives and operational support goals providing domain expertise on strategic Infrastructure as well as Business project related activities
  • Reviews technical deliverables throughout the design and development phase to ensure systems adhere to SRE best practices
  • Ensures adherence of Operational (Production) Readiness practices of respective products and services
  • Sets service-level objectives (SLOs) that define the availability of a particular product or service, and exercises key decision rights of the SRE role (e.g., supporting release to production, rejecting software that is operationally substandard, and directing developers to improve the code)
  • Implements the observability requirements to monitor and assure that our systems measure to the expected service levels and perform with the appropriate operational characteristics
  • Focuses on reliability, scalability, and the development of the production computing infrastructure, including highly complex and scalable systems
  • Develops observability standards to ensure that production systems operate under known conditions, and transparently provides these measurements to anticipate when errors or failures can arise
  • Engineers solutions through problem post-mortem reviews to ensure that problem solutions are complete and that errors do not recur
  • Anticipates internal and external business challenges, helping teams find solutions through continuously improving on process and technologies
  • Leads interaction with governance and control groups (e.g., regulatory / operational risk, compliance, and audit) to provide subject matter expertise and consult on risk issues related to Engineering technology and tools
  • Leads or contributes to cross-functional / enterprise initiatives as an organizational or subject matter expert helping to identify risk / provide guidance for significant and complex situations
  • Proactively identifies emerging technologies and innovative solutions for building more robust platform domains; keeps abreast of emerging issues, trends, and evolving regulatory requirements and assesses potential impacts
  • Protects the interests of the organization: identifies and manages risks, and escalates non-standard, high-risk transactions / activities as necessary
  • Maintains a culture of risk management and control, supported by effective processes in alignment with risk appetite
  • Participates fully as a member of the team, supports a positive work environment that promotes service to the business, quality, innovation, and teamwork, and ensures timely communication of issues / points of interest
  • Provides thought leadership and / or industry knowledge for own area of expertise and participates in knowledge transfer within the team and business unit
  • Keeps current on emerging trends / developments and grows knowledge of the business, related tools, and techniques
  • Participates in personal performance management and development activities, including cross training within own team
  • Keeps others informed and up to date about the status / progress of projects and / or all relevant or useful information related to day-to-day activities
  • Contributes to team development of skills and capabilities through mentorship of others, by sharing knowledge and experiences and leveraging best practices
  • Leads, motivates and develops relationships with internal and external business partners / stakeholders to develop productive working relationships
  • Contributes to a fair, positive and equitable environment that supports a diverse workforce
  • Acts as a brand ambassador for your business area/function and the bank, both internally and/or externally

Benefits

  • TD is committed to providing fair and equitable compensation opportunities to all colleagues.
  • Growth opportunities and skill development are defining features of the colleague experience at TD.
  • Our compensation policies and practices have been designed to allow colleagues to progress through the salary range over time as they progress in their role.
  • Total Rewards at TD includes base salary and variable compensation/incentive awards (e.g., eligibility for cash and/or equity incentive awards, generally through participation in an incentive plan) and several other key plans such as health and well-being benefits, savings and retirement programs, paid time off (including Vacation PTO, Flex PTO, and Holiday PTO), banking benefits and discounts, career development, and reward and recognition.
  • Through regular development conversations, training programs, and a competitive benefits plan, we’re committed to providing the support our colleagues need to thrive both at work and at home.
  • You’ll have regular career, development, and performance conversations with your manager, as well as access to an online learning platform and a variety of mentoring programs to help you unlock future opportunities.
  • We will provide training and onboarding sessions to ensure that you’ve got everything you need to succeed in your new role.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

5,001-10,000 employees
