About The Position

As a Systems Engineer specializing in Performance Engineering and Site Reliability Engineering (SRE), you will drive the reliability, scalability, and performance of critical enterprise systems. You will create and implement performance test plans to evaluate system operations and detect performance bottlenecks, applying SRE practices to ensure robust system operations, automation, and continuous improvement. Using testing tools, you will analyze the application's CPU usage, memory usage, and other performance metrics. You will also develop and recommend monitoring profiles for the underlying infrastructure and work closely with technical stakeholders to interpret test results and identify possible system backlogs.

Requirements

  • Bachelor's degree in Computer Science, Information Science, or a related field.
  • 5+ years of experience in architecting performance test automation solutions and SRE practices
  • Experience in performance testing web applications and middleware applications.
  • Experience implementing at least one chaos testing tool.
  • Ability to design and execute chaos experiments at different layers of an application on cloud infrastructure.
  • Proven ability to create automated test scripts, test scenarios, and analyze results using LoadRunner, JMeter, and BlazeMeter
  • Experience in performance testing and tuning of complex large-scale enterprise applications in the Retail industry.
  • Ability to identify system bottlenecks, with strong troubleshooting, problem-solving, and reasoning skills.
  • Strong programming skills (Python required; Java a plus).
  • A systems thinker, able to move fluidly between high-level abstract thinking and detail-oriented implementation, open-minded to new ideas, approaches, and technologies.
  • Strong experience with Python, JMeter, code profiling, and monitoring/observability tools.
  • Data mining experience using custom shell scripts and complex Splunk queries for troubleshooting and testbed setup.
  • Experience in reporting to all levels of an organization regarding testing results and the ability to build monitoring dashboards.
  • Knowledge of databases, indexes, and SQL optimization techniques in Oracle.
  • Proficiency in monitoring/observability tools (Dynatrace required)
  • Good understanding of the factors that influence software application performance at multiple layers, including database, network, CPU utilization, JVM tuning, memory analysis, thread management, and query performance.
  • Solid understanding of APIs and experience in creating and measuring performance for Web Services
  • Knowledge of UNIX, Linux, Windows, Java, MS SQL, C, C++, Python, GoScript, Greenplum, ATG, QT4, Oracle, Excel macros, APIGEE, PingIdentity, Kafka, TCP/IP, Networking and LAN monitoring.
  • Experience with cloud platforms (AWS, Azure, GCP), infrastructure automation (Terraform, Ansible, etc.), and running performance tests against cloud-based services.
  • Experience in reporting and building dashboards for technical and non-technical audiences.

Responsibilities

  • Develop and implement performance test plans to evaluate system operations and detect bottlenecks.
  • Analyze CPU, memory, and other performance metrics using industry-standard tools.
  • Identify, track, and communicate performance issues, memory leaks, and bottlenecks to stakeholders.
  • Collaborate with engineers, architects, and business teams to define performance SLAs and monitoring strategies.
  • Lead and mentor performance testing teams, conduct chaos testing, and replicate production issues in test environments.
  • Define and enforce resilience and reliability best practices.
  • Plan and manage deliverables for performance diagnostics, capacity planning, architecture design, tuning, and monitoring.
  • Conduct system security, performance, and stress testing; analyze results and recommend improvements.
  • Identify areas for performance and process improvement, and define roadmaps for enhancement.
  • Perform web/mobile application and network penetration testing, including vulnerability exploitation and documentation.
  • Provide technical assistance to improve system performance, capacity, reliability, and scalability.
  • Collaborate on service-level objectives (SLOs), error budgets, and reliability metrics.
  • Implement and refine observability solutions (metrics, logs, traces) to proactively detect and resolve issues.
  • Champion best practices for system reliability, scalability, and disaster recovery.
  • Work closely with development and operations teams to ensure seamless integration of reliability engineering into the software lifecycle.