Web Scraping Specialist

MLabs | New York, NY
$75,000 - $150,000 | Remote

About The Position

We are hiring on behalf of our client, who is seeking a Web Scraping Specialist to join a specialized technical team that builds the infrastructure delivering large volumes of web data for training advanced AI models. The organization operates a massive distributed crawler and manages complex pipelines for ingesting, segmenting, and annotating billions of data points, including videos, transcripts, and audio files. The successful candidate will lead efforts to gather and analyze data, optimize scraping processes, and help scale access to high-quality public web data. This role is ideal for a lean, technical builder who thrives in a fast-paced environment without bureaucratic red tape.

Requirements

  • Extraction Expertise: Demonstrated ability to extract data from complex websites with minimal supervision, supported by a portfolio of past projects.
  • Technical Proficiency: Advanced skills in Python or JavaScript, specifically with libraries and frameworks such as BeautifulSoup, Scrapy, or Selenium.
  • Advanced Programming: Strong knowledge of asynchronous programming, multithreading, and distributed scraping architectures.
  • Web Fundamentals: In-depth knowledge of HTML, CSS, JavaScript, and the Document Object Model (DOM).
  • Data Storage: Experience with NoSQL databases (e.g., MongoDB, Cassandra), including the ability to design efficient storage solutions.
  • Cloud Infrastructure: Experience deploying and managing large-scale scraping jobs using cloud services such as AWS, Google Cloud, or Azure.

Nice To Haves

  • Ability to apply machine learning algorithms for data cleaning, categorization, or predictive analysis
  • Active participation in relevant open-source projects

Responsibilities

  • Code Development: Write, test, and refine high-performance code to extract data from various online sources, ensuring maximum reliability and efficiency.
  • Data Retrieval: Manage complex data retrieval tasks, including handling pagination and dynamic content loaded via AJAX.
  • Data Quality: Clean and format extracted data to ensure it meets rigorous quality standards for downstream analysis and processing.
  • Database Management: Store and manage scraped data in appropriate databases, optimizing for both access speed and long-term data integrity.
  • Monitoring and Maintenance: Regularly monitor scraping processes and infrastructure to identify and resolve issues, ensuring a continuous and stable data flow.

Benefits

  • Competitive Compensation: A highly competitive salary ranging from $75,000 to $150,000, complemented by a comprehensive benefits and equity package.
  • Impactful Work: The opportunity to work at the forefront of AI development and web-scale knowledge graph creation.
  • High-Output Culture: A professional environment that prioritizes low ego, technical autonomy, and rapid execution.
  • Remote Flexibility: This is a remote position requiring a 6-hour overlap with the core team's schedule.