TSMC
Full-time • Senior
Phoenix, AZ
5,001-10,000 employees

A job at TSMC Arizona offers the opportunity to work at the most advanced semiconductor fab in the United States. TSMC Arizona’s first fab will run the company’s leading-edge N4 process technology, with production starting in the first half of 2025. The second fab will use leading-edge N3 and N2 process technology and be operational in 2028. The recently announced third fab will manufacture chips on 2nm or even more advanced process technology, with production starting by the end of the decade. America’s leading technology companies are ready to rely on TSMC Arizona for the next generations of chips that will power the digital future.

As a Senior Data Engineer on the AI Data Curation track, you will ensure that the data powering our AI models is high-quality, well-organized, and fit for use in model training and deployment. You will play a key role in designing and maintaining scalable data pipelines, ensuring that data is clean, relevant, and aligned with ethical and compliance standards.

Responsibilities:
  • Design and implement data pipelines for processing, cleaning, and curating large datasets used in model training and fine-tuning.
  • Automate data cleaning processes (e.g., removing noise, duplicates, irrelevant content) and ensure datasets are appropriately labeled and structured.
  • Collaborate with model teams to ensure data aligns with model requirements and performance goals.
  • Assess and mitigate bias in datasets, ensuring that models are trained on diverse and representative data.
  • Manage data storage and retrieval strategies, ensuring scalability and data consistency across different environments.
  • Conduct regular audits to ensure data integrity, privacy, and security compliance.
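To illustrate the cleaning and curation work described above, here is a minimal Pandas sketch of a deduplication and noise-filtering pass (the `text` column name and `min_len` threshold are hypothetical, not part of this role's actual tooling):

```python
import pandas as pd

def curate(df: pd.DataFrame, text_col: str = "text", min_len: int = 10) -> pd.DataFrame:
    """Toy curation pass: drop exact duplicates, missing rows, and very short text."""
    df = df.drop_duplicates(subset=[text_col])               # remove duplicate records
    df = df.dropna(subset=[text_col])                        # remove missing text
    df = df[df[text_col].str.strip().str.len() >= min_len]  # filter near-empty noise
    return df.reset_index(drop=True)

raw = pd.DataFrame({"text": [
    "A long enough example sentence.",
    "A long enough example sentence.",  # exact duplicate -> dropped
    "short",                            # below min_len -> dropped
    None,                               # missing -> dropped
]})
clean = curate(raw)
print(len(clean))  # 1
```

Real curation pipelines would add near-duplicate detection, language filtering, and PII scrubbing on top of a pass like this.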
Qualifications:
  • Bachelor's degree in Computer Science, Data Science, or a related field.
  • 5+ years of experience in data engineering, data wrangling, or data curation, particularly in machine learning or AI-driven environments.
  • Strong proficiency in Python (Pandas, NumPy) and SQL for data manipulation and querying.
  • Familiarity with cloud-based data storage (AWS S3, Google Cloud Storage, etc.) and distributed systems for managing large datasets.
  • Experience with data annotation tools and platforms for manual or semi-automated labeling.
  • Experience with NLP data formats, such as JSONL, text, or embeddings, and an understanding of tokenization.
  • Experience managing data pipelines with tools such as Apache Kafka, Apache Airflow, or similar ETL tooling.
  • Strong knowledge of AI ethics, data privacy, and compliance standards (GDPR, CCPA, etc.).
  • Experience with vector databases and indexing for LLMs (e.g., FAISS, Pinecone).
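For context on the JSONL format mentioned above: it stores one JSON object per line, which makes large datasets easy to stream record by record. A minimal reader, using only the standard library (the `text` field name is illustrative):

```python
import io
import json

# JSONL: one JSON object per line; an in-memory stand-in for a file here.
sample = io.StringIO('{"text": "hello world"}\n{"text": "second record"}\n')

# Parse line by line, skipping blanks, without loading the whole file at once.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records))        # 2
print(records[0]["text"])  # hello world
```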
Benefits:
  • Comprehensive medical, dental, and vision plan offerings.
  • Income-protection programs for injury or illness.
  • 401(k) retirement savings plan.
  • Competitive paid time-off programs and paid holidays.