Software Engineer – DevOps / MLOps Engineer

Chefman • Mahwah, NJ

$115,000 - $150,000

About The Position

In 2020, we launched CHEF iQ, an ecosystem of connected kitchen appliances designed to transform how people cook and connect through food. Our mission is to make great cooking effortless through intelligent technology, guided experiences, and seamless integration between hardware, software, and AI.

As CHEF iQ continues to expand its AI capabilities, we are building the infrastructure and platforms that will power the next generation of connected cooking experiences. From machine learning and computer vision to Generative AI applications, our success depends on scalable, reliable systems that enable rapid innovation and deployment.

We are seeking a highly detail-oriented DevOps / MLOps Engineer to serve as a critical partner to our Machine Learning Engineer, building and maintaining the cloud infrastructure, deployment pipelines, automation frameworks, and operational foundations that support AI development at scale. This individual will play a key role in improving engineering efficiency, increasing system reliability, and ensuring our AI-powered products can be developed, deployed, and scaled successfully.

The ideal candidate is passionate about automation, process improvement, and building highly scalable systems. They enjoy creating order from complexity, eliminating operational bottlenecks, and enabling teams to move faster. Experience supporting AI, machine learning, and Generative AI applications in production environments is required.

Requirements

5+ years of experience in DevOps, MLOps, Site Reliability Engineering (SRE), Platform Engineering, or related software engineering roles.
Strong hands-on experience with AWS services and cloud-native architecture.
Experience building and supporting AI and machine learning platforms in AWS environments.
Experience supporting AI, machine learning, and Generative AI applications in production environments.
Experience working with AWS Bedrock, Generative AI services, foundation models, LLM-powered applications, or related AI infrastructure.
Strong experience implementing Infrastructure as Code using Terraform.
Experience with containerization and orchestration technologies such as Docker and Kubernetes.
Strong experience building and maintaining CI/CD pipelines and deployment automation.
Experience supporting machine learning workflows, model deployment, monitoring, and MLOps platforms.
Strong programming and scripting skills in Python, Bash, or similar languages.
Experience with monitoring, logging, observability, and operational tooling.
Strong troubleshooting, systems-thinking, and problem-solving abilities.
Excellent communication and cross-functional collaboration skills.
Proven track record of improving engineering processes, increasing operational efficiency, and scaling software platforms.
Highly organized and detail-oriented with a passion for automation and continuous improvement.

Nice To Haves

Experience with vector databases, retrieval-augmented generation (RAG), model serving, and AI infrastructure.
Experience supporting connected devices, IoT platforms, embedded systems, or consumer technology products.
Domain expertise in machine learning infrastructure, AI platforms, consumer applications, connected products, or similar technology environments.
Experience working in fast-paced startup or high-growth product organizations.

Responsibilities

Design, implement, and maintain scalable AWS cloud infrastructure supporting software, AI, and machine learning applications.
Build and manage Infrastructure as Code (IaC) using Terraform to ensure repeatable, secure, and scalable environments.
Develop and maintain CI/CD pipelines that enable rapid, reliable software and machine learning deployments.
Create and manage MLOps infrastructure for model training, deployment, monitoring, versioning, and lifecycle management.
Partner closely with the Machine Learning Engineer to establish the tools, workflows, and infrastructure required for successful AI development and deployment.
Support Generative AI initiatives by building infrastructure and deployment frameworks for applications utilizing AWS Bedrock, foundation models, LLMs, and related AI services.
Implement monitoring, logging, observability, and alerting systems across software, infrastructure, and machine learning platforms.
Continuously identify opportunities to improve engineering processes, reduce manual effort, increase automation, and improve system reliability.
Optimize cloud environments for scalability, performance, availability, and cost efficiency.
Support security, compliance, backup, disaster recovery, and operational best practices across all environments.
Troubleshoot infrastructure, deployment, and application issues across development, testing, and production environments.
Document infrastructure architecture, deployment processes, operational procedures, and engineering standards.
Contribute to establishing best practices for DevOps, MLOps, cloud architecture, and AI operations.