We are seeking a skilled and proactive DevOps Engineer to join our team. This role is focused on developing and managing scalable infrastructure and deployment workflows in AWS to support data-driven and machine learning applications. You will play a key role in building cloud-native systems with a strong emphasis on infrastructure as code, containerization, and CI/CD pipelines. A solid understanding of AWS services and Python is essential, particularly for authoring infrastructure using AWS CDK. Experience with SageMaker and knowledge of ML systems is a strong advantage.

Qualifications:
* 3–5 years of experience in DevOps, cloud infrastructure, or SRE roles.
* Proficient in AWS services, especially CDK, Lambda, EC2, S3, SageMaker, and CloudWatch.
* Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
* Strong experience with Python for scripting and infrastructure automation.
* Hands-on experience with containerization (Docker).
* Experience building and maintaining CI/CD pipelines.
Preferred Qualifications:
* AWS Certifications (e.g., DevOps Engineer, Solutions Architect, or Machine Learning Specialty).
* Background in software engineering or ML/AI infrastructure is a plus.
Key Responsibilities:
Infrastructure Development & Automation:
* Design, provision, and manage AWS infrastructure using AWS CDK and CloudFormation.
* Develop secure, scalable, and cost-efficient infrastructure to support machine learning and analytics workloads.
* Implement and manage cloud-native services such as EC2, ECS, Lambda, S3, RDS, SageMaker, and Bedrock.
* Ensure best practices for security, compliance, and disaster recovery are followed.
CI/CD & Deployment Automation:
* Design and maintain CI/CD pipelines for application and model deployment using tools like CodePipeline, CodeBuild, GitHub Actions, or similar.
* Automate testing, deployment, and rollback procedures to support continuous integration and delivery.
Containerization & Orchestration:
* Build and manage Docker containers for microservices and ML applications.
* Support deployment on ECS or Lambda with container-based runtimes.
* Implement image build, versioning, and artifact management workflows.
Machine Learning & Model Operations Support:
* Collaborate with ML engineers to deploy, monitor, and maintain models in SageMaker.
* Integrate infrastructure for pre-processing, inference, and retraining pipelines.
* Support model performance monitoring, logging, and metrics collection.
Monitoring, Observability & Logging:
* Set up monitoring and alerting using CloudWatch, Datadog, and other observability tools.
* Proactively troubleshoot and resolve infrastructure, deployment, and performance issues.
Collaboration & Documentation:
* Work closely with software, ML, and data teams to support DevOps best practices across the ML lifecycle.
* Maintain clear documentation for infrastructure, deployments, and operations processes.
* Participate in code reviews and architectural discussions.