Overview
We are looking for an experienced DevOps Engineer to join EPAM’s team.
This position focuses on the implementation, automation, and optimization of Kubernetes-based orchestration platforms, such as Volcano for GPU-enabled workloads, while managing Linux infrastructure to support advanced AI and research projects.
The ideal candidate is driven by a passion for scalable infrastructure, efficiency, and enabling cutting-edge computational solutions.
Responsibilities
- Deploy, configure, and maintain GPU-enabled Kubernetes clusters and Linux compute environments to ensure efficient workload performance and scheduling
- Implement and manage Volcano job scheduling, including queue setup, pod execution, GPU resource allocation, and namespace quota management
- Administer Kubernetes environments, covering namespaces, RBAC, resource quotas, and workload isolation techniques
- Create and maintain Python and Shell scripts to automate job submissions, resource provisioning, and system reporting functionalities
- Collaborate with teams focused on orchestration, optimization, and observability to improve capacity utilization, scheduling efficiency, and researcher operations
- Monitor infrastructure health and resource usage, providing insights and data for optimization and reporting needs
- Propose and implement improvements for infrastructure workflows, tools, and automation to enhance usability, scalability, and performance
- Ensure smooth operational processes to provide researchers with efficient support for diverse AI and computational workloads
Requirements
- At least 2 years of experience in DevOps or infrastructure engineering roles within complex, large-scale environments
- Expertise in Kubernetes administration, including namespaces, pod scheduling, PersistentVolumeClaims (PVCs), NFS storage, and resource quota management
- Practical experience with Volcano scheduler for GPU job execution, queue configuration, workload prioritization, and Kubernetes integration
- Demonstrated ability to manage GPU cluster environments, both within Kubernetes and standalone Linux compute nodes
- Advanced Python scripting expertise for infrastructure automation, along with proficiency in UNIX Shell scripting (e.g., Bash)
- Strong Linux system administration capabilities, including troubleshooting, performance optimization, and configuration management
- Solid understanding of infrastructure automation and orchestration tools and methodologies
- Proficiency in English at a B2+ level, both written and spoken, for effective communication
Nice to have
- Experience managing Kubernetes applications with Helm package management
- Familiarity with monitoring and observability tools such as Prometheus, Grafana, and Loki
- Knowledge of Infrastructure as Code tools like Terraform
- Experience with multi-cloud Kubernetes environments, including platforms like Amazon EKS and Google GKE
- Understanding of Azure networking, including VPN, ExpressRoute, and network security concepts
- Familiarity with AI-assisted coding tools, including GitHub Copilot, ChatGPT, or Claude
- Knowledge of hybrid cloud and on-premises resource scheduling and optimization strategies
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Job function
- Engineering, Information Technology, and Business Development
Industries
- Software Development, IT Services and IT Consulting, and Technology, Information and Internet