Job Overview
Category
Software Architecture and Design
Job Description
Responsibilities
- Deploy, configure, and manage GPU-enabled Kubernetes clusters and standalone Linux compute environments to ensure optimal performance and workload scheduling
- Implement and administer Volcano job scheduling, including queue configuration, Pod execution, GPU allocation, and namespace quota enforcement
- Oversee end-to-end Kubernetes environments, including namespaces, RBAC, resource quotas, and workload isolation strategies
- Develop and maintain automation scripts using Python and Shell to streamline job submission, resource provisioning, and system reporting
- Collaborate with orchestration, optimization, and observability teams to enhance scheduling efficiency, capacity utilization, and researcher workflows
- Monitor the health and resource utilization of infrastructure, providing insights and data to support optimization and reporting requirements
- Identify and recommend improvements for infrastructure, tooling, and automation workflows to enhance scalability, usability, and performance
- Ensure seamless operational processes to deliver efficient experiences for researchers working on diverse AI and computational workloads
Requirements
- At least 3 years of experience in DevOps or infrastructure engineering roles within large-scale, complex environments
- Advanced proficiency in Kubernetes administration, including namespaces, Pod scheduling and distribution, PVCs, NFS, and resource quota management
- Hands-on experience with the Volcano scheduler, including GPU job execution, queue configuration, workload prioritization, and Kubernetes integration
- Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes
- Advanced Python scripting skills for automating infrastructure tasks, along with strong UNIX Shell scripting expertise (e.g., Bash)
- Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
- Solid understanding of infrastructure automation and orchestration concepts and tools
- Fluent English, both written and spoken (B2+ level or higher)
Nice to have
- Experience with Helm package management for Kubernetes applications
- Knowledge of monitoring and observability tools, including Prometheus, Grafana, and Loki
- Familiarity with Infrastructure as Code tools such as Terraform
- Multi-cloud Kubernetes experience across platforms like Amazon EKS and Google GKE
- Understanding of Azure networking concepts, including VPN, ExpressRoute, and network security
- Experience with AI-assisted coding tools such as GitHub Copilot, ChatGPT, or Claude
- Knowledge of hybrid environments combining cloud and on-premises resource scheduling and optimization
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Job details
- Seniority level: Mid-Senior level
- Employment type: Full-time
- Job function: Engineering, Information Technology, and Business Development
- Industries: Software Development, IT Services and IT Consulting, and Technology, Information and Internet
EPAM Systems is actively hiring for this Senior DevOps Engineer position