We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our cloud-based infrastructure.
You will work at the intersection of development and operations, focusing on automation, observability, and streamlined delivery across AWS ecosystems.
Responsibilities 
- Design, implement, and maintain cloud infrastructure on AWS to enable scalable and high-performing systems 
- Automate infrastructure provisioning and management using Infrastructure as Code (IaC) tools like Terraform 
- Build and maintain CI/CD pipelines, integrating quality gates and deployment controls for seamless release processes 
- Monitor system health and performance by implementing observability tools and creating dashboards for real-time insights 
- Troubleshoot and resolve infrastructure issues in development and production environments to minimize downtime 
- Collaborate closely with development teams to align infrastructure with application needs and support continuous improvement initiatives 
- Document infrastructure configurations and operational procedures to ensure maintainability and knowledge sharing 
- Stay up to date with best practices in DevOps and cloud technologies to improve system reliability and cost-efficiency and to shorten deployment cycles
- Promote a client-centric approach by focusing on infrastructure decisions that support internal teams and end users 
Requirements 
- 2+ years of experience with cloud platforms, especially AWS 
- Expertise in using configuration management tools such as Ansible, with competency in Docker and Linux environments 
- Proficiency in Infrastructure as Code (IaC) tools, specifically Terraform, to automate provisioning and deployment processes 
- Background in building and managing CI/CD pipelines with a focus on integrating secure and reliable deployment practices 
- Knowledge of system monitoring and observability best practices, with skills in creating and maintaining real-time dashboards 
- Familiarity with troubleshooting and resolving complex infrastructure issues across development and production environments 
- English proficiency at a B1+ level 
Nice to have 
- Familiarity with Azure DevOps for managing development workflows and pipelines 
- Scripting skills, particularly in Groovy, for automating operational tasks
- Understanding of build and automation tools like Jenkins for continuous integration and delivery
We offer 
- International projects with top brands 
- Work with global teams of highly skilled, diverse peers 
- Healthcare benefits 
- Employee financial programs 
- Paid time off and sick leave 
- Upskilling, reskilling and certification courses 
- Unlimited access to the LinkedIn Learning library and 22,000+ courses 
- Global career opportunities 
- Volunteer and community involvement opportunities 
- EPAM Employee Groups 
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn