We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our cloud-based infrastructure.
You will work at the intersection of development and operations, focusing on automation, observability, and streamlined delivery across AWS ecosystems.
Responsibilities
- Design, implement, and maintain cloud infrastructure on AWS to enable scalable and high-performing systems
- Automate infrastructure provisioning and management using Infrastructure as Code (IaC) tools like Terraform
- Build and maintain CI/CD pipelines, integrating quality gates and deployment controls for seamless release processes
- Monitor system health and performance by implementing observability tools and creating dashboards for real-time insights
- Troubleshoot and resolve infrastructure issues in development and production environments to minimize downtime
- Collaborate closely with development teams to align infrastructure with application needs and support continuous improvement initiatives
- Document infrastructure configurations and operational procedures to ensure maintainability and knowledge sharing
- Stay up to date with best practices in DevOps and cloud technologies to improve system reliability and cost efficiency and to shorten deployment cycles
- Promote a client-centric approach by focusing on infrastructure decisions that support internal teams and end users
Requirements
- 2+ years of experience with cloud platforms, especially AWS
- Expertise in using configuration management tools such as Ansible, with competency in Docker and Linux environments
- Proficiency in Infrastructure as Code (IaC) tools, specifically Terraform, to automate provisioning and deployment processes
- Background in building and managing CI/CD pipelines with a focus on integrating secure and reliable deployment practices
- Knowledge of system monitoring and observability best practices, with skills in creating and maintaining real-time dashboards
- Familiarity with troubleshooting and resolving complex infrastructure issues across development and production environments
- English proficiency at a B1+ level
Nice to have
- Familiarity with Azure DevOps for managing development workflows and pipelines
- Skills in scripting or automation languages, particularly Groovy, to enhance operational efficiency
- Understanding of build and automation tools like Jenkins for continuous integration and delivery
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn