Site Reliability Engineer

Title: Site Reliability Engineer

Location: [Your City, Your State/Country]

About Us

[Company Name] is a fast-growing [industry/type] company committed to delivering high-quality, reliable, and scalable services. We pride ourselves on innovation and leveraging cutting-edge technology to provide world-class solutions to our customers. We are looking for dedicated individuals who are passionate about keeping services stable and optimized.

Position Overview

We are seeking an experienced Site Reliability Engineer (SRE) to join our engineering team. As an SRE, you will be responsible for ensuring the reliability, performance, and scalability of our systems and services. You will collaborate closely with development teams to improve system stability, automate repetitive tasks, and proactively address issues before they affect customers.

Key Responsibilities

Design, build, and maintain reliable infrastructure that supports the company’s production systems and services.
Develop and implement monitoring, alerting, and incident management processes to ensure system health and performance.
Collaborate with development and operations teams to improve the overall availability and performance of our services.
Identify and resolve performance bottlenecks, system failures, and issues related to scalability.
Automate repetitive tasks and processes, such as deployments, scaling, and monitoring, to improve efficiency and reduce human error.
Develop and maintain tools for continuous integration, automated testing, and continuous deployment.
Conduct root cause analysis of incidents and implement long-term solutions to prevent recurrence.
Maintain and enhance disaster recovery, backup, and failover strategies to ensure high availability and data integrity.
Stay up to date with the latest technologies and trends in DevOps, cloud computing, and system reliability.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field.
3-5 years of experience in a Site Reliability Engineering, DevOps, or similar role.
Strong knowledge of cloud platforms (e.g., AWS, Google Cloud, Azure) and experience with cloud infrastructure management.
Proficiency in programming or scripting languages (e.g., Python, Go, Bash) for automation and system management.
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Experience with CI/CD pipelines and automation tools (e.g., Jenkins, GitLab CI/CD, CircleCI).
Strong problem-solving skills and the ability to troubleshoot complex systems in real time.
Excellent communication and collaboration skills, with the ability to work cross-functionally with development, operations, and security teams.

Why Join Us

Opportunity to work on mission-critical systems and make a significant impact on the reliability and scalability of our services.
Collaborative and dynamic work environment.
Competitive salary and comprehensive benefits package.
Opportunities for professional development and career advancement.