Site Reliability Engineer | Remote
Overview
In this Site Reliability Engineer role, you will apply domain expertise to help train next-generation AI systems by designing and maintaining scalable infrastructure. The engineer will monitor system health, automate operations, respond to incidents, and collaborate with development teams to ensure high availability. No prior AI experience is required, only deep knowledge of Linux, Kubernetes, and Prometheus.
What You'll Do
- Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus.
- Monitor system health, analyze performance metrics, and proactively address bottlenecks or potential failures.
- Automate operational processes to minimize manual intervention and increase system reliability.
- Respond swiftly to incidents, conduct root cause analysis, and drive continuous improvements in incident response procedures.
- Collaborate with development and operations teams to deliver seamless deployments and high system availability.
- Create comprehensive documentation and clear runbooks for operational excellence and knowledge sharing.
- Champion best practices in SRE, security, and compliance across the customer's ecosystem.
Requirements
- Expert-level hands-on experience with Linux system administration and troubleshooting.
- Advanced proficiency with Kubernetes, including cluster deployment, operations, and management.
- Deep knowledge of Prometheus for monitoring, metrics collection, and alerting.
- Strong scripting abilities (Bash, Python, or similar) for automation and tooling.
- Excellent written and verbal communication skills, with the ability to document and share knowledge effectively.
- Proven track record in site reliability engineering or similar roles in high-availability environments.
- Demonstrated commitment to proactive problem-solving and collaborative teamwork.
Who Should Apply
The ideal candidate is an expert in Linux, Kubernetes, and Prometheus with strong scripting skills, who has a proven track record in high-availability environments. They thrive on proactive problem-solving, enjoy automating operational tasks, and are eager to bring their domain knowledge to help train AI systems. No prior AI experience is needed, but a collaborative mindset and excellent communication skills are essential.
Salary Insight
Compensation not specified; this is a contractor position.
Application Tip
Highlight specific examples of how you've used Kubernetes and Prometheus to improve system reliability and automate incident response in previous roles.