Site Reliability Engineer | $40-$70/hr Remote
Overview
This Site Reliability Engineer role involves deploying, monitoring, and recovering containerized AI training environments. You will use advanced terminal techniques to manage infrastructure, automate tasks, and keep systems stable. The work directly supports training next-generation AI systems and draws on your domain expertise; prior AI experience is not required.
What You'll Do
- Lead deployment, monitoring, and recovery of containerized AI training environments using advanced terminal techniques
- Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes
- Orchestrate resilient system builds and infrastructure management to ensure stability and optimal resource utilization
- Collaborate with engineering teams to refine CI/CD pipelines and automate routine operational tasks
- Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes
- Conduct rapid mid-execution replanning during error states and unforeseen runtime issues
- Document best practices and emergent solutions, and contribute to knowledge transfer across the team
Requirements
- Demonstrated expert proficiency with terminal-based problem solving and complex system administration
- Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration
- Strong Python skills for scripting, automation, and debugging production systems
- Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, or C/C++
- Experience with build systems, package managers, databases, version control, and cryptography tools
Who Should Apply
This role is ideal for an experienced SRE with deep terminal skills and container orchestration expertise. You should be comfortable with dynamic infrastructure recovery, long-running process management, and scripting in Python. A background in MLOps or AI infrastructure is a plus but not required.
Salary Insight
$40-$70 per hour (contract position).
Application Tip
Highlight your experience with terminal-based problem solving and containerized environments (Docker/Kubernetes) by providing specific examples of complex infrastructure recovery scenarios you've handled.