Full-time
Senior Data Platform Reliability Engineer
Posted by OpsWerks • Philippines
About the Role
Responsibilities
- Run managed services, not just systems. Operate multi‑tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
- Be the face of reliability. Lead incidents end‑to‑end, own customer comms and post‑incident reviews (RCA with actions customers can see and feel).
- Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time‑to‑data, and optimize costs, so customers notice faster pipelines and fewer surprises.
- Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
- Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
- Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cu...
Ready to Apply?
Submit your application today and take the next step in your career journey with OpsWerks.