About the Role
Lead technical response during critical service incidents, ensuring swift recovery and minimal business disruption.
Build early-warning and real-time visibility using observability platforms and monitoring data.
Develop dashboards, alert thresholds, and recovery indicators for critical services and infrastructure.
Conduct structured root-cause analysis and drive permanent corrective actions to prevent recurrence.
Collaborate with Network, Platform, and Application teams to strengthen continuity measures and response readiness.
Implement automation-driven remediation steps, reducing manual resolution time and repetitive interventions.
Maintain and prioritise a continuity backlog focused on recurrence prevention and operational gaps.
Reduce false alarms, repeated disruptions, and reactive firefighting through proactive engineering practices.
Improve uptime, recovery time, fault prevention, and continuity metrics across S...