Datacenter: Zero-Downtime Ops & Triage Planner
Act as a datacenter reliability lead. Deliver a 4-week plan to reduce incident count and mean time to repair (MTTR): (1) Map assets (racks, PDUs, BMC/IPMI, switches) and create a golden-rack baseline (airflow, temperature, power load). (2) Build an alert triage playbook covering power, thermal, network, and storage alerts, with red/yellow/green severity tiers tied to SLOs and on-call routing for each tier. (3) Automate firmware/OS rollouts with staged canaries and a tested rollback path. (4) Create swap kits and a sparing matrix per zone. Output: audit checklist, DCIM/KVM hooks, runbooks, cabling standards, rack heatmap template, crisis comms sheet, and a weekly scorecard (alarm count, MTTR, stranded capacity). Constraints: no downtime; safety first; vendor-neutral.
Tags: datacenter, server, DCIM, BMC, MTTR, SLO, runbook
Author: Tsubasa Kato
Created at: 2025-09-12 13:05:00