I work for a large healthcare organization -- several hospital buildings, many clinics, and other ancillary sites. When I started, the network was a basic L2/L3 design: L2 access with gateways on the cores. We migrated to L3 access, then multiple routing tables, and finally a large enterprise MPLS environment. We built a second DC six years ago and now find ourselves in TLA and FLA hell -- STP, EIGRP, OSPF, BGP, MPLS, LDP, vPC, LACP, OTV, etc. Soon we'll add EVPN and VXLAN to the list. And we're going multi-vendor.
With this added complexity, more human errors occur -- and not just on our team. Our division director purchased several copies of Dr. Atul Gawande's The Checklist Manifesto. He wants us using checklists for routine work and troubleshooting. Checklists have saved my bacon many times, but others still resist them. We troubleshot a problem with a new site the other day; a NAT had been missed. That's something ripe for a "new site" checklist.
I'm not looking for examples of checklists. But I want to know: do you use checklists? Where do you use them most (installation, troubleshooting)? Do you (or your team) write your own? How do you convince team members to use them? Do you review them regularly? How do you account for team members of different skill levels? (We are all seniors, but some specialize in BGP, firewalls, MPLS, etc.)
My position has evolved into a QA/QI role. I see opportunities for checklists, but I want to target the problem areas first, then figure out how to get my team using them -- especially in a crisis. I want to know how other large networking environments use them.