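So, Netflix has made a program called "Chaos Monkey", which randomly disables servers in their production environment. They then 'upgraded' it with a set of related programs they call the Simian Army, each of which causes or detects a different kind of problem (Latency Monkey, for example, injects artificial delays).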
The goal of the Simian Army is to answer three questions:
- First, can your infrastructure handle losing $thing? If not, this is REALLY important to fix.
- Second, do the users even NOTICE that $thing went down, even if the service continued to function (perhaps in a degraded state)? If they DO notice, then you need to add more of $thing, or increase redundancy.
- Third, did your team receive a monitoring alert that $thing went down? If not, then you need to step up your monitoring game.
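To make the three questions concrete, here is a minimal sketch of one "chaos round" against a toy model of a service. This is not Netflix's code or the Chaos Monkey API; the `Cluster` class, its thresholds (`min_healthy`, `min_unnoticed`), and the `chaos_round` function are all hypothetical names invented for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a redundant service. Every name here is hypothetical."""
    servers: set
    min_healthy: int    # servers needed for the service to function at all
    min_unnoticed: int  # servers needed for users not to notice degradation
    alerts: list = field(default_factory=list)
    monitoring_enabled: bool = True

    def disable(self, server):
        """Take a server out of rotation and record an alert if monitored."""
        self.servers.discard(server)
        if self.monitoring_enabled:
            self.alerts.append(server)


def chaos_round(cluster, rng):
    """Disable one random server, then answer the three questions."""
    victim = rng.choice(sorted(cluster.servers))
    cluster.disable(victim)
    remaining = len(cluster.servers)
    return {
        "victim": victim,
        "survived": remaining >= cluster.min_healthy,       # question 1
        "users_noticed": remaining < cluster.min_unnoticed,  # question 2
        "alert_received": victim in cluster.alerts,          # question 3
    }


if __name__ == "__main__":
    cluster = Cluster(
        servers={"web-1", "web-2", "web-3"},
        min_healthy=1,
        min_unnoticed=3,
    )
    print(chaos_round(cluster, random.Random()))
```

In this toy setup the service survives the loss, but users notice because all three servers were needed for full performance, so the answer to the second question says to add capacity. A real experiment would replace the thresholds with actual health checks, user-facing metrics, and your alerting system.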
So, if you were to have a Chaos Monkey / Simian Army on your network, what would you task it to do?