Yes. I know. I committed to posting at least one post per week, and I’ve failed. However, I am going to try to make it up by posting two articles this week. First off, some simple thoughts on Amazon’s “HQ2”, fault tolerance, and risk management. Last year Amazon started the process to find a suitable location for a second headquarters that they are calling “HQ2”. This is a pretty unique scenario and it got me to thinking. While some news articles have called it a “marketing ploy” to attract tax incentives and offers from regions, I’m taking a different (albeit probably wrong) angle that while their choice to be so public about the selection process most likely is a marketing move, the underlying thought could be to introduce some redundancy and fault tolerance into their headquarters operations as a method of risk management. As Sancho Panzo said to Don Quixote: “…to retire is not to flee, and there is no wisdom in waiting when danger outweighs hope, and it is the part of wise men to preserve themselves to-day for to-morrow, and not risk all in one day;“ and I thought this could be a good time to talk about risk management in general.
According to NIST risk “…is the net negative impact of the exercise of a vulnerability, considering both the probability and the impact of occurrence. Risk management is the process of identifying risk, assessing risk, and taking steps to reduce risk to an acceptable level. ” which in human terms means that risks are bad things that can happen to your stuff and that to avoid or mitigate risk to an acceptable level, you want to identify your stuff (assets or operations), the bad things that could happen to that stuff (hazards or vulnerabilities), a scenario for that bad thing occurring, the probability of that scenario occurring, and finally the severity of that scenario playing out (what could happen, how likely is it to happen, and what is the final impact). Based on that information you can determine if the cost of a mitigation is worth the investment in that mitigation as the cost to mitigate a risk shouldn’t exceed the value of the asset itself. For example, let’s say a business is based in a wooden building (our asset). One potential hazard for that building is a fire and one potential scenario (there could be multiple) is a total loss of the building. If in our scenario we have an employee who likes to play with matches and lighter fluid, then we have a higher likelihood of that scenario occurring and this could have a catastrophic impact on our business (all our stuff got burnt up). However, if we believe that the investment in a fire suppression system is worth the expenditure, we could mitigate this risk by installing sprinklers (or just replace Jane Firebug, but that’s for HR to decide). Another potential mitigation could be to duplicate operations so that if we did have a fire (Jane didn’t get replaced because she wrote all the code for the companies primary product), we could resume operations in another location. This would deal with not just the risk of fire, but also other potential risks to the building. Granted this could be a much more expensive mitigation than sprinklers but it can also be more effective in terms of operational recovery, uptime, etc. Keep in mind that hazards aren’t just physical in nature. A hazard (or threat) could be that a vulnerability in software which could exploited by a malicious actor. While it’s a different type of hazard, the process for assessing the risk is the same: identify the threat, a scenario for that threat, it’s likelihood of occurring, and finally its impact. From there you can then determine the acceptable risk or mitigation.
If we consider that Amazon has said that it will be “…a complete headquarters for Amazon – not a satellite office. and that they are expecting to allow their staff and executives to be able to choose which headquarters they want to work at, move groups around, etc. then that sounds to me like a basic fault tolerance design. In 1976, Dr. Algirdas Avizienis wrote that “The most widely accepted definition of a fault-tolerant computing system is that it is a system which has the built-in capability(without external assistance) to preserve the continued correct execution of its programs and input/output (I/O) functions in the presence of a certain set of operational faults.” (sorry, the original paper is behind a paywall ?) That certainly sounds like the concept that Amazon is taking here doesn’t it? (ok, roughly taking here. This isn’t a doctoral thesis folks, it’s just me musing on a Saturday morning). And while duplication of a system as a method of mitigating risk is an expensive one, I think Amazon has the cash.
Also published on Medium.