GRS (geo-redundant storage) is a setting you can apply to a storage account in Azure that means any data you commit to storage is replicated, behind the scenes, to a second Azure region.
The idea behind this is that if Azure loses an entire region, Microsoft will fail over to the partner region where you can get to your data and carry on as before.
The catch is that for the secondary GRS site to become active, Microsoft would have to lose an entire region, including multiple data centres, and then take long enough to decide that none of those data centres can be recovered before failing over to the secondary region.
The issues with this approach are:
- Microsoft does sometimes suffer downtime lasting multiple hours or days, and they don’t see that as sufficient reason to fail over. This reddit thread shows that users can and do lose access to GRS storage accounts: https://www.reddit.com/r/AZURE/comments/8sd5z8/microsoft_azure_north_europe_is_down_for_4_hours/
- Microsoft chooses the partner region. What if the secondary region paired with yours doesn’t suit you? What if your own standby infrastructure is deployed to a different region? When the GRS failover completes, how long will a manual copy to your actual secondary region take?
- Azure often suffers from capacity issues (https://www.reddit.com/r/AZURE/comments/8p3awd/azure_us_east_sorry_clouds_full/). If Microsoft loses an entire region, will the sister region be able to support the capacity from the failed region? Would you wait until Microsoft declares the region gone and then struggle with everyone else to restore?
- If Microsoft loses an entire region, it is more likely to be a technical “glitch” than an environmental issue, so what are the chances of Microsoft losing just one region rather than all of them, or at least both your primary and the Microsoft-chosen secondary?
GRS is good if you need to KNOW that your data is recoverable, but if you need that data quickly, then it is not a valid solution.
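To put a rough number on the "how long will a manual copy take" question, here is a back-of-the-envelope sketch. The data size and throughput figures are made-up assumptions, not measured Azure numbers, and the function name is my own for illustration. It also assumes decimal terabytes and a perfectly sustained transfer rate, which real copies rarely manage.

```python
def copy_time_hours(data_tb: float, throughput_gbps: float) -> float:
    """Estimate hours needed to copy data_tb terabytes at throughput_gbps gigabits/second.

    Assumptions (illustrative only): 1 TB = 10**12 bytes, and the
    throughput is sustained for the whole copy with no throttling or retries.
    """
    if data_tb < 0 or throughput_gbps <= 0:
        raise ValueError("data_tb must be >= 0 and throughput_gbps must be > 0")
    total_bits = data_tb * 1e12 * 8            # terabytes -> bits
    seconds = total_bits / (throughput_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB of data at two assumed sustained rates.
    for gbps in (1, 5):
        print(f"10 TB at {gbps} Gbps: about {copy_time_hours(10, gbps):.1f} hours")
```

Even with these optimistic assumptions, 10 TB at a sustained 1 Gbps is close to a full day of copying before you can start serving from your own secondary.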
So what about RA-GRS?
RA-GRS is better: on top of GRS you get a read-only secondary endpoint you can read from at any time. The issues I have with RA-GRS are:
- Microsoft chooses where the secondary region is and if it doesn’t happen to be your secondary region then bad luck.
- It isn’t possible to make the secondary read/write, so if you decided to fail over yourselves and use it, you would need to copy the data into a third storage account. If you have TBs of data, how long will that take?
- If Microsoft loses all regions, which is more likely than losing one region, any data in Azure is inaccessible.
If you need to be able to control where and when you can access your data, then GRS isn’t for you and likely only gives you a false sense of security.
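For reference, the read-only RA-GRS secondary is reached by adding `-secondary` to the account name in the storage endpoint hostname. Here is a minimal sketch of building that URL, assuming the public Azure blob suffix `blob.core.windows.net`. The helper name and the account name are hypothetical, and other clouds or services use different suffixes.

```python
import re

# Storage account names are 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"^[a-z0-9]{3,24}$")


def secondary_blob_endpoint(account_name: str,
                            suffix: str = "blob.core.windows.net") -> str:
    """Return the read-only secondary blob endpoint for an RA-GRS account.

    Assumes the standard "<account>-secondary.<suffix>" naming convention;
    the default suffix is the public Azure one and is an assumption here.
    """
    if not _ACCOUNT_NAME.match(account_name):
        raise ValueError(f"not a valid storage account name: {account_name!r}")
    return f"https://{account_name}-secondary.{suffix}"


if __name__ == "__main__":
    # "mystorageaccount" is a placeholder, not a real account.
    print(secondary_blob_endpoint("mystorageaccount"))
```

Reads against that hostname work at any time, but writes are rejected, which is why the copy to a third account is needed if you want to carry on working from the secondary yourself.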
Why is losing everything in Azure more likely than one region?
There is a distinction between losing a region for a few hours or days and losing it permanently. For Microsoft to fail over a region and make the replicated GRS copy active, they have to deem the region unrecoverable.
A region in Azure is redundant - they have multiple data centres, each data centre has multiple routes and clusters etc. For a data centre to be unrecoverable there would likely have to be a fire that destroyed everything in it, a physical event. If there were a physical event that destroyed a single data centre, Microsoft wouldn’t perform a region failover because they have redundant facilities in each region.
For every data centre in a region to fail, there would need to be either a physical event that destroys an entire city or a political event. If, for example, a meteorite hit North Europe and was catastrophic enough to destroy an entire city or set of data centres, then, given that Holland and Ireland are not that far apart, what are the chances that Microsoft would have anywhere left to fail over to? If there was a political event, a war for example, and the government of the host nation shut down access to the data centres, then regions whose partner is within the same country (East US + West US) are in trouble because they would lose all data centres in that country.
If we ignore physical or geopolitical events for a moment and look at software bugs, I am not sure it would be possible for a deployment to be rolled out that made an entire region permanently unrecoverable. If a new update were rolled out that broke a data centre, I think Microsoft would likely stop the rollout to other data centres, let alone other regions. If the bug was dormant for a period of time, then I think the worst-case scenario is they send someone round with a USB disk and re-flash a load (well lots and lots and lots) of hardware - pretty bad but doable and recoverable from.
So why is a global failure more likely? For a region to become permanently unavailable is almost unthinkable. At the global level we can ignore every data centre being hit by a physical event, because if that happened no one is going to care about their funny cat pictures. Instead we start to look at Microsoft itself. What happens if it is found that they copied Linux source code into Windows, and a judge issues a “cease and desist” that shuts down their data centres? Or, in their rush to be the best cloud provider, the board is found to have been accounting fraudulently, they end up filing for bankruptcy protection, and the auditors start shutting down their systems to save money? I know these are unlikely, but something like this would have to be the cause of all Azure data centres being unavailable, and it is hopefully more likely than a massive physical event that destroys countries.
“What to do if an Azure Storage outage occurs”:
“Geo-redundant storage (GRS): Cross-regional replication for Azure Storage”:
Thanks to mtjerneld https://www.reddit.com/user/mtjerneld on reddit for pointing out that although Microsoft chooses the partner region, the pairings are well defined:
Thanks to dweinst https://www.reddit.com/user/dweinst on reddit for asking to clarify why losing everything is more likely than just a region
I’m really enjoying the reddit feedback system for this!