For about two years I have been talking about IO limits in Azure in terms of two buckets, or channels: the first is the cached bucket and the second is the uncached bucket. When you choose a VM size you get a throughput limit for the VM; ignoring standard VMs and looking only at premium VMs, you get this cached/uncached limit.
The documentation quotes these two limits as IOPS and MB/sec. They aren't always proportional to each other; some of the pairings seem wonderfully weird and random.
The way it works is that you have these two channels, and if nothing is served from cache you get a combination of both buckets, so you can add any number of disks and use the two buckets together to get the best peak performance.
If anything is served from cache then it doesn't count towards your throttle limit, so if you have something like SQL Server that writes and then reads straight back, you could get even higher than the combination of the two throttles.
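A toy model of how I believe the accounting works (this is my reading from testing, not anything documented, and the function and numbers are purely illustrative):

```python
# Toy model of the two-bucket throttle as I understand it.
# Cache hits bypass the throttle entirely; cache misses are charged to the
# cached bucket; disks with caching disabled are charged to the uncached
# bucket. All figures are MB/sec and illustrative only.

def peak_throughput(cached_limit, uncached_limit, cache_hit_mbps):
    """Best-case observable throughput when both buckets are saturated
    and some reads are also being served straight from the cache."""
    return cache_hit_mbps + cached_limit + uncached_limit

# DS_1_v2-style limits: 32 MB/sec cached, 48 MB/sec uncached.
# With no cache hits, the ceiling is the sum of the two buckets:
print(peak_throughput(32, 48, 0))    # 80
# Anything served from cache comes on top of that:
print(peak_throughput(32, 48, 100))  # 180
```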
Well, that is what I think anyway, from running lots of tests and from monitoring a SQL Server running on IaaS that uses a lot of IO. The problem is that none of the documentation from Microsoft says this, so I woke up yesterday in a bit of a mad panic: what if my tests were wrong? What if the max throughput is just the uncached throughput, and I was using the cache without ever realising it?
Time to spin up the Azure portal and get some disks attached to a VM for some testing :)
What did I do?
I spun up a small VM that wasn't ridiculous: a DS_1_v2, which (per the Azure documentation) has these IO limits: 4,000 IOPS / 32 MB/sec for the cached bucket and 3,200 IOPS / 48 MB/sec for the uncached bucket.
The thing I don't quite get is why the cached bucket allows higher IOPS but lower throughput.
So, let's first define the questions:
- What is the max throughput we can get on the VM, is it 32 MB/sec, 48 MB/sec or higher when we bypass the cache?
- Can we get higher throughput than 32 MB/sec if we get everything from cache?
What is the max throughput we can get on the VM, is it 32 MB/sec, 48 MB/sec or higher when we bypass the cache?
So, the first question: if we have an application that absolutely cannot use the cache, or one that wants to use the cache but keeps missing it, what is the max throughput we can get?
For the first test I used two instances of diskspd, disabled all caching, and throttled each instance so it didn't try to exceed its own bucket's throttle. The F drive is on the cached bucket (32 MB/sec throttle) and the G drive is on the uncached bucket (48 MB/sec throttle).
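One wrinkle with throttling diskspd: its `-g<n>` flag takes the per-thread limit in bytes per millisecond rather than MB/sec (so ~32 MB/sec is not `-g32`). Exactly how the throttle was applied here is my assumption, but the conversion arithmetic looks like this:

```python
# diskspd's -g<n> throttle is specified in bytes per millisecond, not
# MB/sec, so the bucket limits need converting first.
# (Flag semantics per the DiskSpd docs; treating 1 MB as 1024*1024 bytes.)

def mbps_to_diskspd_g(mb_per_sec):
    return int(mb_per_sec * 1024 * 1024 / 1000)

print(mbps_to_diskspd_g(32))  # 33554 -> roughly "-g33554" for the cached bucket
print(mbps_to_diskspd_g(48))  # 50331 -> roughly "-g50331" for the uncached bucket
```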
The total throughput across the two drives is ~60 MB/sec, which is much higher than the higher of the two throttles (48 MB/sec). This means that on this VM, and on every VM I have tested (GS, DS, and M series), the throttle is a combined throttle: you can use both buckets at the same time and get higher throughput than Microsoft suggest, as long as you can allow some of your traffic to bypass the cache. Those requests will be slower, but they will be consistent (as consistent as Azure gets).
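The arithmetic behind that claim, as a sketch (the observed figure is from my test run; the limits are the documented DS_1_v2 buckets):

```python
# If the throttle were a single VM-wide limit, the observed total could
# never exceed the larger of the two buckets. The ~60 MB/sec I measured
# sits above that, but below the sum of both buckets, which is what a
# combined (per-bucket) throttle predicts.

cached_limit_mbps = 32
uncached_limit_mbps = 48
observed_total_mbps = 60  # from the two-instance diskspd run

assert observed_total_mbps > max(cached_limit_mbps, uncached_limit_mbps)
assert observed_total_mbps <= cached_limit_mbps + uncached_limit_mbps
print("consistent with a combined per-bucket throttle")
```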
What do Microsoft say?
They don't say this; they say that there is one limit, which is the uncached limit: https://blogs.technet.microsoft.com/xiangwu/2017/05/14/azure-vm-storage-performance-and-throttling-demystify/
In that post, the author used diskspd to mix cached and uncached disks to get to the uncached throttle limit:
- diskspd -b256k -d30 -o10 -t1 -Sh -w100 -L -Z1G -c2G e:\io.tst.bin
- diskspd -b256k -d30 -o10 -t1 -Sh -w100 -L -Z1G -c2G f:\io.tst.bin
- diskspd -b256k -d30 -o10 -t1 -Sh -w100 -L -Z1G -c2G g:\io.tst.bin
- diskspd -b256k -d30 -o10 -t1 -Sh -w100 -L -Z1G -c2G h:\io.tst.bin
- E: => cached 110M
- F: => cached 109M
- G: => uncached 80M
- H: => uncached 87M
- VM uncached limit 384M (110+109+80+87 = 386)
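It's worth sanity-checking the arithmetic in that post: the four per-drive figures actually sum to slightly more than the stated 384 MB/sec VM limit.

```python
# Per-drive throughputs (MB/sec) as reported in the linked post.
drive_mbps = {"E": 110, "F": 109, "G": 80, "H": 87}
vm_uncached_limit_mbps = 384

total = sum(drive_mbps.values())
print(total)  # 386, i.e. a couple of MB/sec over the quoted 384 limit
```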
This is not what I get - I smash straight past the uncached VM limit, even with caching disabled.
How about the official documentation?
Each high scale VM size also has a specific throughput limit that it can sustain. For example, a Standard GS5 VM has a maximum throughput of 2,000 MB per second.
But the GS 5 has an uncached limit of 2,000 MB/sec and a cached throughput of 1,800 MB/sec when it misses the cache. I have got perfmon charts showing this somewhere.
I have gone back and checked, and with caching disabled I can get higher than the documented max throughput for the VM, so it is a combined throughput.
That being said, if this behaviour is undocumented then it could change, so maybe there will be a patch one day that slows the max throughput of disks back down. I hope not :)
Can we get higher throughput than 32 MB/sec if we get everything from cache?
I thought serving from cache would be good, but I didn't realise how good. I changed the test so that I just sent reads at that magic Azure IO size of 256KB; I didn't throttle the requests, I just pushed the turbo button.
With caching enabled and a file small enough to fit in the DS_1_v2 cache of 43GB, I got a throughput rate of over 2,000 MB/sec - WOW!
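To put that in perspective, here is a rough back-of-the-envelope comparison of scanning a file at the cached rate I measured versus the 32 MB/sec cached-bucket throttle (the 10 GB file size is just an illustrative number):

```python
# Time to read a hypothetical 10 GB file (10,240 MB) at each rate.
file_mb = 10 * 1024

cached_measured_mbps = 2000  # what I saw when served from the local cache
throttled_mbps = 32          # the DS_1_v2 cached-bucket throttle

print(f"from cache: {file_mb / cached_measured_mbps:.1f} sec")  # ~5.1 sec
print(f"throttled:  {file_mb / throttled_mbps:.1f} sec")        # ~320.0 sec
```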
The latency was also pretty outstanding with an average of 3 milliseconds.
If you have an application that needs to read data that sits in the local cache that every VM in Azure has, then you can get some pretty stunning figures from a machine that costs about 9p an hour.