I’m not keeping this up to data and all sort of awesome things happen on Azure like they are now allowing some of the stuff this talks about and have made some real advancements in performance so - THIS IS OLD don’t rely on it, hugs and kisses x x x
Azure Premium Managed Disk IO Latency varied between 2.7 milliseconds and 7.2 milliseconds on average. There were some really slow requests due to disabled anti-virus (windows defender) and some other unidentified driver issues but on the whole, Azure worked well and all 96% of requests were completed within 8 milliseconds.
What did I do?
I wanted to know how accurate short tests for a minute or so were, I also wanted to know how accurate perfmon was. The lowest time resolution in perfmon is 1 second and when issuing and measuring IO requests in milliseconds, 1 second is a long time (well 1000 milliseconds to be precise).
How well do those 1-second averages stand up to potentially thousands of requests per second?
How did I do it?
I used diskspd to generate a load of random reads, I throttled diskspd so we wouldn’t be hit by an Azure throttle. I just issued reads, random reads that would be spread out over the provisioned SSD, which as far as I knew could be 1 or a hundred physical disks.
I have recorded three sets of stats, I have the diskspd output, perfmon trace and an ETW trace of every single IO request over the 24 hours.
I understand that there are limitations to measuring IO from within the VM itself and these limitations show themselves time and time again in this post. It isn’t ideal and I would love to be able to record from the host but well, I can’t so let’s not cry about it for too long.
I disabled the cache on the drive so I would go back to the actual disk and I kept requests at 256K which is the magic number in Azure IO. If you get larger IO’s than 256K, then they cost more than one IO so you hit the throughput throttle sooner than you should.
I initially tried and failed to generate one massive 24-hour log file, I had one hilarious attempt at using procmon.exe to measure each IO, but that resulted in procmon.exe using 4 TB of virtual memory on a machine that had neither 4TB of RAM or swap space. I ended up settling on having 24 one-hour chunks of logs and combining them myself afterwards.
I captured the diskspd output, a perfmon with a resolution of 1-second and an ETW trace which covered every single IO request on the server. To gather the results, I filtered the ETW trace to just my disk and used wpaexporter to export the ETW traces as CSV files which I read in using python, created one massive (24-hour) dataset and then spat out a load of charts.
What are the results?
Most requests completed within 8 milliseconds, there were some regular periods of requests taking up to 500 milliseconds and one period, which I am pretty confident was caused by the anti-virus on the VM itself where requests took up to 3,000 milliseconds to complete.
If we exclude the slower requests then the majority of IO’s were completed between high 2 millisecond’s and low 7 milliseconds, the typical Microsoft advice is that requests take around 5-6 milliseconds, and if you just take then perfmon data then they do average out at about 5-6 milliseconds, but if you look at each underlying request you can see that the band of “normal” is more like 2.8 milliseconds to 7.2 milliseconds
What was the massive spike to 3,000 milliseconds?
This is the diskspd config I used:
diskspd.exe -b256K -d3600 -o32 -h -L -t1 -W -g31457 -w0 f:\data2.dat
The “F drive” is a single premium managed SSD in Azure with caching disabled.
The output from diskspd was:
|file||thread||bytes||I/Os||MiB/s||I/O per s||AvgLat||LatStdDev||file|
|%-ile||Read (ms)||Write (ms)||Total (ms)|
The second output table shows that we had some requests that took almost 3,000 milliseconds or 2.899 seconds to complete.
shows a single spike that goes up to 2.818 or 2818 milliseconds, if we zoom down to the minute or so when it happened we can see the spike goes to just under 3000 milliseconds:
First, we will add the current disk queue length counter to show us how diskspd is behaving, the way diskspd works in this configuration is that it allocates a set number of IO requests, in our case 32 (number of outstanding I/O operations: 32), then it posts the IO operations to windows using IO completion ports. As and when these IO’s complete diskspd receives an event to say that one of the completion ports has, well completed and so it then re-issues it to windows. Diskspd has a few small operations it performs before re-issuing the new request like checking whether it is within its own throttle limit and whether it should print a dot (.) to the console window, and then it re-issues the request.
If the current disk queue length stays low then it means that diskspd isn’t able to issue requests as fast as windows can complete them, if the queue goes up to 32 then diskspd can re-issue requests faster than diskspd can complete them, the ideal is a constant disk queue but a constant queue of 32, as in this case, shows that there is likely an issue.
Because we have applied a throttle to diskspd, the perfect case here is for diskspd to hit its own throttle and then the queue would be lower, and you can see this in the first part of:
Where until, about 16:53:45 the disk queue fluctuates but the “Disk Bytes/sec” stays consistent and the “Avg. Disk sec/Transfer” is low. After 16:53:45 we see the disk queue jump up to 32 so diskspd added 32 items to the queue but the IO requests on the queue were not completed.
If we had access to the hypervisor we could tell how long the request took to be sent from the VM to the storage and the reply received and we would know how long diskspd took so could tell where the issue was. We don’t have access to the hypervisor so will need a little guesswork to see what happened.
At this point we know, diskspd isn’t getting IO requests back to complete them, because the queue is staying high and not fluctuating, even slightly which diskspd would if it was draining and re-filling the queue. The latency of requests is increasing as measured by perfmon (“Avg. Disk sec/Transfer”) but that could include drivers blocking the IO requests. We also see that “Disk Bytes/sec” counter drops to almost nothing so we are unlikely being throttled outside of the VM, if that were the case we would still be seeing requests come back in, and the disk bytes/sec would stay consistent, but this is irregular and drops to nothing.
If we add “% Processor Time” we can see that there was a spike before and after our trouble spot:
We also see some reasonably nasty context switching:
I won’t take you process by process, but I found out which process decided to activate itself and make itself busy:
This is particularly frustrating as the windows defender anti-virus was disabled, I should have uninstalled it. If this was reproducible I could trace the system and see if MsMpEng had paused our requests but the fact that it was busy before and after our slowdown and if we zoom out, that 10 seconds is the only time that MsMpEng had any activity at all (including all 24 perfmon log files):
I would guess (no access to the hypervisor remember) that when scanning was disabled on a system, Windows defender still had a filesystem driver who received each IO event, looked at it, decided it was disabled and chose to ignore the event and pass it onto the next filter driver. I suspect that MsMpEng.exe asked the filter driver to do something which caused the driver to temporarily stop ignoring our requests and held onto them until it had finished doing whatever it was doing and could happily go back to ignoring our requests and passing them onto the next filter driver. In this instance, I don’t believe that it is fair to blame azure for me not uninstalling windows defender rather than just trusting that scanning being disabled would stop it interfering with my tests.
So what about those regular 500-millisecond spikes?
These are harder, and I wish I were able to measure the time to VM rather than the time to VM plus filter drivers, the answer is that I just don’t know.
If we look at the first hour we had 431,998 total reads, broken into bins we see:
longest request: 589,868 microseconds, or 589 milliseconds
- requests > 6 milliseconds = 28,228 = 6.53% wow
- requests > 10 milliseconds = 21,570
- requests > 15 milliseconds = 16,658
- requests > 20 milliseconds = 12,050
- requests > 25 milliseconds = 6,759
- requests > 50 milliseconds = 1,138
If we look at the screenshot:
We can see that the length of time perfmon recorded for the IO requests increased and then the disk queue length decreased to zero. The way diskspd works is that it issues some requests and as the requests are completed, it does some accounting and then issues a new request. If one request is completed, one request is re-issued, if ten requests compete, ten requests are re-issued. For the disk queue to drop to zero, one of these must-have occurred:
- diskspd received the notification and did not issue a new request
- diskspd did not receive the notification, but the request did complete (otherwise the queue wouldn’t have gone down)
- the VM received the completed IO but the notification was not passed to diskspd
So the Azure infrastructure must have been good, otherwise, the queue would have stayed higher.
If we accept that a higher response time, plus a zero length disk queue means that we have an issue with a driver on the VM or a problem with diskspd. It would be nice to say that if this was true, the opposite would also be true but as we have seen with the 3000-millisecond spike, we could have an item in the queue being held up by a driver so even with a high queue, it doesn’t point conclusively to Azure.
If we do look for high queues and high response times, I don’t see this in many cases, certainly not the fairly regular 500-millisecond spikes but I do see a few spikes around 300 milliseconds, take a look at this:
Here we see that the average disk transfer increases, the disk queue increases to the limit that diskspd has, so diskspd is able to issue all the requests it needs to, this happens for a little while and then everything snaps back into life. If we take a detailed look at the requests in that time we can see that there are only a few outliers in that time, most requests complete quickly, just a couple are held up which blows the average.
So, in summary, was perfmon accurate and how consistent were the IO requests over the Azure infrastructure to the premium managed disk?
Is perfmon accurate?
Not really, the averages were actually between around 2.8 milliseconds and 7.2 milliseconds whereas perfmon reports the average around 5-6 milliseconds.
How consistent were the IO requests?
From the Azure infrastructure point of view, they were pretty consistent, there are quite a few issues with running Windows in a VM in Azure if you care about consistent performance such as disabled anti-virus software causing you 3000-millisecond delays and fairly regular 500-millisecond delays. Apart from the odd 300-millisecond delay which could potentially be down to the Azure infrastructure, I would say it was pretty reliable.
Out of a total of 9,935,896 requests, 9,501,714 finished in less than 8 milliseconds, that is 96% of requests in less than 8 milliseconds and by rights that number should be higher as we know that the tail end of the requests were not caused by the Azure infrastructure but by the drivers and software on the VM.
What would I do next time?
I think that tracing for 60 seconds misses lots, tracing for 24 hours didn’t really capture anything that tracing for 60 minutes didn’t capture.
From now on my tests will be around an hour and I will uninstall anti-virus.
I am particularly excited by the news that azure have a new ssd offering that includes sub-millisecond latency and I will be testing with that when it is available. I would also like to see if I can get a stripped back windows, possibly nano running where I can see if I can smooth those peaks to see if we can make the disk access more reliable.