Azure has two APIs which, on the surface, do very similar things: you give the API a source, and it copies the data into a destination (blob) storage account.
Copy Blob
Copy Blob can only copy from one storage account to another, but it can copy blobs or files. The Copy Blob process is asynchronous, so when you make the call you get a status and a request id back; you then need to poll to check on the status of your copy operation.
The response to a Copy Blob request includes these two headers:

- x-ms-copy-id: to find out the status of the request you pass this to Get Blob Properties; you can also abort the request using Abort Copy Blob
- x-ms-copy-status: this shows whether the status is pending. I have only ever seen pending, or no response when an error was returned. The API documentation says that you can also get a copy status of success, which means the copy has already completed.
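As a rough sketch of reading those two headers, assuming the HTTP call itself has already been made (the helper name and the example values are my own invention; only the header names come from the REST API):

```python
# Minimal sketch: pull the copy id and initial status out of the headers
# returned by a Copy Blob request (a PUT on the destination blob URL with
# x-ms-copy-source set). The actual HTTP call is elided.

def read_copy_response(response_headers: dict) -> tuple[str, str]:
    """Return the copy id (used to track or abort the copy) and the status."""
    copy_id = response_headers["x-ms-copy-id"]
    copy_status = response_headers["x-ms-copy-status"]  # usually "pending"
    return copy_id, copy_status

# Hypothetical headers, shaped like what Azure returns:
headers = {
    "x-ms-copy-id": "045f9025-fake-id",  # opaque id you keep for polling
    "x-ms-copy-status": "pending",
}
copy_id, status = read_copy_response(headers)
print(copy_id, status)
```

You hold on to the copy id; it is the only way to tell later whether a copy you observe on the blob is yours.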
If you get a copy status of pending you can call Get Blob Properties, and there are some interesting headers returned:

- x-ms-copy-id: this tells you whether the current copy operation, or the last completed one, is the copy id that was returned from your Copy Blob request. If this is empty, then your request either completed or was aborted, and something else such as a Put Block List (we'll come back to this later) has happened since your request stopped. There is no way to tell whether your request completed successfully or not. If something else is busy doing something, then better leave them to it :)
- x-ms-copy-progress: if a Copy Blob is in progress, this tells you how far you have left to go
- x-ms-copy-source: the source URL for the copy
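A sketch of how you might interpret those headers when polling; the classification logic and the names are mine, only the header names come from the API:

```python
def check_copy(props_headers: dict, my_copy_id: str) -> str:
    """Classify the state of our copy from Get Blob Properties headers.

    Returns one of: "ours-pending", "ours-done", "superseded".
    """
    current_id = props_headers.get("x-ms-copy-id", "")
    if current_id != my_copy_id:
        # Someone else's Copy Blob (or a Put Block / Put Block List) has
        # happened since; we can no longer tell whether our copy succeeded.
        return "superseded"
    if props_headers.get("x-ms-copy-status") == "pending":
        # e.g. "1024/4096": bytes copied so far out of total
        print("still copying:", props_headers.get("x-ms-copy-progress"))
        return "ours-pending"
    return "ours-done"
```

The "superseded" branch is exactly the awkward case described below: once another operation touches the blob, your history is gone.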
The thing with the Copy Blob call is that someone can abort your request, or it can complete successfully but then someone can either start a new Copy Blob request, wiping out your history, or use a Put Block or Put Block List to make a change which, again, wipes out your history.
If you were designing a solution to copy blobs around, you would undoubtedly want to know which requests should be allowed to complete and which should be aborted. We get some help because, along with the copy information, Get Blob Properties includes the MD5 of the file, so if we want to ensure a blob replicated successfully we can check whether it has the value we expect, and do something about it if not.
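For example, a minimal MD5 check might look like this sketch; it assumes you have the expected bytes (or at least their digest) on hand, and the helper name is hypothetical. The Content-MD5 header carries a base64-encoded digest:

```python
import base64
import hashlib

def md5_matches(content_md5_header: str, expected_bytes: bytes) -> bool:
    """Compare the base64 Content-MD5 value from Get Blob Properties
    against the MD5 of the bytes we expected to be copied."""
    digest = hashlib.md5(expected_bytes).digest()
    return base64.b64encode(digest).decode() == content_md5_header
```

If the check fails you know the blob does not contain what you expected, regardless of who won the copy race, and you can re-issue the copy.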
If you need to monitor lots of Copy Blob requests, bear in mind that each Get Blob Properties call counts as one transaction, so polling in a loop every few milliseconds might get you rate limited.
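One way to keep the transaction count down is to back off between polls. This is a generic sketch, not anything from the Azure SDKs; `get_status` stands in for whatever function wraps your Get Blob Properties call:

```python
import time

def poll_with_backoff(get_status, first_delay=1.0, max_delay=30.0, factor=2.0):
    """Poll get_status() until it returns something other than "pending",
    doubling the delay each time so that each billed Get Blob Properties
    transaction is spaced out rather than fired in a tight loop."""
    delay = first_delay
    while True:
        status = get_status()
        if status != "pending":
            return status
        time.sleep(delay)
        delay = min(delay * factor, max_delay)
```

For long copies this converges on one transaction every `max_delay` seconds per blob, which is much friendlier to both your bill and the rate limiter.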
When it comes to security with Copy Blob there are some great options: you can use a SAS key on the URL for either the source or the destination, or if you are copying a blob within a storage account you can use a shared key. I have used the account key for the source and a SAS key for the destination, but ideally I would use a SAS key for both the source and the destination.
Put Block From URL
Put Block From URL wins the award for the driest-sounding storage API, but it is pretty cool. You can give it a source URL (hint: it could be an S3 bucket or a web page somewhere), and it will copy the data into blob storage in your storage account. Wowsers, I'm in love :)
Because the source for Put Block From URL can be any URL, a blob in a storage account can be the source!
To use this API you send a PUT request with an x-ms-copy-source header set to the URL to copy data from. If the source is in a storage account, then it either needs to be available anonymously or you include the SAS key in the URL.
When you send the request you can also include an optional timeout in the querystring, e.g. timeout=20. The timeout is in seconds, and it is important because the request is synchronous. I haven't tried using this with a large file that takes a long time to copy, and I wouldn't expect the TCP/HTTP channel to stay up for days, for example. That being said, the maximum blob size is 4.75 TB, so as long as the source system isn't terribly slow, you should get the data within a reasonable amount of time.
If you are copying small blobs then a synchronous call could work well for you; nodejs would be happy with it.
Once your request finishes you are not quite done yet: the blocks copied from the URL will not be committed until you (or some other kind soul) call Put Block List on the new blob. Once the Put Block List completes, you can go home for the day and tell your family how exciting the Azure storage APIs are.
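To make the two-step flow concrete, here is a sketch of the pieces you need around Put Block From URL: base64-encoded block ids, and the XML body for the Put Block List call that commits them. The helper names and the chunk-numbering scheme are mine; the query parameters and headers in the comments come from the REST docs:

```python
import base64

def block_id(index: int) -> str:
    """Block ids must be base64-encoded, and the same length for every
    block in the blob, hence the zero-padded index."""
    return base64.b64encode(f"block-{index:06d}".encode()).decode()

def block_list_body(block_ids: list) -> str:
    """Build the XML body for the Put Block List call that commits
    the staged blocks in order."""
    latest = "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{latest}</BlockList>'

# Sketch of the two-step flow (HTTP calls elided):
# 1. For each chunk:  PUT <dest-url>?comp=block&blockid=<block_id(i)>
#                     with header x-ms-copy-source: <source-url>
#                     (and x-ms-source-range to pick a byte range)
# 2. Commit:          PUT <dest-url>?comp=blocklist
#                     with body block_list_body([block_id(0), ...])
```

Until step 2 runs, the staged blocks exist but the blob's committed content is unchanged; that is the "do this, now commit it" dance.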
Which one is better?
Copy Blob, because it feels more logical to me: you create a request and then you wait for it to finish. With Put Block From URL you have a lot of flexibility and can get data from anywhere, but I find the whole "do this", "now commit it" dance a little onerous; give me a single operation and I'm sold.
It is worth pointing out that not all of the SDKs have Put Block From URL; the .NET SDK only added it recently (https://github.com/Azure/azure-storage-net/blob/master/changelog.txt), in version 9.3.0. The nodejs SDK doesn't have it, and who knows what the Java SDK has; we can't get past the BlobServiceAbstractFactoryFactory to know what the hell it does!
Put Block From URL https://docs.microsoft.com/en-us/rest/api/storageservices/put-block-from-url
Get Blob Properties https://docs.microsoft.com/en-us/rest/api/storageservices/get-blob-properties
Introducing the async copy blob (way back in 2012!) https://blogs.msdn.microsoft.com/windowsazurestorage/2012/06/12/introducing-asynchronous-cross-account-copy-blob/