Azure Blob Storage: Uploading Large Blobs

Robin Shahan continues her series on Azure Blob storage with a dive into uploading large blobs, including pausing and resuming.


In the previous article in this series, I showed you how to use the Storage Client Library to perform many of the operations needed to manage files in blob storage, such as upload, download, copy, delete, list, and rename. CloudBlockBlob.UploadFile works fine, but it can be tuned for special cases such as very slow internet access.

When I worked for a startup, one of the things our desktop product did was upload a bunch of images and an MP3 file to Azure blob storage. The MP3 could be as big as 20 MB. Many of our customers lived in areas with broadband upload speeds of 1.0 Mbps on a good day. When we tested using File.Upload on a 20 MB file with minimal broadband speed, we found the upload would time out and eventually fail. It just couldn't send up enough bytes and get a handshake back quickly enough to be successful.

In order to make our product work for all customers, we changed the upload to send the file up in blocks. The customer could set the block size. If the customer had pretty good internet speed (5 Mbps or higher), they might set the block size as high as 1 MB. If they had pretty bad internet speed (1 Mbps or lower), they could set the block size as low as 256 KB. This is small enough for a block to be uploaded and the handshake completed before it starts on the next block.
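To see why those numbers matter, here is a back-of-the-envelope calculation of raw transfer time per block (a simplification: it ignores latency, TCP overhead, and the handshake itself):

```python
def seconds_per_block(block_size_bytes: int, upload_mbps: float) -> float:
    """Rough transfer time for one block, ignoring latency and protocol overhead."""
    bits = block_size_bytes * 8
    return bits / (upload_mbps * 1_000_000)

# At 1 Mbps, a 256 KB block takes about 2 seconds to push up the wire,
# while a 4 MB block takes over half a minute -- long enough to risk a timeout.
print(round(seconds_per_block(256 * 1024, 1.0), 1))       # ~2.1
print(round(seconds_per_block(4 * 1024 * 1024, 1.0), 1))  # ~33.6
```

A couple of seconds per request leaves plenty of headroom before any timeout; thirty-plus seconds does not.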

In this article, I'm going to discuss two ways to upload a file in blocks. One way is to use the parameters that can be changed when calling the UploadFile method on the CloudBlockBlob object. The other way is to programmatically break the file into blocks, upload them one by one, then ask Azure to reassemble them.

Let's begin with using the built-in functions for uploading a file. I messed around with this a bit back in 2010-2011, but the properties as used back then are obsolete, and/or have been moved to different objects of the Storage Client Library since then. Bing-ing "SingleBlobUploadThresholdInBytes" only returned 8 articles. (Think about that. What have you searched for lately that only returned 8 results?) Most of the articles were from 2010-2011; the others were from MSDN, which offered a useful explanation like this: "This is the threshold in bytes for a single blob upload". Wow, incredibly helpful.

I managed to track down someone on the Azure Storage team at Microsoft to help me understand this, so at the time of this writing, I believe only three people in the world know how to use this correctly – me, the guy at Microsoft who owns it, and one of the other Azure MVPs. So after you read this, you will be part of a very elite group.

There are three properties directly involved.

SingleBlobUploadThresholdInBytes

This is the threshold in bytes for a single blob upload. (Haha! Kidding!) This setting determines whether the blob will be uploaded in one shot (Put Blob) or multiple requests (Put Block). It does not determine the block size. It basically says "if the file is smaller than this size, upload it as one block. If the file size is larger than this value, break it into blocks and upload it."

The minimum value for this is 1 MB (1024 * 1024). This means you cannot use this to chunk files that are smaller than 1 MB. ParallelOperationThreadCount must be equal to 1 (more on that below). Also, this works with the Upload* APIs (such as UploadFile) but not with blob streams. If you use OpenWrite to get a stream and write to it, it will always be uploaded behind the scenes using Put Block calls.
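Those rules can be condensed into a small decision function. This is a language-agnostic sketch of the behavior described above, not the SDK's actual code, and the exact boundary behavior (at versus strictly below the threshold) is an assumption here:

```python
MIN_THRESHOLD = 1024 * 1024  # the documented 1 MB floor for the threshold

def upload_strategy(file_size: int, threshold: int, parallel_threads: int = 1) -> str:
    """Sketch of the SDK's choice between one Put Blob and many Put Block calls."""
    if threshold < MIN_THRESHOLD:
        raise ValueError("SingleBlobUploadThresholdInBytes must be at least 1 MB")
    if parallel_threads != 1:
        return "put-block"  # the threshold is ignored when uploading in parallel
    # Boundary (<= vs <) assumed; small files go up in a single request.
    return "put-blob" if file_size <= threshold else "put-block"
```

So a 512 KB file goes up as one Put Blob, a 5 MB file is chunked, and any parallelism at all forces chunking regardless of size.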

This property is found in the BlobRequestOptions class. To use it, create a BlobRequestOptions object and assign it to the CloudBlobClient's DefaultRequestOptions property.

StreamWriteSizeInBytes

This sets the size of the blocks to use when you do a Put Blob and it breaks the file into blocks to upload because the file is larger than the value of SingleBlobUploadThresholdInBytes.

By default, this is 4MB (4 * 1024 * 1024).

This is a property on the CloudBlockBlob object or CloudPageBlob object, whichever you are using. You can use this when streaming files up to Azure as well (like when you're using UploadStream instead of UploadFile).

ParallelOperationThreadCount

This specifies how many parallel PutBlock or PutPage operations should be pending at a time.

If this is set to anything but 1, SingleBlobUploadThresholdInBytes will be ignored. After all, if you ask for the file to be sent up in multiple threads, there's no way to do that except to send it up in blocks, right?

This is a property of the BlobRequestOptions object.

All together now

So for example, if you use these values:

  • ParallelOperationThreadCount = 1
  • StreamWriteSizeInBytes = 256 * 1024 //(256 kb)
  • SingleBlobUploadThresholdInBytes = 1024 * 1024 //(1 MB)

and call blob.UploadFile: if the file is less than 1 MB, it will use one Put Blob to upload it. If the file is larger than 1 MB, it will split it into 256 KB blocks and send the blocks up as multiple requests.
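As a sketch of what the chunking works out to, here is a little function that computes the (offset, length) of each Put Block request for a given file and block size. This is illustrative arithmetic, not SDK code:

```python
def plan_blocks(file_size: int, block_size: int) -> list:
    """(offset, length) for each chunked request a file of this size would need."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

# A 5 MB file with 256 KB blocks divides evenly into exactly 20 requests.
plan = plan_blocks(5 * 1024 * 1024, 256 * 1024)
print(len(plan))  # 20
```

When the file size isn't an even multiple of the block size, the final block simply carries the leftover bytes.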

You might also consider changing the default Retry Policy. If you're chunking the file because you think the customer will have issues uploading it because their internet connection is poor, you might want to set it to only retry once, or not at all. Otherwise it may time out, then wait X seconds and time out again, etc., when it will never succeed. For this reason, I'm only having it retry once in the code below.

Uploading a file using the .NET Storage SDK

In the code above, you can see that I create a BlobRequestOptions object and assign the values of SingleBlobUploadThresholdInBytes, ParallelOperationThreadCount, and RetryPolicy. Then, after instantiating the CloudBlobClient, I set the DefaultRequestOptions to my BlobRequestOptions object. After getting a reference to the blob, I set the StreamWriteSizeInBytes. Then I upload the file.

If I turn Fiddler on and use the code above to upload a 5 MB file, I see multiple requests – one for each block. These calls are made consecutively because they are all running in a single thread (ParallelOperationThreadCount = 1).


Figure 1: Fiddler View

And if I look at any one line, I can see the size of the request. For all but the last two, the block size is the same as StreamWriteSizeInBytes. The last two send up the remainder of the bytes.


Figure 2: Fiddler Details

Upload a file in blocks programmatically

If you can set a couple of properties and upload a file in blocks easily, why would you want to do it programmatically? The case that immediately comes to mind is if you have files that are less than 1 MB and you want to send them up in 256 KB blocks. The minimum value for SingleBlobUploadThresholdInBytes is 1 MB, so you cannot use the method above.

Another case is if you want to let the user pause the upload process, then come back later and restart it. I'll talk about this after the code for uploading a file in blocks.

To programmatically upload a file in blocks, you first open a file stream for the file. Then repeatedly read a block of the file, set a block ID, calculate the MD5 hash of the block, and write the block to blob storage. Keep a list of the block IDs as you go. When you're done, you call PutBlockList and pass it the list of block IDs. Azure will put the blocks together in the order specified in the list, and then commit them. If you get the block list out of order, or you don't put all of the blocks before committing the list, your file will be corrupted.

The block IDs must all be the same size for all of the blocks, or your upload/commit will fail. I usually just number them from 1 to whatever, using a block ID that is formatted as a 7-character string. So for 1, I'll get "0000001". Note that block IDs have to be a base64 string.
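That ID scheme is easy to sketch in any language. Here it is in Python (the `make_block_id` name is mine, not from the SDK): zero-pad the block number to a fixed width, then base64-encode it, so every ID has the same length as the service requires.

```python
import base64

def make_block_id(n: int, width: int = 7) -> str:
    """Zero-padded block number, base64-encoded; fixed width keeps all IDs the same size."""
    return base64.b64encode(str(n).zfill(width).encode("ascii")).decode("ascii")

print(make_block_id(1))  # MDAwMDAwMQ==
```

Because every padded string is the same length, every encoded ID is too, which satisfies the same-size rule above.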

Here's the code for uploading a file in blocks. I've put comments in to explain what's going on.
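The loop itself is the same in any language. Here is a sketch of it in Python, where `blob` is a hypothetical object whose `put_block` and `put_block_list` methods stand in for the SDK's PutBlock and PutBlockList calls (this is not the real Storage Client Library API):

```python
import base64
import hashlib

def block_id(n: int) -> str:
    # Fixed-width IDs, base64-encoded; all IDs for a blob must be the same length.
    return base64.b64encode("{:07d}".format(n).encode()).decode()

def upload_in_blocks(stream, blob, block_size: int = 256 * 1024) -> list:
    """Read the stream block by block, put each block, then commit the ordered list."""
    block_ids = []
    n = 1
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        bid = block_id(n)
        md5 = hashlib.md5(chunk).hexdigest()  # integrity hash sent with each block
        blob.put_block(bid, chunk, md5)       # stand-in for the SDK's PutBlock
        block_ids.append(bid)
        n += 1
    # Commit in order; an out-of-order or incomplete list corrupts the file.
    blob.put_block_list(block_ids)
    return block_ids
```

Keeping `block_ids` in read order is the whole trick: PutBlockList reassembles the blob in exactly the sequence you hand it.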

You can actually split the file up and upload it in multiple parallel threads. For my use case (customer has insufficient internet speed), that wouldn't make sense. If he can't upload chunks bigger than 256 KB, then he can't upload 2 or 3 or 4 of those at the same time. But if you have decent upload speed, you could definitely upload multiple blocks in parallel.

What if you want to give the customer the ability to start an upload, stop it, and resume it later? The customer is uploading a file with your application, and he hits pause and goes off to do something else for a while. When he hits pause, you just stop uploading the file. When he comes back and asks to resume the upload, call to get a list of the uncommitted blocks that have been uploaded and put each blockListItem.Name in a List<string>. Start reading the file from the beginning. Read each block in and create the block ID the same way you created it before. Add this to the list of block IDs that you are going to use to commit all the blocks at the end. See if the block ID is in the list of uncommitted blocks. If it is, remove it from the list of uncommitted blocks, because you've found it and won't find it again, so why bother leaving it in the search list? If the block ID is not in the list of uncommitted blocks, call PutBlock to write the block to Blob Storage.

After reading the whole file and putting all of the missing blocks, call PutBlockList with the list of block IDs to commit the file.
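That resume walk can be sketched as follows, again in Python against a hypothetical `blob` object rather than the real SDK; `uncommitted` is the set of block names retrieved from the service:

```python
import base64
import hashlib

def block_id(n: int) -> str:
    # Must match the ID scheme used on the first pass, or nothing will line up.
    return base64.b64encode("{:07d}".format(n).encode()).decode()

def resume_upload(stream, blob, uncommitted, block_size: int = 256 * 1024) -> list:
    """Re-read the file, skipping blocks already sitting uncommitted in the service."""
    remaining = set(uncommitted)
    block_ids = []
    n = 1
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        bid = block_id(n)
        block_ids.append(bid)            # every ID goes into the final commit list
        if bid in remaining:
            remaining.discard(bid)       # found it once; it won't appear again
        else:
            blob.put_block(bid, chunk, hashlib.md5(chunk).hexdigest())
        n += 1
    blob.put_block_list(block_ids)       # commit the complete, ordered list
    return block_ids
```

Note that the commit list always contains every block ID in file order, whether the block was just uploaded or was already waiting uncommitted on the server.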

This is pretty close to the same code as above, except it calls to get the list of uncommitted blocks, and checks whether each block has already been uploaded before writing it.

Instead of requesting a list of committed blocks from blob storage, you could keep track of the list on your own and store it somewhere on the customer's computer. I'd rather query blob storage; it feels safer somehow, because the list can't be accessed by the customer. (It is, after all, his computer.)

Another consideration is whether the file the customer is uploading can be changed between the time he starts the upload and the time it finishes. When I used this upload method, I was taking a bunch of images and an MP3 file, creating a zip file with a unique name, and uploading the zip file. The customer could find the zip file on the computer and mess with it, but it was extremely unlikely. Also, if the customer created another zip file, it would be queued after the first one, and start uploading after the first upload finished.

You can upload some blocks, wait a couple of days, upload some more blocks, wait another couple of days, etc. Uncommitted blocks will be cleared automatically after a week unless you add more blocks to the same blob or commit the blocks for the blob. Here's the code you can use to retrieve the list of blocks; the print statement shows you the members you can access for each block, and you can see the blockListItem.Name and the property telling if it's a committed block.
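Whatever SDK you use, the result of that call boils down to a list of (name, committed?) pairs, and the resume logic only needs the uncommitted names. A small illustrative helper (`partition_blocks` is my name, not an SDK member):

```python
def partition_blocks(block_list):
    """Split (name, is_committed) pairs from a block-list call into two name lists."""
    committed = [name for name, is_committed in block_list if is_committed]
    uncommitted = [name for name, is_committed in block_list if not is_committed]
    return committed, uncommitted
```

The uncommitted list is exactly the "search list" used in the resume procedure described earlier.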

To test your code, you can run the regular upload in debug mode and stop it when it gets past a handful of blocks, then run the routine that checks for the block status, uploads the rest of the blocks, and commits all of the blocks.

One thing to note: you can add the code that gets the blocks and checks whether they are committed before uploading them to your main Upload routine. I chose not to do this, but to use an almost-identical copy with those code bits added in, because retrieving the list of blocks takes a small toll on performance, so I only want to incur that hit when I know there is a possibility that the upload has been stopped and needs to be restarted.

Summary

In this article, I showed you how to go where few men/women have gone before by using the properties of the CloudBlobClient and CloudBlockBlob to let Azure do the hard work of uploading a file in blocks for you. I also showed you how to programmatically do that yourself, in case you want to stop in the middle and continue later. In the next post in this series, I will show you how to use the REST API directly to access blob storage, including running the REST calls in PowerShell. (Oooooh, aaaaah.)


Source: https://www.red-gate.com/simple-talk/cloud/platform-as-a-service/azure-blob-storage-part-4-uploading-large-blobs/
