Fragmentation of Azure Blob Storage in Data Upload and Download Scenarios

Introduction

If you’ve worked with Microsoft Azure cloud technologies, you’ve likely come across — or at least read about — Azure Storage Accounts and their components: Tables, Queues, and Blobs.

In this article, I’ll focus on blobs, particularly from the perspective of access speed, and review the possibilities of modifying them without downloading the entire blob content to the client.

Currently, Azure provides three types of blobs:

  1. Block Blobs store binary data in separate variable-sized blocks and allow up to 190 TB of total data in a single blob.
  2. Append Blobs are essentially block blobs, but Azure Storage infrastructure takes responsibility for appending data to the end of an existing blob. They allow multiple independent producers to write to the same blob without locks (though without order guarantees — only the guarantee that each append will succeed without overwriting others).
  3. Page Blobs provide random access to content and are mainly used to store virtual machine images.

When people mention blobs, they usually mean block blobs, and I'll take the same approach here: block blobs are the main focus, append blobs get a brief look, and I'll skip page blobs, since they are primarily used by Azure itself and I haven't had to work with them directly in practice.

What Could Be Simpler Than a Blob?

A few years ago, on one of our projects, we needed to ingest, store, and expose large volumes of telemetry data via an API. Using tables was inconvenient for several reasons.

  1. Reading performance over long time ranges was insufficient. Even daily partitioning and parallel queries per day didn’t give us the speed we needed.
  2. Hardware costs for our API services became excessive due to JSON serialization of Table API responses.

We considered Cosmos DB but found it far too expensive for our volumes. That’s when we discovered Append Blobs, which had just become generally available. Microsoft recommended them for log and journal scenarios, with built-in support for concurrent, lock-free writes. Perfect, right?

Initially, things looked good. Data streamed into blobs quickly, API queries performed faster than against Azure Tables, and protobuf binary serialization saved CPU resources.

But as blob sizes approached expected daily volumes, performance collapsed. Reading from blobs became painfully slow: even a 10 MB blob could take minutes to read!

The analysis revealed the issue: fragmentation of append blobs.

Every append operation creates a separate block. With concurrent inserts, Azure commits blocks but doesn’t group them. If blocks are very small and numerous, read speed plummets catastrophically.

Tip: If you’re using append blobs for journaling and later need to load them into infrastructure like ELK, set the largest possible logger buffer size (tolerating some data loss) to reduce fragmentation.
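
To illustrate the tip, here is a minimal sketch of a buffered writer built on AppendBlobClient from the Azure.Storage.Blobs v12 SDK: entries accumulate in memory and are flushed as a single append block once a threshold is reached, so each flush produces one reasonably sized block instead of many tiny ones. The class name, threshold, and flush policy are my own assumptions, not part of any specific logging framework.

using System.IO;
using System.Text;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

public class BufferedAppendWriter
{
    private readonly AppendBlobClient _blob;
    private readonly StringBuilder _buffer = new StringBuilder();
    private readonly int _flushThreshold;

    // A bigger buffer means fewer, larger blocks; keep it below the service's
    // maximum append block size.
    public BufferedAppendWriter(AppendBlobClient blob, int flushThreshold = 1024 * 1024)
    {
        _blob = blob;
        _flushThreshold = flushThreshold;
    }

    public async Task WriteLineAsync(string line)
    {
        _buffer.AppendLine(line);
        // Character count is used as a rough byte estimate here.
        if (_buffer.Length >= _flushThreshold)
            await FlushAsync();
    }

    public async Task FlushAsync()
    {
        if (_buffer.Length == 0) return;
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(_buffer.ToString()));
        // One AppendBlock call produces exactly one block in the blob.
        await _blob.AppendBlockAsync(stream);
        _buffer.Clear();
    }
}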

Block Blobs: Replacing Append Blobs

Back to block blobs — and how we solved our telemetry issue.

Azure’s documentation shows a simple API for uploading a blob in a single call. The single-call size limit is currently about 5 GB (256 MB before 2019, 64 MB before 2016).
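
For completeness, the single-call path looks roughly like this; the connection string, container and blob names, and payloadBytes are placeholders:

using System.IO;
using Azure.Storage.Blobs;

// One HTTP request, no manual block management.
var blobClient = new BlobClient("<connection-string>", "telemetry", "day-2024-06-01.bin");
await using var stream = new MemoryStream(payloadBytes);
await blobClient.UploadAsync(stream, overwrite: true);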

Beyond this simple API, there’s an advanced one with three operations: Put Block, Put Block List, and Get Block List.

  • You can upload blocks up to 4 GB in size (100 MB before 2019, 4 MB before 2016).
  • Each block has a unique ID (≤64 bytes).
  • After uploading blocks, you call Put Block List with their IDs to commit and expose the blob.
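
In the .NET v12 SDK these three operations map to BlockBlobClient.StageBlockAsync, CommitBlockListAsync, and GetBlockListAsync. A minimal sketch of the whole flow, assuming an already-constructed blockBlobClient and using placeholder data:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

var blockIds = new List<string>();
foreach (var chunk in new[] { "first part", "second part" })
{
    // Block IDs must be Base64-encoded and the same length within one blob.
    var blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
    using var stream = new MemoryStream(Encoding.UTF8.GetBytes(chunk));
    await blockBlobClient.StageBlockAsync(blockId, stream);   // Put Block
    blockIds.Add(blockId);
}

await blockBlobClient.CommitBlockListAsync(blockIds);         // Put Block List

// Get Block List: inspect what is committed and what is staged but uncommitted.
BlockList blockList = await blockBlobClient.GetBlockListAsync(BlockListTypes.All);
foreach (var block in blockList.CommittedBlocks)
    Console.WriteLine(block.Name);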

This unlocks several options:

  • Parallel uploads: Split large data into blocks and upload concurrently, achieving nearly 10× speed-up with minimal code changes.
  • DIY append blobs: Maintain the last “tail block” in your service. When appending, merge new data with it, upload it as a new block, and commit. No reads are needed, and fragmentation stays low (a sketch follows this list).
  • More flexibility: You can replace any block in an existing blob, not just the last one. Blocks can vary in size (up to 4 GB), enabling merging/splitting strategies.
  • Metadata in block IDs: With 64 bytes available, only a couple are needed for the actual ID (since there are ≤50,000 blocks per blob). The rest can store small read-only metadata.
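
Here is a rough sketch of the “DIY append blob” idea from the list above, assuming the service keeps the committed block IDs and the current tail bytes in memory; error handling, persistence of that state, and concurrency control are omitted, and the names and block-size threshold are my own illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

public class TailBlockAppender
{
    // When the tail grows past this size, freeze it and start a new tail.
    private const int TargetBlockSize = 4 * 1024 * 1024;

    private readonly BlockBlobClient _blob;
    private readonly List<string> _committedBlockIds = new List<string>();
    private byte[] _tail = Array.Empty<byte>();

    public TailBlockAppender(BlockBlobClient blob) => _blob = blob;

    public async Task AppendAsync(byte[] data)
    {
        // Merge the in-memory tail with the new data and stage it as a single block.
        var merged = new byte[_tail.Length + data.Length];
        Buffer.BlockCopy(_tail, 0, merged, 0, _tail.Length);
        Buffer.BlockCopy(data, 0, merged, _tail.Length, data.Length);

        var tailBlockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
        using (var stream = new MemoryStream(merged))
            await _blob.StageBlockAsync(tailBlockId, stream);

        // Commit all previously frozen blocks plus the new tail block;
        // the previous tail block simply drops out of the block list.
        await _blob.CommitBlockListAsync(_committedBlockIds.Concat(new[] { tailBlockId }));

        if (merged.Length >= TargetBlockSize)
        {
            _committedBlockIds.Add(tailBlockId);   // freeze the tail
            _tail = Array.Empty<byte>();
        }
        else
        {
            _tail = merged;                        // keep growing the tail
        }
    }
}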

Fragmentation, Upload, and Download Speed

So, is fragmentation always bad? Not necessarily.

For example, when importing external data into Azure Tables, it may be faster to pack it into a blob, upload it into the same datacentre as the table, and then load it into the table from there. In such cases, blob fragmentation (within reason) isn’t the bottleneck — table insert speed is.

Here’s a sample extension method for parallel block uploads:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

public static class BlockBlobClientExtensions
{
    public static async Task UploadUsingMultipleBlocksAsync(this BlockBlobClient client,
        byte[] content, int blockCount)
    {
        if (client == null) throw new ArgumentNullException(nameof(client));
        if (content == null) throw new ArgumentNullException(nameof(content));
        if (blockCount <= 0 || blockCount > content.Length)
            throw new ArgumentOutOfRangeException(nameof(blockCount));

        var position = 0;
        var blockSize = content.Length / blockCount;
        var blockIds = new List<string>();
        var tasks = new List<Task>();

        // Stage all blocks concurrently; block IDs must be Base64 strings of equal
        // length within a blob, which GUID-based IDs guarantee.
        while (position < content.Length)
        {
            var blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
            blockIds.Add(blockId);
            tasks.Add(UploadBlockAsync(client, blockId, content, position, blockSize));
            position += blockSize;
        }

        await Task.WhenAll(tasks);

        // Committing the ordered block list makes the blob visible atomically.
        await client.CommitBlockListAsync(blockIds);
    }

    private static async Task UploadBlockAsync(BlockBlobClient client, string blockId,
        byte[] content, int position, int blockSize)
    {
        // The final block may be shorter than blockSize.
        await using var blockContent = new MemoryStream(content, position,
            Math.Min(blockSize, content.Length - position));
        await client.StageBlockAsync(blockId, blockContent);
    }
}

This ensures atomic blob creation from multiple blocks and helps maintain sequence integrity. It’s especially useful when uploading to geographically distant datacentres with high TCP latency.
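
For reference, a call might look like this; the connection string, container and blob names, and the data source are placeholders:

using Azure.Storage.Blobs.Specialized;

var client = new BlockBlobClient("<connection-string>", "telemetry", "2024-06-01.bin");
byte[] payload = LoadTelemetryBatch();   // placeholder for wherever your data comes from
await client.UploadUsingMultipleBlocksAsync(payload, blockCount: 100);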

Just remember: don’t exceed Storage Account limits, or you’ll hit throttling.
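
One easy way to stay under those limits is to cap the number of concurrent block uploads, for example with a SemaphoreSlim. Below is a variant of the UploadBlockAsync helper above that could be slotted into the same class (the limit of 8 parallel requests is an arbitrary example, not a recommendation):

// Requires: using System.Threading;
private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(8);

private static async Task UploadBlockThrottledAsync(BlockBlobClient client, string blockId,
    byte[] content, int position, int blockSize)
{
    // Wait for a free slot before issuing the request.
    await Throttle.WaitAsync();
    try
    {
        await using var blockContent = new MemoryStream(content, position,
            Math.Min(blockSize, content.Length - position));
        await client.StageBlockAsync(blockId, blockContent);
    }
    finally
    {
        Throttle.Release();
    }
}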

A Few Tests

We tested the impact of fragmentation on download speed and the effect of parallelism on upload speed.

  • Test blob size: 100,000,000 bytes.
  • Split into 1, 100, 1,000, 10,000, or 50,000 blocks.
  • Upload to Azure Storage, then download and delete.
  • Measure upload and download times.

Test 1: Azure North Europe (client in Moscow)

| Blocks | Block Size (bytes) | Upload Time (s) | Upload Speed (KB/s) | Download Time (s) | Download Speed (KB/s) |
|--------|---------------------|-----------------|----------------------|-------------------|-----------------------|
| 1 | 100,000,000 | 19.32 | 5,054 | 28.90 | 3,379 |
| 100 | 1,000,000 | 3.81 | 25,631 | 38.49 | 2,537 |
| 1,000 | 100,000 | 6.00 | 16,276 | 42.16 | 2,316 |
| 10,000 | 10,000 | 7.27 | 13,432 | 127.73 | 764 |
| 50,000 | 2,000 | 31.97 | 3,054 | 394.86 | 247 |

Test 2: Azure West US (client in Moscow)

| Blocks | Block Size (bytes) | Upload Time (s) | Upload Speed (KB/s) | Download Time (s) | Download Speed (KB/s) |
|--------|---------------------|-----------------|----------------------|-------------------|-----------------------|
| 1 | 100,000,000 | 51.48 | 1,896 | 80.00 | 1,220 |
| 100 | 1,000,000 | 8.96 | 10,899 | 96.00 | 1,017 |
| 1,000 | 100,000 | 3.48 | 28,062 | 105.00 | 930 |
| 10,000 | 10,000 | 2.67 | 36,575 | 230.00 | 424 |
| 50,000 | 2,000 | 9.82 | 9,944 | 770.00 | 127 |

From roughly 1,000 blocks onwards (and dramatically by 50,000), upload times climb again as per-request overhead and API throttling kick in: not something you’d want to hit in production.

Conclusions

When working with large mutable blobs, it’s crucial to control fragmentation. Otherwise, access speeds may degrade so much that your data becomes practically unusable.

If you face such issues, you may need to change your application logic or periodically “heal” blobs by rebuilding them with larger blocks. The Azure APIs let you rebuild a blob’s block list in place, and ETag-based optimistic concurrency protects you from overwriting concurrent changes while you do it; a rough sketch follows.
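
The sketch below is one simple variant: it downloads the blob and re-stages its content as a few large blocks, committing only if the blob’s ETag has not changed in the meantime. A fully server-side rebuild could use Put Block From URL instead, which I won’t show here. The method name and block size are my own illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

public static class BlobMaintenance
{
    public static async Task DefragmentAsync(BlockBlobClient blob,
        int targetBlockSize = 64 * 1024 * 1024)
    {
        // Remember the current version so we only commit if nobody wrote in between.
        BlobProperties props = await blob.GetPropertiesAsync();
        ETag originalETag = props.ETag;

        // Download the whole content (fine for moderately sized blobs).
        using var buffer = new MemoryStream();
        await blob.DownloadToAsync(buffer);
        var content = buffer.ToArray();

        // Re-stage the content as a small number of large blocks.
        var blockIds = new List<string>();
        for (var position = 0; position < content.Length; position += targetBlockSize)
        {
            var blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
            using var block = new MemoryStream(content, position,
                Math.Min(targetBlockSize, content.Length - position));
            await blob.StageBlockAsync(blockId, block);
            blockIds.Add(blockId);
        }

        // Commit only if the blob still has the ETag we started from;
        // otherwise this throws and the rebuild can be retried.
        await blob.CommitBlockListAsync(blockIds, new CommitBlockListOptions
        {
            Conditions = new BlobRequestConditions { IfMatch = originalETag }
        });
    }
}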

For data import into Azure, moderate fragmentation (100–1,000 blocks, with controlled parallelism) can improve upload speeds by up to an order of magnitude with minimal code changes, while the roughly 25% loss in subsequent read speed is usually an acceptable trade-off.
