Robust Async File Downloader in Python with aiohttp

Introduction

Efficiently downloading large files in Python can be challenging, especially when you want asynchronous transfers, caching, file validation, and real-time progress feedback. In this post, we'll walk through building a production-ready async file downloader with the aiohttp library, featuring cache validation, file size and MD5 hash checking, and a customizable progress callback.

Why Choose aiohttp for Asynchronous Downloads?

aiohttp is my go-to library for asynchronous HTTP operations in Python. It’s fast, mature, and designed for non-blocking network tasks—making it ideal for:

  • Downloading large files efficiently
  • Handling multiple downloads concurrently
  • Integrating with modern async Python workflows
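
To give a feel for the non-blocking style before we dive into the full class, here is a minimal sketch of an async GET with aiohttp. The function name `fetch_status` and the session-per-call layout are illustrative only; in real code you would typically share one session across many requests.

```python
import aiohttp

async def fetch_status(url: str) -> int:
    # A ClientSession pools connections; prefer reusing one session
    # across requests rather than creating a new one per call.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return resp.status
```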

Key Features of Our Async File Downloader

Here’s what sets this downloader apart:

  • Asynchronous Downloading: Harness the power of async/await for non-blocking file transfers.
  • Smart Caching: Skip downloads if the file already exists and matches expected size or MD5 hash.
  • Robust Validation: Automatically check file size and MD5 hash after download to ensure integrity.
  • Custom Progress Callback: Get real-time feedback with a callback function for download progress.

Implementation Overview

Below is a streamlined version of the AsyncFileDownloader class. It’s designed for clarity and extensibility:

import aiohttp
import asyncio
import hashlib
import os
import time


class AsyncFileDownloader:
    def __init__(self, output_dir="."):
        self.output_dir = output_dir
        os.makedirs(self.output_dir, exist_ok=True)

    async def _md5sum(self, file_path, chunk_size=8192):
        # Hash the file in chunks so large files never sit fully in memory.
        md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                md5.update(chunk)
        return md5.hexdigest()

    async def _validate_file(self, file_path, expected_size=None, expected_md5=None):
        if not os.path.exists(file_path):
            return False
        if expected_size is not None and os.path.getsize(file_path) != expected_size:
            return False
        if expected_md5 is not None:
            actual_md5 = await self._md5sum(file_path)
            if actual_md5 != expected_md5:
                return False
        return True

    async def download(self, url, filename=None, expected_size=None,
                       expected_md5=None, callback=None, frequency=0.5):
        if not filename:
            filename = os.path.basename(url)
        file_path = os.path.join(self.output_dir, filename)

        # Cache validation: skip the download if a valid copy already exists.
        if await self._validate_file(file_path, expected_size, expected_md5):
            print(f"File {file_path} already valid. Skipping download.")
            return file_path

        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                total_bytes = int(resp.headers.get('Content-Length', 0)) or expected_size or 0
                bytes_downloaded = 0
                start_time = time.time()
                last_callback = start_time
                with open(file_path, "wb") as f:
                    # Stream the body in chunks instead of buffering it whole.
                    async for chunk in resp.content.iter_chunked(8192):
                        f.write(chunk)
                        bytes_downloaded += len(chunk)
                        now = time.time()
                        # Throttle callbacks to at most once per `frequency` seconds,
                        # but always fire on the final chunk.
                        if callback and (now - last_callback >= frequency or bytes_downloaded == total_bytes):
                            time_elapsed = now - start_time
                            callback(bytes_downloaded, total_bytes, time_elapsed)
                            last_callback = now
        print(f"Downloaded {file_path}")

        # Post-download validation: delete the file and raise if it is corrupt.
        if not await self._validate_file(file_path, expected_size, expected_md5):
            os.remove(file_path)
            raise ValueError(f"Downloaded file {file_path} failed validation.")
        return file_path

Let’s break down the workflow:

  1. Initialization: Set your output directory for downloads.
  2. Cache Validation: Before downloading, check if the file already exists and matches the expected size or MD5 hash.
  3. Async Download: If needed, stream the file in chunks and write to disk.
  4. Progress Callback: Receive real-time updates on download progress, bytes transferred, and elapsed time.
  5. Post-download Validation: After download, validate the file again. If it fails, delete and raise an error.
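
The progress callback in step 4 receives bytes downloaded, total bytes, and elapsed seconds, which is enough to derive richer metrics. As one sketch, a hypothetical `speed_progress` callback (not part of the class above) could report transfer rate and a rough ETA:

```python
def speed_progress(bytes_downloaded, total_bytes, time_elapsed):
    # Derive transfer rate and a rough ETA from the three callback arguments.
    rate = bytes_downloaded / time_elapsed if time_elapsed > 0 else 0
    remaining = max(total_bytes - bytes_downloaded, 0)
    eta = remaining / rate if rate > 0 else float("inf")
    print(f"{rate / 1_048_576:.2f} MiB/s, ETA {eta:.0f}s")
```

Pass it as `callback=speed_progress` exactly like any other callback.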

Example Usage

Here’s how you can use the downloader in your own projects:

import asyncio

def print_progress(bytes_downloaded, total_bytes, time_elapsed):
    percent = (bytes_downloaded / total_bytes) * 100 if total_bytes else 0
    print(f"Downloaded: {bytes_downloaded}/{total_bytes} bytes "
          f"({percent:.2f}%), Time elapsed: {time_elapsed:.2f}s")

async def main():
    downloader = AsyncFileDownloader(output_dir="downloads")
    await downloader.download(
        url="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2",
        expected_size=291274499,
        expected_md5="5b594c2af71ecf65505dc42d49ab6121",
        callback=print_progress,
        frequency=1.0,
    )

asyncio.run(main())

Other Considerations

  • Limit Concurrency: For large files or many simultaneous downloads, use a semaphore or queue to avoid overwhelming your system.
  • Validate Everything: Always check files after download to guarantee data integrity.
  • Explore Alternatives: While aiohttp is excellent, consider httpx for advanced async HTTP needs.
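
To make the concurrency point above concrete, here is one minimal sketch of bounding simultaneous downloads with an asyncio.Semaphore. The helper name `download_all` and its `max_concurrent` parameter are assumptions for illustration; it calls the `download` method of the class above.

```python
import asyncio

async def download_all(downloader, urls, max_concurrent=3):
    # A semaphore caps how many downloads run at once;
    # the rest wait until a slot frees up.
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await downloader.download(url)

    # gather() runs all bounded tasks concurrently and
    # returns their results in input order.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Tune `max_concurrent` to your bandwidth and the remote server's tolerance; a small number (3-5) is usually a safe starting point.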

Conclusion

With aiohttp, you get speed, reliability, and flexibility—perfect for data engineering, web scraping, and AI workflows.

If you found this post helpful, consider subscribing to my newsletter for more deep dives into Python, AI, and engineering best practices.

Have questions, feedback, or your own download tips? Drop a comment below!

