Introduction
Efficiently downloading large files in Python can be challenging, especially when you want to support asynchronous downloads, caching, file validation, and real-time progress feedback. In this blog, we’ll walk through building a production-ready async file downloader using the aiohttp library, with features like cache validation, file size and MD5 hash checking, and a customizable progress callback.
Why Choose aiohttp for Asynchronous Downloads?
aiohttp is my go-to library for asynchronous HTTP operations in Python. It’s fast, mature, and designed for non-blocking network tasks—making it ideal for:
- Downloading large files efficiently
- Handling multiple downloads concurrently (see the sketch after this list)
- Integrating with modern async Python workflows
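To make the concurrency point concrete, here is a minimal sketch of fetching several resources at once with a shared session. The URLs and the fetch helper are placeholders for illustration, not part of the downloader we build below:

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    # Read the response body; raise on HTTP error statuses.
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def main():
    urls = [  # hypothetical URLs
        "https://example.com/a.bin",
        "https://example.com/b.bin",
    ]
    async with aiohttp.ClientSession() as session:
        # gather() runs all requests concurrently on one event loop.
        bodies = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, body in zip(urls, bodies):
        print(f"{url}: {len(body)} bytes")

asyncio.run(main())
```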
Key Features of Our Async File Downloader
Here’s what sets this downloader apart:
- Asynchronous Downloading: Harness the power of async/await for non-blocking file transfers.
- Smart Caching: Skip downloads if the file already exists and matches expected size or MD5 hash.
- Robust Validation: Automatically check file size and MD5 hash after download to ensure integrity.
- Custom Progress Callback: Get real-time feedback with a callback function for download progress.
Implementation Overview
Below is a streamlined version of the AsyncFileDownloader class. It’s designed for clarity and extensibility:
```python
import aiohttp
import asyncio
import hashlib
import os
import time


class AsyncFileDownloader:
    def __init__(self, output_dir="."):
        self.output_dir = output_dir
        os.makedirs(self.output_dir, exist_ok=True)

    async def _md5sum(self, file_path, chunk_size=8192):
        # Hash the file in chunks so large files never need to fit in memory.
        md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                md5.update(chunk)
        return md5.hexdigest()

    async def _validate_file(self, file_path, expected_size=None, expected_md5=None):
        # A file is valid if it exists and passes whichever checks were requested.
        if not os.path.exists(file_path):
            return False
        if expected_size is not None and os.path.getsize(file_path) != expected_size:
            return False
        if expected_md5 is not None:
            actual_md5 = await self._md5sum(file_path)
            if actual_md5 != expected_md5:
                return False
        return True

    async def download(self, url, filename=None, expected_size=None,
                       expected_md5=None, callback=None, frequency=0.5):
        if not filename:
            filename = os.path.basename(url)
        file_path = os.path.join(self.output_dir, filename)

        # Cache validation: skip the download if a valid copy already exists.
        if await self._validate_file(file_path, expected_size, expected_md5):
            print(f"File {file_path} already valid. Skipping download.")
            return file_path

        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                # Fall back to expected_size if the server omits Content-Length.
                total_bytes = int(resp.headers.get('Content-Length', 0)) or expected_size or 0
                bytes_downloaded = 0
                start_time = time.time()
                last_callback = start_time
                with open(file_path, "wb") as f:
                    async for chunk in resp.content.iter_chunked(8192):
                        f.write(chunk)
                        bytes_downloaded += len(chunk)
                        now = time.time()
                        # Throttle callbacks to `frequency` seconds, but always
                        # fire when the byte count reaches the reported total.
                        if callback and (now - last_callback >= frequency
                                         or bytes_downloaded == total_bytes):
                            time_elapsed = now - start_time
                            callback(bytes_downloaded, total_bytes, time_elapsed)
                            last_callback = now

        print(f"Downloaded {file_path}")
        # Post-download validation: delete the file and raise if it is corrupt.
        if not await self._validate_file(file_path, expected_size, expected_md5):
            os.remove(file_path)
            raise ValueError(f"Downloaded file {file_path} failed validation.")
        return file_path
```
Let’s break down the workflow:
- Initialization: Set your output directory for downloads.
- Cache Validation: Before downloading, check if the file already exists and matches the expected size or MD5 hash.
- Async Download: If needed, stream the file in chunks and write to disk.
- Progress Callback: Receive real-time updates on download progress, bytes transferred, and elapsed time.
- Post-download Validation: After download, validate the file again. If it fails, delete and raise an error.
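One refinement worth flagging: `_md5sum` above is declared `async` but reads the file with ordinary blocking I/O, so hashing a large file briefly stalls the event loop. A sketch of one way around this, offloading the hash to a worker thread with `asyncio.to_thread` (Python 3.9+); the function names here are illustrative, not part of the class above:

```python
import asyncio
import hashlib

def _md5sum_blocking(file_path: str, chunk_size: int = 8192) -> str:
    # Plain synchronous hashing; safe to run in a worker thread.
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()

async def md5sum(file_path: str) -> str:
    # Run the blocking hash off the event loop so downloads keep flowing.
    return await asyncio.to_thread(_md5sum_blocking, file_path)
```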
Example Usage
Here’s how you can use the downloader in your own projects:
```python
import asyncio

def print_progress(bytes_downloaded, total_bytes, time_elapsed):
    percent = (bytes_downloaded / total_bytes) * 100 if total_bytes else 0
    print(f"Downloaded: {bytes_downloaded}/{total_bytes} bytes ({percent:.2f}%), Time elapsed: {time_elapsed:.2f}s")

async def main():
    downloader = AsyncFileDownloader(output_dir="downloads")
    await downloader.download(
        url="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2",
        expected_size=291274499,
        expected_md5="5b594c2af71ecf65505dc42d49ab6121",
        callback=print_progress,
        frequency=1.0,
    )

asyncio.run(main())
```
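Since the callback receives both the byte count and the elapsed time, deriving throughput is a one-liner. A variant of `print_progress` (same signature, just an extra computed value) might look like:

```python
def print_progress_with_speed(bytes_downloaded, total_bytes, time_elapsed):
    percent = (bytes_downloaded / total_bytes) * 100 if total_bytes else 0
    # Average speed in MB/s; guard against division by zero on the first call.
    speed = bytes_downloaded / time_elapsed / 1_000_000 if time_elapsed else 0
    print(f"{bytes_downloaded}/{total_bytes} bytes ({percent:.2f}%) at {speed:.2f} MB/s")
```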
Other Considerations
- Limit Concurrency: For large files or many simultaneous downloads, use a semaphore or queue to avoid overwhelming your system (see the sketch after this list).
- Validate Everything: Always check files after download to guarantee data integrity.
- Explore Alternatives: While aiohttp is excellent, consider httpx for advanced async HTTP needs.
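For the concurrency point above, a semaphore keeps the number of in-flight downloads bounded. This sketch reuses the AsyncFileDownloader class from earlier; the helper name and the limit of 4 are arbitrary choices for illustration:

```python
import asyncio

async def download_all(urls, max_concurrent=4):
    downloader = AsyncFileDownloader(output_dir="downloads")
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # At most max_concurrent downloads run at any one time.
        async with semaphore:
            return await downloader.download(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```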
Conclusion
With aiohttp, you get speed, reliability, and flexibility—perfect for data engineering, web scraping, and AI workflows.
If you found this post helpful, consider subscribing to my newsletter for more deep dives into Python, AI, and engineering best practices.
Have questions, feedback, or your own download tips? Drop a comment below.