High-Performance File Downloads in Python with PycURL

Have you ever needed to download massive datasets, model checkpoints, or other large files in your Python projects, and wished for a tool that’s both fast and reliable? In this post, I’ll show you how to harness the power of PycURL for high-performance, robust file downloads. We’ll build a reusable download class that:

  • Handles chunked downloads for efficiency
  • Validates file size and MD5 checksums to ensure data integrity
  • Skips unnecessary downloads to save bandwidth and time
  • Provides real-time progress updates via callbacks

By the end, you’ll have a production-ready solution for your data engineering, machine learning, or automation pipelines.

Prerequisites: Installing PycURL

Before we get started, make sure you have PycURL installed. Here’s how:

pip install pycurl

On macOS or Linux, you may also need the libcurl development headers:

brew install curl  # macOS
# or
sudo apt-get install libcurl4-openssl-dev # Ubuntu/Debian
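
To confirm the installation worked, you can print the version string PycURL exposes, which includes the libcurl version it was built against:

python -c "import pycurl; print(pycurl.version)"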

Building a Robust Download Class

Let’s architect a Python class that makes file downloads both efficient and bulletproof. Here’s what our class will do:

  • Accept a download directory, chunk size, and an optional progress callback
  • Check if the file already exists and is valid (by size and MD5)
  • Download only if needed, with chunked streaming and progress updates
  • Validate the downloaded file for size and MD5 integrity

This approach is perfect for AI workflows, where downloading large, versioned datasets or model weights is common.

import os
import pycurl
import hashlib


class FileDownloader:
    def __init__(self, download_dir, chunk_size=8192, callback=None):
        self.download_dir = download_dir
        self.chunk_size = chunk_size
        self.callback = callback

    def _md5sum(self, file_path):
        """Compute the MD5 hash of a file, reading it in chunks."""
        hash_md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(self.chunk_size)
                if not chunk:
                    break
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

    def _validate_file(self, file_path, expected_size=None, expected_md5=None):
        """Return True if the file exists and matches the expected size and MD5."""
        if not os.path.exists(file_path):
            return False
        if expected_size is not None and os.path.getsize(file_path) != expected_size:
            return False
        if expected_md5 is not None:
            actual_md5 = self._md5sum(file_path)
            if actual_md5 != expected_md5:
                return False
        return True

    def download(self, url, filename, expected_size=None, expected_md5=None):
        file_path = os.path.join(self.download_dir, filename)

        # Skip the download entirely if a valid copy is already on disk.
        if self._validate_file(file_path, expected_size, expected_md5):
            print(f"File {file_path} already exists and is valid. Skipping download.")
            return file_path

        md5 = hashlib.md5()
        received = 0

        # Issue a HEAD request (NOBODY) to read the Content-Length header.
        c = pycurl.Curl()
        c.setopt(c.URL, url)
        c.setopt(c.FOLLOWLOCATION, True)
        c.setopt(c.NOBODY, True)
        c.perform()
        content_length = c.getinfo(pycurl.CONTENT_LENGTH_DOWNLOAD)
        c.close()
        # libcurl reports -1 when the server omits Content-Length,
        # so fall back to the caller-provided expected size.
        total_bytes = int(content_length) if content_length > 0 else expected_size

        def write_callback(data):
            # Called by libcurl for each chunk received: write it to disk,
            # fold it into the running MD5, and report progress.
            nonlocal received
            f.write(data)
            md5.update(data)
            received += len(data)
            if self.callback:
                self.callback(received, total_bytes)

        with open(file_path, 'wb') as f:
            c = pycurl.Curl()
            c.setopt(c.URL, url)
            c.setopt(c.FOLLOWLOCATION, True)
            c.setopt(c.BUFFERSIZE, self.chunk_size)
            c.setopt(c.WRITEFUNCTION, write_callback)
            c.perform()
            c.close()

        # Check the size of what we actually received.
        if expected_size is not None and received != expected_size:
            raise ValueError(f"Size mismatch: expected {expected_size}, got {received}")
        # Check the MD5 computed on the fly during the download.
        if expected_md5 is not None and md5.hexdigest() != expected_md5:
            raise ValueError("MD5 checksum mismatch")
        return file_path
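
One thing the class above does not do is resume an interrupted transfer; it always restarts from byte zero. libcurl supports resuming via its RESUME_FROM option, provided the server honors HTTP Range requests. Here’s a minimal standalone sketch (separate from the class above; the function name and paths are mine, not part of the post’s API). Note that the on-the-fly MD5 trick doesn’t combine cleanly with resuming, since you’d only hash the new bytes:

import os
import pycurl

def resume_download(url, file_path):
    """Resume (or start) a download, appending to any partial file on disk."""
    # If a partial file exists, ask the server to start from its current size.
    offset = os.path.getsize(file_path) if os.path.exists(file_path) else 0
    with open(file_path, 'ab') as f:  # append mode preserves existing bytes
        c = pycurl.Curl()
        c.setopt(c.URL, url)
        c.setopt(c.FOLLOWLOCATION, True)
        c.setopt(c.RESUME_FROM, offset)  # requires server Range support
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()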

Usage Example: Downloading with Progress and Validation

Here’s how you can use the FileDownloader in your own projects:

def progress(bytes_downloaded, total_bytes):
    percent = (bytes_downloaded / total_bytes) * 100 if total_bytes else 0
    print(f"Downloaded {bytes_downloaded}/{total_bytes} bytes ({percent:.2f}%)")


url = 'https://dumps.wikimedia.org/enwiki/20250920/enwiki-20250920-pages-articles1.xml-p1p41242.bz2'
md5 = "5b594c2af71ecf65505dc42d49ab6121"
size = 291274499

downloader = FileDownloader(download_dir="/tmp", chunk_size=16384, callback=progress)
downloader.download(
    url=url,
    filename="largefile.bz2",
    expected_size=size,
    expected_md5=md5,
)
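
In practice you’ll also want to handle failures: PycURL raises pycurl.error for network-level problems, and our download method raises ValueError on a size or checksum mismatch. A simple retry wrapper might look like this (the retry count and backoff are arbitrary choices, not part of the class):

import time
import pycurl

def download_with_retries(downloader, retries=3, **kwargs):
    """Retry a download a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return downloader.download(**kwargs)
        except (pycurl.error, ValueError) as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff

Because download validates any existing file before fetching, a retry after a partial failure simply overwrites the bad copy rather than trusting it.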

Conclusion

In this guide, we explored how to leverage PycURL for high-performance, reliable file downloads in Python, building a reusable downloader with chunked streaming, real-time progress callbacks, and size/MD5 validation.

If you found this post helpful, consider subscribing to my newsletter for more deep dives into Python, AI, and engineering best practices.

Have questions, feedback, or your own download tips? Drop a comment below!

