Training and fine-tuning AI models requires a large amount of data. That data can be hosted on various platforms: Hugging Face is arguably the best option these days, but Box is also a popular choice.
Although it’s convenient to download data from Box through a browser, it becomes a pain in the neck when developers like us try to do the same in a terminal environment.
Fortunately, the Box SDK for Python provides APIs that handle these tasks for us. Let’s explore how to use it.
Getting Started
First, we’ll install the SDK with pip. For those who aren’t familiar, an SDK (Software Development Kit) is basically a toolbox of libraries for building a particular kind of application.
pip install boxsdk
Authentication
To access Box APIs, you need to prove who you are, and that’s what authentication is all about. Let’s start with the Developer Token — a temporary key for accessing the APIs.
To create a Developer Token, make sure you have a Box account. Then, go to the developer console and create a Custom App.
An application is simply a unique identity that interacts with the Box APIs on your behalf. Once you’ve created one, you can also share it with others (like your collaborators) if you want.
When you create an app, you’ll see three authentication options. These determine how users of your application prove their identity to Box.
- Server Authentication (with JWT)
This method creates a key pair: a public key and a private key. The application signs a JWT (JSON Web Token) with the private key and sends it to Box, which verifies the signature using the public key registered for the app. The main benefit is that users of the application don’t each need their own Box account.
- OAuth 2.0 (User Login Authentication)
Users must log into their Box accounts, allowing the application to identify who they are.
- Client ID and Client Secret Authentication
Callers must provide a Client ID and a Client Secret to prove their identity. This is just like having a username and password for your own application.
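To make the JWT option a bit more concrete, here is a rough sketch of what the token contains before it is signed. The claim names and the audience URL follow Box's documented JWT flow as I understand it, so treat them as assumptions to check against the official docs. The function names are made up for this example, and the signing step (which needs the private key and a crypto library) is deliberately left out.

```python
import base64
import json
import secrets
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding JWT segments use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_unsigned_assertion(client_id, enterprise_id, key_id, now=None):
    """Return the 'header.payload' part of a JWT assertion (no signature yet)."""
    now = int(time.time() if now is None else now)
    # Header: which algorithm and which registered public key to verify with.
    header = {"alg": "RS256", "typ": "JWT", "kid": key_id}
    # Claims (assumed names, see the note above).
    claims = {
        "iss": client_id,               # the app's client ID
        "sub": enterprise_id,           # the identity the app acts as
        "box_sub_type": "enterprise",
        "aud": "https://api.box.com/oauth2/token",
        "jti": secrets.token_hex(32),   # unique ID so a token can't be replayed
        "exp": now + 45,                # short-lived on purpose
    }
    parts = [json.dumps(header), json.dumps(claims)]
    return ".".join(b64url(part.encode("utf-8")) for part in parts)
```

The application would then sign this string with its private key and append the signature as a third segment. In practice the SDK does all of this for you, so this is only meant to show what is inside the token.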
Let’s just go with JWT for now, since every method allows us to use a developer token for testing.
Enable Write Permissions
In Application Scopes, make sure to turn on write permissions so your application can upload and download files in Box.
Generating a Developer Token
After creating the app, you can generate a Developer Token on your app page.
Copy the token and run the following code. Then enter your Developer Token so you can log in as a development client.
from boxsdk import DevelopmentClient
# This will prompt you for your developer token
client = DevelopmentClient()
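DevelopmentClient asks for the token interactively, which gets awkward inside a script or a batch job. Here is a minimal sketch of a non-interactive alternative that passes the token to the SDK's OAuth2 class instead. It assumes the token is stored in an environment variable; the name BOX_DEVELOPER_TOKEN and both helper names are made up for this example.

```python
import os


def read_token(env=os.environ, var_name="BOX_DEVELOPER_TOKEN"):
    """Return the developer token from the environment, or raise a clear error.

    BOX_DEVELOPER_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} to your Box developer token first")
    return token


def make_client():
    """Build a Box client from the token without any interactive prompt."""
    # Imported lazily so this module still loads where boxsdk is not installed.
    from boxsdk import Client, OAuth2

    # A developer token works without a client ID or secret,
    # but it cannot be refreshed once it expires.
    auth = OAuth2(client_id=None, client_secret=None, access_token=read_token())
    return Client(auth)
```

After `export BOX_DEVELOPER_TOKEN=...` in your shell, `client = make_client()` gives you the same kind of client as above. Keep in mind that developer tokens expire after 60 minutes, so this only suits short test runs.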
Basic Operations
You can get your username and email by running the following code.
me = client.user(user_id='me').get()
print(f"User: {me.name}, Email: {me.login}")
You can list the items at the top level of your Box drive by running this code.
root_folder = client.folder(folder_id='0') # '0' is the ID for the root folder
items = root_folder.get_items()
for item in items:
    print(f"Item Name: {item.name}, Type: {item.type}")
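The loop above only shows the top level of a folder. If you want everything underneath it, one way is to recurse into each subfolder. This is a small sketch that relies only on the get_items(), name, and type members already used above; the function name walk_folder is my own.

```python
def walk_folder(folder, prefix=""):
    """Yield (path, item) for every item below `folder`, depth first.

    `folder` is anything with get_items(); items need .name and .type,
    which matches what the listing code above works with.
    """
    for item in folder.get_items():
        path = f"{prefix}/{item.name}"
        yield path, item
        if item.type == "folder":
            # Each subfolder costs at least one extra API request.
            yield from walk_folder(item, prefix=path)
```

You could call it as `for path, item in walk_folder(client.folder(folder_id='0')): print(path, item.type)`. On a big drive this makes a lot of requests, so start from the smallest folder that contains what you need.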
My Box account contains only one folder. Note that you can obtain a folder or file ID from item.id or by visiting the item’s page in a browser (the ID is required for some operations).
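When you copy the address from the browser, the ID is the number at the end of the path. Here is a small helper sketch for pulling it out, assuming the common URL shapes https://app.box.com/folder/<id> and https://app.box.com/file/<id>. The helper name is made up, and other URL shapes are simply rejected.

```python
import re
from urllib.parse import urlparse


def box_id_from_url(url):
    """Return (item_type, item_id) parsed from a Box web-app URL.

    Assumes a path containing /folder/<digits> or /file/<digits>;
    anything else raises ValueError.
    """
    path = urlparse(url).path  # drops the query string and fragment
    match = re.search(r"/(file|folder)/(\d+)", path)
    if match is None:
        raise ValueError(f"No file or folder ID found in {url!r}")
    return match.group(1), match.group(2)
```

The ID comes back as a string, which is the form the SDK examples in this article pass to client.folder() and client.user().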
You can also download files from a folder. Here, there’s just one image in the test_download folder.
Let’s run the following code to download the image.
folder_id = '307995450569'  # replace with your own folder ID
folder_to_download = client.folder(folder_id=folder_id)
for file_to_download in folder_to_download.get_items():
    if file_to_download.type != 'file':
        continue  # subfolders have no content() to download
    file_name = file_to_download.name
    bytes_file = file_to_download.content()  # the whole file is loaded into memory
    print(f'the file name is {file_name}') # the file name is cow_beach_1.png
    with open(file_name, "wb") as file:
        file.write(bytes_file)
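content() returns the whole file as one bytes object, which is fine for an image but can hurt with multi-gigabyte datasets. The SDK's File object also has a download_to() method that writes into an open file object instead. Here is a minimal sketch built on it; the helper name and the ".part" temporary suffix are my own choices.

```python
from pathlib import Path


def download_file_streaming(box_file, target_dir):
    """Write one Box file to disk via download_to() and return the local path.

    `box_file` needs .name and download_to(stream), as the SDK's File has.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    final_path = target_dir / box_file.name
    # Write under a temporary name first so an interrupted download never
    # leaves a truncated file under the final name.
    partial_path = target_dir / (box_file.name + ".part")
    with partial_path.open("wb") as stream:
        box_file.download_to(stream)
    partial_path.replace(final_path)
    return final_path
```

In the loop above you would replace the content() and open() lines with `download_file_streaming(file_to_download, ".")`.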
API Documentation
This article only covers basic setup and operations. You can explore more API functions in the official documentation to build your own applications.
Download Files From Shared Directory
Usually, though, we don’t host the data ourselves; we download datasets that others have shared. Imagine you’ve been given a shared link and its password. First, inspect the folder’s contents:
file = client.get_shared_item('<the_shared_link>', password='<password_of_the_shared_link>')
Running this command will get metadata about the shared item.
Note that if the link points to a folder, the API response doesn’t include the list of files inside it. To see and download those nested items, we will need to walk through the items in it. Let’s check the type of the object we received first:
print(type(file))  # <class 'boxsdk.object.folder.Folder'>
Since it’s a Box Folder object, you can use get_items() to iterate through its contents:
item_list = [item for item in file.get_items()]
print(item_list)
# [<Box Folder - 833833833833 (Video)>,
# <Box File - 512512512512 (readme.txt)>]
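If you don't know in advance whether a shared link points to a single file or to a folder, you can branch on the item's type attribute instead of printing the class. This is a small sketch under that assumption; the function name is made up for illustration.

```python
def files_behind_link(shared_item):
    """Return a list of items to look at, for a file link or a folder link.

    Relies only on the .type attribute and get_items(), as used above.
    """
    if shared_item.type == "file":
        return [shared_item]
    if shared_item.type == "folder":
        return list(shared_item.get_items())
    raise TypeError(f"Unsupported shared item type: {shared_item.type!r}")
```

With this, `item_list = files_behind_link(file)` gives the same list as above for a folder link and a one-element list for a file link.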
Suppose you want to download everything in the “Video” subfolder. First, select it from the list you just fetched:
video_folder = item_list[0]
print(video_folder.name)
print(type(video_folder))
# Video
# <class 'boxsdk.object.folder.Folder'>
Next, create a local directory where you’ll save the video clips:
from pathlib import Path
download_dir = Path("/path/for/saving")
download_dir.mkdir(parents=True, exist_ok=True)
Finally, iterate through the items in that folder and download only the MP4 files:
for item in video_folder.get_items():
    filename = item.name  # e.g. "holiday_clip.mp4"
    ext = Path(filename).suffix.lower()
    # only fetch + save if it’s an MP4
    if ext == ".mp4":
        print(f"Downloading {filename} …")
        bytes_file = item.content()  # <- network call happens here
        target_path = download_dir / filename  # e.g. /path/for/saving/holiday_clip.mp4
        with target_path.open("wb") as f:
            f.write(bytes_file)
        print(f"Saved to {target_path}")
    else:
        # decide what to do with non-MP4s (skip / save elsewhere / etc.)
        print(f"Skipping non-MP4 file: {filename}")
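If you find yourself doing this for several folders, the loop can be wrapped in a function. Here is one possible sketch that also skips files already on disk, so rerunning it after a dropped connection only fetches what is missing. The function name and its parameters are my own, not part of the SDK.

```python
from pathlib import Path


def download_matching(folder, download_dir, suffixes=(".mp4",), overwrite=False):
    """Download files in `folder` whose extension is in `suffixes`.

    Returns the list of paths written. Files that already exist locally
    are skipped unless `overwrite` is true.
    """
    download_dir = Path(download_dir)
    download_dir.mkdir(parents=True, exist_ok=True)
    wanted = {suffix.lower() for suffix in suffixes}
    saved = []
    for item in folder.get_items():
        # Ignore subfolders and files with other extensions.
        if item.type != "file" or Path(item.name).suffix.lower() not in wanted:
            continue
        target = download_dir / item.name
        if target.exists() and not overwrite:
            continue  # already downloaded on an earlier run
        target.write_bytes(item.content())  # network call happens here
        saved.append(target)
    return saved
```

For the example above, `download_matching(video_folder, download_dir)` does the same job, and `suffixes=(".mp4", ".mov")` would widen the filter.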
And that’s it! You now have all the MP4s from the shared folder saved to your local directory.