How to Download Kaggle Datasets Using Python and PySpark

Kaggle is one of the most popular platforms for data scientists and machine learning enthusiasts. It provides access to a vast collection of datasets that you can use for exploratory data analysis, model building, and machine learning projects.

In this article, I’ll guide you through the process of downloading Kaggle datasets using Python and loading them into PySpark for big data processing.

Prerequisites

Before we start, make sure you have the following installed:

Python 3.7+
Kaggle API (kaggle) and KaggleHub (kagglehub) libraries
Pandas (pandas) for data manipulation
PySpark (pyspark) for big data processing

Installing Dependencies

Run the following command to install the required libraries:

pip install kaggle kagglehub pandas pyspark

Setting Up Kaggle API Credentials

To download datasets from Kaggle, you need to authenticate using an API key. Follow these steps:

Step 1: Get Your Kaggle API Key

  1. Go to Kaggle.
  2. Click on your profile picture (top-right) → Select Account.
  3. Scroll down to the API section and click on Create New API Token.
  4. This will download a file named kaggle.json.

Step 2: Move kaggle.json to the Correct Location

  • On macOS/Linux, move the kaggle.json file to your Kaggle API folder and restrict its permissions:

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

  • On Windows, place kaggle.json in C:\Users\<YourUsername>\.kaggle\
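If you prefer not to manage a kaggle.json file, the Kaggle API also reads credentials from the KAGGLE_USERNAME and KAGGLE_KEY environment variables. A minimal sketch (the username and key values below are placeholders — substitute your own):

```python
import os

# Placeholder credentials -- replace with the values from your own API token.
# The Kaggle client checks these environment variables when kaggle.json is absent.
os.environ["KAGGLE_USERNAME"] = "your_username"
os.environ["KAGGLE_KEY"] = "your_api_key"

print("Kaggle credentials set for user:", os.environ["KAGGLE_USERNAME"])
```

This is handy in CI pipelines or notebooks where writing files to the home directory is awkward.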

Download Kaggle Dataset Using Python

Now that the Kaggle API is set up, let’s download a dataset using KaggleHub.

Step 1: Download Dataset Using kagglehub

import kagglehub
import os
import pandas as pd
# Download the dataset (kagglehub expects the full "owner/dataset-name" slug,
# which you can copy from the dataset's Kaggle page)
dataset_path = kagglehub.dataset_download("t20matches")
print("Dataset downloaded at:", dataset_path)
# List dataset files
dataset_files = os.listdir(dataset_path)
print("Files in dataset:", dataset_files)

Step 2: Load the Dataset Into Pandas

# Find the first CSV file in the dataset
csv_file = None
for file in dataset_files:
    if file.endswith(".csv"):
        csv_file = os.path.join(dataset_path, file)
        break

if csv_file:
    df = pd.read_csv(csv_file)
    print("\nFirst 5 rows of the dataset:")
    print(df.head())
else:
    print("\nNo CSV file found in the dataset.")
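Once the data is in a DataFrame, pandas makes quick inspection cheap. A self-contained sketch using a tiny synthetic frame (the column names are hypothetical stand-ins for whatever your downloaded CSV contains):

```python
import pandas as pd

# Synthetic stand-in for a downloaded match CSV (columns are hypothetical)
df = pd.DataFrame({
    "match_id": [1, 2, 3],
    "team": ["A", "B", "A"],
    "runs": [150, 142, 168],
})

# Quick sanity checks before deeper analysis
print(df.shape)           # rows x columns
print(df.dtypes)          # column types
print(df["runs"].mean())  # a simple aggregate
```

Checking the shape, dtypes, and a basic aggregate first often catches parsing problems (for example, numeric columns loaded as strings) before they surface later.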

Full Python Script

import kagglehub
import os
import pandas as pd
# Step 1: Download the latest dataset version (kagglehub expects the full
# "owner/dataset-name" slug from the dataset's Kaggle page)
dataset_path = kagglehub.dataset_download("t20matches")
print("Dataset downloaded at:", dataset_path)
# Step 2: List all files in the downloaded dataset directory
dataset_files = os.listdir(dataset_path)
print("Files in dataset:", dataset_files)
# Step 3: Load a specific file (e.g., CSV) into Pandas
csv_file = None
# Find the first CSV file in the dataset
for file in dataset_files:
    if file.endswith(".csv"):
        csv_file = os.path.join(dataset_path, file)
        break

if csv_file:
    df = pd.read_csv(csv_file)
    print(df.head())
else:
    print("\nNo CSV file found in the dataset.")
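The manual loop above can also be written with pathlib's glob, which finds all matching files in one call. A self-contained sketch (the temporary directory stands in for the downloaded dataset folder):

```python
import tempfile
from pathlib import Path

# Create a throwaway directory standing in for the downloaded dataset
dataset_path = Path(tempfile.mkdtemp())
(dataset_path / "b.csv").write_text("x\n1\n")
(dataset_path / "a.csv").write_text("x\n1\n")
(dataset_path / "readme.txt").write_text("notes\n")

# glob("*.csv") replaces the endswith() loop; sorted() makes the order stable
csv_files = sorted(dataset_path.glob("*.csv"))
csv_file = csv_files[0] if csv_files else None
print([p.name for p in csv_files])
```

Sorting matters when a dataset ships several CSVs: it guarantees you pick the same file on every run.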

Load Kaggle Dataset Into PySpark

If you’re working with datasets too large to fit comfortably in a single machine’s memory, PySpark lets you process them in a distributed fashion.

Step 1: Initialize PySpark

from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("KaggleDatasetProcessing") \
    .getOrCreate()

Step 2: Read the CSV File Into a Spark DataFrame

# Load the CSV file into a PySpark DataFrame
if csv_file:
    spark_df = spark.read.option("header", "true").csv(csv_file)
    print("\nPySpark DataFrame Schema:")
    spark_df.printSchema()
else:
    print("\nNo CSV file found to load into PySpark.")

Conclusion

In this article, we covered:

  • Setting up Kaggle API credentials
  • Downloading datasets using KaggleHub
  • Loading datasets into Pandas
  • Processing large datasets using PySpark

Now, you can use these techniques to explore and analyze Kaggle datasets efficiently!

Next Steps

  • Try different Kaggle datasets and load them into PySpark.
  • Use PySpark SQL for querying large datasets.
  • Apply ML models on big datasets using PySpark MLlib.

Let me know if you have any questions! Happy coding!
