Kaggle is one of the most popular platforms for data scientists and machine learning enthusiasts. It provides access to a vast collection of datasets that you can use for exploratory data analysis, model building, and machine learning projects.
In this article, I’ll guide you through the process of downloading Kaggle datasets using Python and loading them into PySpark for big data processing.
Prerequisites
Before we start, make sure you have the following installed:
- Python 3.7+
- Kaggle API (`kaggle`) and KaggleHub (`kagglehub`) libraries
- Pandas (`pandas`) for data manipulation
- PySpark (`pyspark`) for big data processing
Installing Dependencies
Run the following command to install the required libraries:
pip install kaggle kagglehub pandas pyspark
Setting Up Kaggle API Credentials
To download datasets from Kaggle, you need to authenticate using an API key. Follow these steps:
Step 1: Get Your Kaggle API Key
- Go to Kaggle.
- Click on your profile picture (top-right) → Select Account.
- Scroll down to the API section and click on Create New API Token.
- This will download a file named kaggle.json.
Step 2: Move kaggle.json to the Correct Location
- Move the kaggle.json file to your Kaggle API folder:

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

- On Windows, place kaggle.json in C:\Users\<YourUsername>\.kaggle\
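If you'd rather not keep a credentials file on disk (for example in CI or a shared notebook), the Kaggle client also reads credentials from the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`. A minimal sketch — the values below are placeholders, not real credentials:

```python
import os

# Set Kaggle credentials via environment variables instead of kaggle.json.
# Replace the placeholder values with your own username and API key.
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"
os.environ["KAGGLE_KEY"] = "your_api_key"

# Any kaggle / kagglehub call made later in this process picks these up.
print("Kaggle credentials configured for:", os.environ["KAGGLE_USERNAME"])
```

Set these before importing or calling the Kaggle libraries so authentication succeeds on the first request.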
Download Kaggle Dataset Using Python
Now that the Kaggle API is set up, let’s download a dataset using KaggleHub.
Step 1: Download Dataset Using kagglehub
import kagglehub
import os
import pandas as pd
# Download dataset (pass the full "owner/dataset-name" handle shown in the dataset's Kaggle URL)
dataset_path = kagglehub.dataset_download("t20matches")
print("Dataset downloaded at:", dataset_path)
# List dataset files
dataset_files = os.listdir(dataset_path)
print("Files in dataset:", dataset_files)
Step 2: Load the Dataset Into Pandas
# Find the first CSV file in the dataset
csv_file = None
for file in dataset_files:
    if file.endswith(".csv"):
        csv_file = os.path.join(dataset_path, file)
        break

if csv_file:
    df = pd.read_csv(csv_file)
    print("\nFirst 5 rows of the dataset:")
    print(df.head())
else:
    print("\nNo CSV file found in the dataset.")
Full Python Script
import kagglehub
import os
import pandas as pd

# Step 1: Download the latest dataset version (use the full "owner/dataset-name" handle)
dataset_path = kagglehub.dataset_download("t20matches")
print("Dataset downloaded at:", dataset_path)

# Step 2: List all files in the downloaded dataset directory
dataset_files = os.listdir(dataset_path)
print("Files in dataset:", dataset_files)

# Step 3: Load a specific file (e.g., CSV) into Pandas
csv_file = None

# Find the first CSV file in the dataset
for file in dataset_files:
    if file.endswith(".csv"):
        csv_file = os.path.join(dataset_path, file)
        break

if csv_file:
    df = pd.read_csv(csv_file)
    print(df.head())
else:
    print("\nNo CSV file found in the dataset.")
Load Kaggle Dataset Into PySpark
If you’re working with datasets too large to fit comfortably in a single machine’s memory, PySpark lets you distribute the processing across a cluster (or across your local cores).
Step 1: Initialize PySpark
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("KaggleDatasetProcessing") \
    .getOrCreate()
Step 2: Read the CSV File Into a Spark DataFrame
# Load CSV file into PySpark DataFrame
if csv_file:
    spark_df = spark.read.option("header", "true").csv(csv_file)
    print("\nPySpark DataFrame Schema:")
    spark_df.printSchema()
else:
    print("\nNo CSV file found to load into PySpark.")
Conclusion
In this article, we covered:
- Setting up Kaggle API credentials
- Downloading datasets using KaggleHub
- Loading datasets into Pandas
- Processing large datasets using PySpark
Now, you can use these techniques to explore and analyze Kaggle datasets efficiently!
Next Steps
- Try different Kaggle datasets and load them into PySpark.
- Use PySpark SQL for querying large datasets.
- Apply ML models on big datasets using PySpark MLlib.
Let me know if you have any questions! Happy coding!