How To Download and Use a Hugging Face Model Locally

Learn how to download a model from Hugging Face via the Terminal, load it locally, and run it in Python.

This tutorial will teach you the following:

  1. How to download a Hugging Face model using the terminal.
  2. How to run your model locally in Python.

This is useful when you need the model weights for deployment, e.g. to a Docker container or similar.

Prerequisites

This tutorial assumes you have Python installed. If not, you can download it from python.org.

Get the latest stable release, and you’ll be ready to complete this tutorial!

Another way to get Python

There are many ways to download Python. A direct download from python.org is probably the simplest, but my favourite, in 2025, is uv.

This is slightly more involved, so only go down this route if you have a decent amount of prior development experience.

Step 1 — Download the Hugging Face CLI

Before downloading a model, you need to install the Hugging Face CLI. You can do this with the following command in the terminal:

pip install -U "huggingface_hub"

Once installed, you can check that this has worked by running this command in the terminal:

huggingface-cli

You should see output like this:

usage: huggingface-cli <command> [<args>]

positional arguments:
  {download,upload,repo-files,env,login,whoami,logout,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag}
                        huggingface-cli command helpers
    download            Download files from the Hub
    upload              Upload a file or a folder to a repo on the Hub
    repo-files          Manage files in a repo on the Hub
    env                 Print information about the environment.
    login               Log in using a token from huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account you are logged in as.
    logout              Log out
    repo                {create} Commands to interact with your huggingface.co repos.
    lfs-enable-largefiles
                        Configure your repository to enable upload of files > 5GB.
    scan-cache          Scan cache directory.
    delete-cache        Delete revisions from the cache directory.
    tag                 (create, list, delete) tags for a repo in the hub

options:
  -h, --help            show this help message and exit

If you see this output or something similar, move on to the next step.
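If you’d rather check from Python (say, inside a setup script), here’s a minimal sketch that looks for the executable on your PATH. The helper name find_cli is my own, not part of huggingface_hub, and it only confirms the command exists, not that it works.

```python
import shutil
from typing import Optional


def find_cli(name: str = "huggingface-cli") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


cli_path = find_cli()
if cli_path is None:
    # A common cause is that pip's script directory is not on PATH.
    print("huggingface-cli was not found on PATH")
else:
    print(f"huggingface-cli found at {cli_path}")
```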

Step 2 — Downloading the Model

Visit the page for the model you want to download. For example, I’m interested in this Qwen embedding model:

The model card for the Qwen3 4 billion parameter embedding model.

Copy the name of the model by clicking the copy button at the very top of this page:

The title of the model with copy button.

Paste it into this command on the terminal:

huggingface-cli download <your-copied-model> --local-dir ./path/to/your/dir

As an example, here’s what my command looks like:

huggingface-cli download Qwen/Qwen3-Embedding-4B --local-dir ./qwen-embedding-model-4b

You should see output like the following:

Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]
Downloading 'config.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/config.json.8b4b87fc69023e7a224eb6563753aaf3223d8b98.incomplete'
Downloading 'model-00001-of-00002.safetensors' to 'qwen-embedding-model-4b/.cache/huggingface/download/model-00001-of-00002.safetensors.e70bfe3c970523fb7ef4eddffed2254ce3f1e7150c3de2af4342de129dd756f8.incomplete'
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 727/727 [00:00<00:00, 92.2kB/s]
Download complete. Moving file to qwen-embedding-model-4b/config.json | 0.00/727 [00:00<?, ?B/s]
Downloading 'README.md' to 'qwen-embedding-model-4b/.cache/huggingface/download/README.md.81d922bc72353348a181473b9cc0ee53571ae13b.incomplete'
Downloading 'config_sentence_transformers.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/config_sentence_transformers.json.76aef3ade63553ebb698fe3c2a3264040ed093f8.incomplete'
Downloading 'generation_config.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/generation_config.json.d46f1983345269c582611bbedb3ca0a13f8e5f7b.incomplete'
Downloading 'merges.txt' to 'qwen-embedding-model-4b/.cache/huggingface/download/merges.txt.31349551d90c7606f325fe0f11bbb8bd5fa0d7c7.incomplete'
config_sentence_transformers.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 215/215 [00:00<00:00, 2.44MB/s]
Download complete. Moving file to qwen-embedding-model-4b/config_sentence_transformers.json
README.md: 17.3kB [00:00, 2.08MB/s] 0%| | 0.00/215 [00:00<?, ?B/s]
Download complete. Moving file to qwen-embedding-model-4b/README.md
Downloading '1_Pooling/config.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/1_Pooling/config.json.81de5602eacbce382009c5af7a23085871801d8f.incomplete'
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 313/313 [00:00<00:00, 4.67MB/s]
Download complete. Moving file to qwen-embedding-model-4b/1_Pooling/config.json | 0.00/4.97G [00:00<?, ?B/s]
Downloading 'model-00002-of-00002.safetensors' to 'qwen-embedding-model-4b/.cache/huggingface/download/model-00002-of-00002.safetensors.ed1b87c8e9eb7e535a1a155e4fd00d9f4dba80e58a6db48a4c9f82cede7079c1.incomplete'
merges.txt: 1.67MB [00:00, 27.0MB/s] | 0.00/313 [00:00<?, ?B/s]
Download complete. Moving file to qwen-embedding-model-4b/merges.txt
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 117/117 [00:00<00:00, 1.49MB/s]
Download complete. Moving file to qwen-embedding-model-4b/generation_config.json
Downloading '.gitattributes' to 'qwen-embedding-model-4b/.cache/huggingface/download/.gitattributes.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete' | 0.00/117 [00:00<?, ?B/s]
Downloading 'modules.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/modules.json.952a9b81c0bfd99800fabf352f69c7ccd46c5e43.incomplete'
Downloading 'model.safetensors.index.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/model.safetensors.index.json.3d736ef26714eee0abde3e05104ee1b3ec26c974.incomplete' | 0.00/3.08G [00:00<?, ?B/s]
Downloading 'tokenizer.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/tokenizer.json.83cdf8c3a34f68862319cb1810ee7b1e2c0a44e0864ae930194ddb76bb7feb8d.incomplete'
modules.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 5.93MB/s]
Download complete. Moving file to qwen-embedding-model-4b/modules.json
model.safetensors.index.json: 30.4kB [00:00, 1.42MB/s]
Download complete. Moving file to qwen-embedding-model-4b/model.safetensors.index.json
Downloading 'tokenizer_config.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/tokenizer_config.json.df3a9d96759529ca1006eb6db024bbb099a97578.incomplete' | 0.00/349 [00:00<?, ?B/s]
.gitattributes: 1.57kB [00:00, 6.52MB/s]
Download complete. Moving file to qwen-embedding-model-4b/.gitattributes | 10.5M/4.97G [00:00<01:33, 52.8MB/s]
Downloading 'vocab.json' to 'qwen-embedding-model-4b/.cache/huggingface/download/vocab.json.4783fe10ac3adce15ac8f358ef5462739852c569.incomplete'
tokenizer_config.json: 7.26kB [00:00, 279kB/s] | 1/14 [00:00<00:06, 1.96it/s]
Download complete. Moving file to qwen-embedding-model-4b/tokenizer_config.json
vocab.json: 2.78MB [00:00, 3.04MB/s]
Download complete. Moving file to qwen-embedding-model-4b/vocab.json | 73.4M/4.97G [00:01<01:16, 64.1MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:04<00:00, 2.67MB/s]
Download complete. Moving file to qwen-embedding-model-4b/tokenizer.json | 273M/4.97G [00:04<01:20, 58.3MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.08G/3.08G [01:42<00:00, 30.0MB/s]
Download complete. Moving file to qwen-embedding-model-4b/model-00002-of-00002.safetensors█████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 3.70G/4.97G [01:42<00:50, 25.0MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.08G/3.08G [01:42<00:00, 40.4MB/s]

Wait for the various downloads to complete and move on to the next step.
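If you’d prefer to script the download rather than type it into the terminal, the huggingface_hub package you installed in Step 1 also exposes a snapshot_download function. The sketch below wraps it in a small helper; the name download_model is mine, and I’m assuming you want the whole repository in one folder, just like the CLI command above.

```python
def download_model(repo_id: str, local_dir: str) -> str:
    """Download every file in a Hub repo into local_dir and return the folder path."""
    # Imported inside the function so this file can still be imported
    # on machines where huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling download_model("Qwen/Qwen3-Embedding-4B", "./qwen-embedding-model-4b") should leave you with the same folder as the terminal command. Nothing is downloaded until you call it.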

Step 3 — Loading the Model

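If you don’t have them already, you’ll need to install torch and transformers. The embedding example below also uses sentence-transformers, and device_map="auto" in the text generation example relies on accelerate. You can install them all with the following terminal command:

pip install torch transformers sentence-transformers accelerate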
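Before loading anything, it can be worth confirming that the download from Step 2 actually finished. Here’s a rough sketch that checks for config.json and, when a model.safetensors.index.json is present, for every weight shard it lists under weight_map. The helper name missing_files is hypothetical, and the check is deliberately shallow: it looks at file names only, not file contents.

```python
import json
from pathlib import Path
from typing import List


def missing_files(model_dir: str) -> List[str]:
    """List expected files that are absent from a downloaded model folder."""
    root = Path(model_dir)
    expected = {"config.json"}

    # Sharded checkpoints ship an index that maps each weight to its shard file.
    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        index = json.loads(index_path.read_text(encoding="utf-8"))
        expected.update(index["weight_map"].values())

    return sorted(name for name in expected if not (root / name).is_file())


# An empty list means nothing obvious is missing.
print(missing_files("./qwen-embedding-model-4b"))
```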

How you load the model is determined by the model itself. An easy way to see the correct method is to scroll down on the model card:

The Qwen 0.6b text generation model instructions for usage.

You will typically see something like this, which explains how you can use the model effectively. However, for clarity, I will show two standard examples for an embedding model and a text generation model.
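If you aren’t sure which of the two kinds of model you have downloaded, the folder itself offers some hints. This sketch reads the architectures field from config.json and checks for a modules.json file, which Sentence Transformers models (like the Qwen embedding model above) include. The function name describe_model_dir is my own, and the result is only a hint; the model card is still the best source.

```python
import json
from pathlib import Path
from typing import Any, Dict


def describe_model_dir(model_dir: str) -> Dict[str, Any]:
    """Summarise what a downloaded model folder says about how to load it."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    return {
        # Model class names the checkpoint was saved with, if the config lists any.
        "architectures": config.get("architectures", []),
        # True when the folder carries a Sentence Transformers module list.
        "has_sentence_transformers_config": (root / "modules.json").is_file(),
    }
```

For example, describe_model_dir("./qwen-embedding-model-4b") should report that modules.json is present, which points you towards the SentenceTransformers route below.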

Embedding Model

The easiest way to load one of these models is with SentenceTransformers. They provide an API that wraps most state-of-the-art embedding models and simplifies their usage.

To load your model, you can do this:

from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("path/to/your/model/directory")

Here’s an example of me loading the Qwen 4b embedding model:

from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("./qwen-embedding-model-4b")

Text Generation Model

To load text generation models, you use the standard Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
# Path to the folder where you downloaded the model
model_path = "path/to/your/model/directory"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)

As an example, here’s me loading the Qwen 0.6b text generation model:

from transformers import AutoTokenizer, AutoModelForCausalLM
# Path to the folder where you downloaded the model
model_path = "./qwen-text-generation-model-0_6b"
# Load tokenizer and model from local files
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

This gives you a tokenizer and a model, which together you can use to generate text.
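If the goal is to run with no internet access at all, one option is to pass local_files_only=True to from_pretrained so that transformers never tries to reach the Hub. Here’s a minimal sketch; the wrapper name load_local_causal_lm and the up-front directory check are my additions, not something the model card asks for.

```python
from pathlib import Path


def load_local_causal_lm(model_path: str):
    """Load a tokenizer and causal LM strictly from a local folder."""
    if not Path(model_path).is_dir():
        # Fail early with a clear message instead of a Hub lookup error.
        raise FileNotFoundError(f"No model directory at {model_path!r}")

    # Imported here so the directory check works even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True)
    return tokenizer, model
```

You would call it as tokenizer, model = load_local_causal_lm("./qwen-text-generation-model-0_6b").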

Step 4 — Using the Model

Just as in the previous step, the model card will typically show you how to use the loaded model. However, here are two examples for embedding and text generation:

Embedding with Sentence Transformers

As we’ve loaded the model with sentence_transformers, we have access to its standard API, which means we can embed queries and documents and check their similarity with:

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
# Encode the queries and documents. Note that queries benefit from using a
# prompt. Here we use the prompt called "query" stored under `model.prompts`,
# but you can also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)
# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7534, 0.1147],
#         [0.0320, 0.6258]])

This is straight from Qwen’s model documentation. The only potentially unfamiliar part is prompt_name="query" on the query_embeddings line. This prompt is stored inside the model definition folder and is used to control how the text is embedded.

For Qwen 4b, you can find it under config_sentence_transformers.json:

{
  "prompts": {
    "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}

If the standard prompt doesn’t apply to your use case, you can add another one and use that instead. Make sure you use the Instruct: and \nQuery: format when you do.
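As an alternative to editing the JSON file, you can build the prompt string in Python and hand it to encode through the prompt argument mentioned in the code comment above. This is only a sketch: build_query_prompt is a hypothetical helper, and the task description is an invented example.

```python
def build_query_prompt(task: str) -> str:
    """Format a task description in the Instruct/Query layout shown above."""
    return f"Instruct: {task}\nQuery:"


custom_prompt = build_query_prompt(
    "Given a customer question, retrieve support articles that answer it"
)
print(repr(custom_prompt))

# With a loaded SentenceTransformer called `model`, the prompt could be used as:
# query_embeddings = model.encode(queries, prompt=custom_prompt)
```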

Text Generation with Transformers

To generate text, you can use the standard transformers API:

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)

Again, this is taken directly from the model card for the Qwen 0.6b text generation model, but the same methodology can be applied to other thinking models.
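To make that reuse a little easier, here’s a sketch that pulls the splitting logic into a standalone function. The name split_thinking and the end_think_id parameter are mine. The default of 151668 is the </think> token id from the example above; for another model you would need to look up its own id, for instance with tokenizer.convert_tokens_to_ids("</think>"), assuming that model uses the same tag.

```python
from typing import List, Tuple


def split_thinking(
    output_ids: List[int], end_think_id: int = 151668
) -> Tuple[List[int], List[int]]:
    """Split generated token ids into (thinking, answer) at the last end-of-thinking token."""
    try:
        # Position just after the last occurrence of the end-of-thinking token.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No thinking block was generated, so everything is the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


print(split_thinking([11, 12, 151668, 21, 22]))
# ([11, 12, 151668], [21, 22])
```

Each half can then be passed to tokenizer.decode exactly as in the example above.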

Final Words

That’s it! I hope this helps you work with models without internet access and upload the weights and configuration directly to wherever you deploy them. Happy transforming!
