In more general terms, RAG is an important concept, especially when crafting more specialized LLM applications. It helps reduce the risk of the model giving wrong answers or hallucinating in general.
These are some open-source projects that might be helpful when approaching RAG in one of your projects:
- txtai: All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows.
- LangChain: LangChain is a framework for developing applications powered by large language models (LLMs).
- Qdrant: Vector Search Engine for the next generation of AI applications.
- Weaviate: Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable.
Of course, given the potential value of this approach for LLM-based applications, there are many more open- and closed-source alternatives, but these should be enough to get your research on the topic started.
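To make the retrieval part more tangible, here is a minimal sketch of the retrieval step in Python using the sentence-transformers library; the embedding model name and the toy documents are placeholders, and in a real project the document embeddings would live in one of the vector databases listed above rather than in memory:

from sentence_transformers import SentenceTransformer, util

# toy knowledge base, in practice stored and indexed in a vector database
documents = [
    "Our support hotline is available around the clock.",
    "Mindfulness exercises can help to reduce acute stress.",
    "Llama 2 is a family of large language models released by Meta."
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query, top_k=2):
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    return [documents[i] for i in scores.topk(top_k).indices.tolist()]

# the retrieved documents are prepended to the LLM prompt as grounding context
context = "\n".join(retrieve("How can I deal with stress?"))

The retrieved context is then injected into the prompt, so the model answers based on your own, trusted data instead of relying purely on what it memorized during training.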
Hackathon Guide: Building Your First Mental Health Chatbot
Disclaimer: the following chapter should help people who are interested in developing an AI-powered chatbot get started. It is not meant to be a sophisticated, production-ready mental health support solution.
To inspire you to get started with your own AI-driven project related to mental health, let’s fine-tune an LLM and create our own basic AI-powered mental health support chatbot step by step.
Since fine-tuning an LLM requires a capable environment, I am using a Jupyter notebook running in a Google Cloud Vertex AI Workbench instance. Vertex AI Workbench instances are Jupyter notebook-based development environments for the entire data science workflow. These instances come prepackaged with JupyterLab and a preinstalled suite of deep learning packages, including support for the TensorFlow and PyTorch frameworks. You can configure different types of instances based on your needs.
To finish the fine-tuning process in a reasonable amount of time and to have access to some modern features like FlashAttention (explained later), I used the following machine type:
- GPU type: NVIDIA A100 80GB
- Number of GPUs: 1
- 12 vCPUs
- 6 cores
- 170 GB memory
Running this instance costs around $4.19 per hour. Since instances are billed per second and only for what you use, there are no upfront costs. The fine-tuning process takes around 30 minutes, so the total cost is around $2.
You can also run the process on your local machine or in Google Colab, a web-based platform built around Jupyter notebooks. You access Colab through your web browser; no software installation is needed on your own computer.
The code you run in Colab actually executes on powerful machines in Google’s cloud, not your personal computer. This gives you access to advanced hardware like GPUs and TPUs, which are great for speeding up data analysis and machine learning tasks.
Colab provides a user-friendly environment with powerful computing resources in the cloud, all accessible through your web browser, and what is really cool: you can get started for free. The free tier already offers access to hardware accelerators; however, free Colab resources are not guaranteed or unlimited, and usage limits sometimes fluctuate. These interruptions might be frustrating, but that is the price of a sophisticated, free notebook platform.
Speaking of prices, you can of course upgrade to another plan, such as Pay As You Go or Colab Pro.
For this example, the free version with a T4 GPU would not offer enough resources for the fine-tuning process, which is why I chose a more powerful Vertex AI Workbench instance. However, Colab is a great way to get started with projects like this, so I still wanted to mention it.
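Whichever environment you end up with, it is worth running a quick sanity check in the notebook to confirm that PyTorch actually sees the GPU before starting the fine-tuning; the exact device name will of course depend on your instance:

import torch

print(torch.cuda.is_available())      # should print True on a GPU instance
print(torch.cuda.get_device_name(0))  # e.g. the A100 of the Workbench instance
print(torch.version.cuda)             # CUDA version PyTorch was built with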
Parameter-Efficient Fine-Tuning (PEFT) of Llama 2 with Mental Health Counseling Data
With an NVIDIA A100 80GB Tensor Core GPU, we have a really good basis for our fine-tuning process.
As explained earlier, fine-tuning LLMs is often costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods offer an efficient alternative by fine-tuning only a small number of model parameters.
In this example, we will use the meta-llama/Llama-2-7b-chat-hf model by Meta, hosted on Hugging Face. This model has 7 billion parameters and is optimized for dialogue. To fine-tune it, we will use the Amod/mental_health_counseling_conversations dataset, also available on Hugging Face, which contains a collection of questions and answers sourced from two online counseling and therapy platforms, covering a wide range of mental health topics.
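To get a feeling for the data, you can load the dataset and inspect a single entry; each sample has a Context field containing the question and a Response field containing the counselor’s answer:

from datasets import load_dataset

dataset = load_dataset("Amod/mental_health_counseling_conversations", split="train")

sample = dataset[0]
print(sample["Context"])   # question asked on the counseling platform
print(sample["Response"])  # answer written by a counselor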
The basic idea is: we load the model, tokenizer and dataset from Hugging Face. Then we create a LoraConfig with settings based on the previously mentioned Quantized LoRA (QLoRA) paper, prepare the model for training, configure a so-called SFTTrainer (Supervised Fine-Tuning Trainer) for the fine-tuning process, train the model, save it, and finally push the fine-tuned model back to Hugging Face so that we can use it in an application later.
As explained, I am running the process within a Jupyter notebook, so let’s break down each individual step of the fine-tuning procedure.
First, we install all required libraries, including PyTorch and the toolkits provided by Hugging Face. The environment this is running in allows us to use FlashAttention, based on the FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness paper, which requires CUDA 11, NVCC, and a Turing or Ampere GPU. This particular dependency has to be installed after torch, so we run it separately in a second step.
pip install torch torchvision datasets transformers tokenizers bitsandbytes peft accelerate trl
pip install flash-attn
Then, we import everything we need for the fine-tuning process:
import gc
import torch
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from trl import SFTTrainer
Next, we set up some variables to specify the model we are going to use, the dataset, and the Hugging Face User Access Token. This token is used to interact with the Hugging Face platform to download and publish models, datasets and more. To create a token, register for free at https://huggingface.co/, then open your account settings and select Access Tokens from the menu. For this process, we need a token with write access since we are going to publish the fine-tuned model to Hugging Face later.
If you want to try the fine-tuning yourself, just replace the placeholder in the following code with your own token.
# see: https://huggingface.co/docs/hub/security-tokens
# must be write token to push model later
hf_token = "your-token"

# https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
base_model = "meta-llama/Llama-2-7b-chat-hf"
# https://huggingface.co/datasets/Amod/mental_health_counseling_conversations
fine_tuning_dataset = "Amod/mental_health_counseling_conversations"
# name for output model
target_model = "vojay/Llama-2-7b-chat-hf-mental-health"
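As a side note: instead of hard-coding the token, you can also authenticate interactively with the notebook_login function we imported above, which opens an input widget in the notebook and stores the token for the current environment:

# interactive alternative to hard-coding hf_token
notebook_login()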
For the next part it is important to understand that prompts are usually composed of multiple elements following a specific template. This depends on the model, of course; the llama-2-chat model uses the following format, based on the Llama 2 paper, to define system and instruction prompts:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST] {{ model_response }} </s>
This format might look cryptic at first, but it becomes clearer when looking at the individual elements:
- <s>: beginning of sequence.
- </s>: end of sequence.
- <<SYS>>: beginning of system message.
- <</SYS>>: end of system message.
- [INST]: beginning of instructions.
- [/INST]: end of instructions.
- system_prompt: overall context for model responses.
- user_message: user instructions for generating output.
- model_response: expected model response, used for training only.
When we train the model, we must follow this format, so the next step is to define a proper template and functions to transform the sample data accordingly. Let’s start with the system or base prompt to create an overall context:
def get_base_prompt():
    return """
    You are a knowledgeable and supportive psychologist. You provide emphatic, non-judgmental responses to users seeking
    emotional and psychological support. Provide a safe space for users to share and reflect, focus on empathy, active
    listening and understanding.
    """
We will re-use this base prompt later to enrich user input before we send it to the LLM for evaluation. This is also a good starting point for your own projects, since improving the base prompt can make the LLM respond much better.
Now let’s define a function to format training data accordingly:
def format_prompt(base, context, response):
    return f"<s>[INST] <<SYS>>{base}<</SYS>>{context} [/INST] {response} </s>"
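To see what a single training sample looks like after formatting, you can run the function on a hypothetical question and answer pair (both strings are made up for illustration):

example = format_prompt(
    get_base_prompt(),
    "I feel anxious and can barely sleep. What can I do?",  # hypothetical Context
    "It sounds like you are going through a stressful time."  # hypothetical Response
)
print(example)  # <s>[INST] <<SYS>> ... <</SYS>> ... [/INST] ... </s>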
The next step is the fine-tuning itself, which is wrapped into a function, so that we first simply define the process and then execute it in a separate cell of the notebook:
def train_mental_health_model():
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        token=hf_token,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=False
        ),
        torch_dtype=torch.float16,  # reduce memory usage
        attn_implementation="flash_attention_2"  # optimize for tensor cores (NVIDIA A100)
    )

    # LoRA config based on QLoRA paper
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)

    args = TrainingArguments(
        output_dir=target_model,  # model output directory
        overwrite_output_dir=True,  # overwrite output if exists
        num_train_epochs=2,  # number of epochs to train
        per_device_train_batch_size=2,  # batch size per device during training
        gradient_checkpointing=True,  # save memory but causes slower training
        logging_steps=10,  # log every 10 steps
        learning_rate=1e-4,  # learning rate
        max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
        warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
        optim="paged_adamw_8bit",  # memory-efficient variant of AdamW optimizer
        lr_scheduler_type="constant",  # constant learning rate
        save_strategy="epoch",  # save at the end of each epoch
        evaluation_strategy="epoch",  # evaluation at the end of each epoch
        fp16=True,  # use fp16 16-bit precision training instead of 32-bit to save memory
        tf32=True  # optimize for tensor cores (NVIDIA A100)
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model, token=hf_token)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # limit samples to reduce memory usage
    dataset = load_dataset(fine_tuning_dataset, split="train")
    train_dataset = dataset.select(range(2000))
    eval_dataset = dataset.select(range(2000, 2500))

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        max_seq_length=1024,
        tokenizer=tokenizer,
        formatting_func=lambda entry: format_prompt(get_base_prompt(), entry["Context"], entry["Response"]),
        packing=True,
        args=args
    )

    gc.collect()
    torch.cuda.empty_cache()

    trainer.train()
    trainer.save_model()
    trainer.push_to_hub(target_model, token=hf_token)
I added comments to all training arguments to make the configuration transparent. However, the specifics depend on the environment you are running the training in and on the input model and dataset, so adjustments might be necessary.
Let’s have a closer look at how the process works in detail. With AutoModelForCausalLM.from_pretrained we load the model, and by setting the quantization_config we quantize it to 4-bit weights and activations, which offers benefits in terms of memory usage and performance. By setting attn_implementation to flash_attention_2, we enable the FlashAttention-2 implementation.
FlashAttention-2 is a faster and more efficient implementation of the standard attention mechanism that can significantly speed up inference by additionally parallelizing the attention computation over sequence length and partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them.
The LoraConfig configures the Low-Rank Adaptation (LoRA) process: lora_alpha controls the scaling factor for the weight matrices, lora_dropout sets the dropout probability of the LoRA layers, r controls the rank of the low-rank matrices, bias determines how bias terms are handled, and task_type reflects the task of the fine-tuned model.
Once the LoraConfig is set up, we create the PeftModel with the get_peft_model() function.
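If you want to verify how few parameters are actually trained, you can call print_trainable_parameters() on the wrapped model right after this step; the exact numbers depend on the LoRA configuration:

# only the LoRA adapter weights require gradients after get_peft_model()
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...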
With the prepared model, the next step is to prepare the training. To do so, we create a TrainingArguments object, which controls all major aspects of the training process, including:
output_dir=target_model, # model output directory
overwrite_output_dir=True, # overwrite output if exists
num_train_epochs=2, # number of epochs to train
per_device_train_batch_size=2, # batch size per device during training
gradient_checkpointing=True, # save memory but causes slower training
logging_steps=10, # log every 10 steps
learning_rate=1e-4, # learning rate
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
optim="paged_adamw_8bit", # memory-efficient variant of AdamW optimizer
lr_scheduler_type="constant", # constant learning rate
save_strategy="epoch", # save at the end of each epoch
evaluation_strategy="epoch", # evaluation at the end of each epoch
fp16=True, # use fp16 16-bit precision training instead of 32-bit to save memory
tf32=True # optimize for tensor cores (NVIDIA A100)
Afterwards, we create the tokenizer for the model with AutoTokenizer.from_pretrained.
The next step is to load the mental health dataset. Here we are limiting the sample size to reduce memory usage and speed up the training.
With all that, we can instantiate the SFTTrainer, then train, save and publish the fine-tuned model using trainer.push_to_hub.
In the next step, we call the train_mental_health_model() function and can simply watch while the magic happens:
train_mental_health_model()
I pushed the fine-tuned model to Hugging Face, so you can fetch it from there if you want to skip the fine-tuning process.
Keep in mind that this fine-tuned model is actually an adapter for the base model, which means that in order to use it, we need to load the base model and apply the fine-tuning adapter:
model_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_model_id = "vojay/Llama-2-7b-chat-hf-mental-health"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.load_adapter(adapter_model_id)
For a hackathon project, improving this fine-tuning process would be a great way to create a more sophisticated mental health support application.
Create a Chatbot with the Fine-tuned Model
Now that we have a fine-tuned model, let’s create a chatbot to make use of it. To keep it simple, we run a pragmatic CLI bot within the local environment.
Let’s start by having a closer look at how to create the project and how dependencies are managed in general. For this, we are using Poetry, a tool for dependency management and packaging in Python.
The three main tasks Poetry can help you with are: Build, Publish and Track. The idea is to have a deterministic way to manage dependencies, to share your project and to track dependency states.
Poetry also handles the creation of virtual environments for you. By default, those are created in a centralized folder within your system. However, if you prefer to have the virtual environment of a project in the project folder, like I do, it is a simple config change:
poetry config virtualenvs.in-project true
With poetry new you can then create a new Python project. It will create a virtual environment linked to your system’s default Python. If you combine this with pyenv, you get a flexible way to create projects using specific Python versions. Alternatively, you can also tell Poetry directly which Python version to use: poetry env use /full/path/to/python.
Once you have a new project, you can use poetry add to add dependencies to it.
Let’s start by creating the project for our bot and add all necessary dependencies:
poetry new mental-health-bot
cd mental-health-bot

poetry add huggingface_hub
poetry add adapters
poetry add transformers
poetry add peft
poetry add torch
With that, we can create the app.py main file with the code to run our bot. As before, if you want to run this on your own, please replace the Hugging Face token placeholder with your own token. This time, a read-only token is sufficient, as we simply fetch the base and fine-tuned models from Hugging Face without uploading anything.
Another thing to mention is that I am running this in the following environment:
- Apple MacBook Pro
- CPU: M1 Max
- Memory: 64 GB
- macOS: Sonoma 14.4.1
- Python 3.12
To increase performance, I am using the so-called Metal Performance Shaders (MPS) device for PyTorch to leverage the GPU on macOS devices:
device = torch.device("mps")
torch.set_default_device(device)
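If you are not on an Apple Silicon machine, or if your PyTorch build does not support MPS, a small fallback keeps the script portable; this is just a sketch, and the rest of the code works the same regardless of the device:

# prefer Apple's Metal backend, fall back to CUDA or CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

torch.set_default_device(device)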
Also, we will use the same base prompt as we used for training. Putting everything together, this is the pragmatic CLI version of our chatbot:
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("mps")
torch.set_default_device(device)
login(token="your-token")
title = "Mental Health Chatbot"
description = "This bot is using a fine-tuned version of meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_model_id = "vojay/Llama-2-7b-chat-hf-mental-health"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.load_adapter(adapter_model_id)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
def get_base_prompt():
    return """
    You are a knowledgeable and supportive psychologist. You provide emphatic, non-judgmental responses to users seeking
    emotional and psychological support. Provide a safe space for users to share and reflect, focus on empathy, active
    listening and understanding.
    """

def format_prompt(base, user_message):
    return f"<s>[INST] <<SYS>>{base}<</SYS>>{user_message} [/INST]"
def chat_with_llama(prompt):
    input_ids = tokenizer.encode(format_prompt(get_base_prompt(), prompt), return_tensors="pt")
    input_ids = input_ids.to(device)

    output = model.generate(
        input_ids,
        pad_token_id=tokenizer.eos_token_id,
        max_length=2000,
        do_sample=True,  # enable sampling so that temperature, top_k and top_p take effect
        temperature=0.9,
        top_k=50,
        top_p=0.9
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return decoded.split("[/INST]")[1].lstrip()
while True:
    prompt = input("You: ")
    response = chat_with_llama(prompt)
    print(f"Llama: {response}")
Time to give it a try! Let’s run it with the following chat input:
I'm going through some things with my feelings and myself. I barely sleep and I've been struggling with anxiety, and stress. Can you recommend any coping strategies to avoid medication?
This is the result: