In more general terms, RAG is an important concept, especially when crafting more specialized LLM applications. It helps reduce the risk of the model giving wrong answers or hallucinating in general.
These are some open-source projects that might be helpful when approaching RAG in one of your projects:
- txtai: All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows.
- LangChain: LangChain is a framework for developing applications powered by large language models (LLMs).
- Qdrant: Vector Search Engine for the next generation of AI applications.
- Weaviate: Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable.
Of course, given the potential value of this approach for LLM-based applications, there are many more open- and closed-source alternatives, but these should be enough to get your research on the topic started.
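To make the retrieval part more tangible, here is a minimal sketch of the retrieval step in Python using the sentence-transformers library; the embedding model name and the toy documents are placeholders, and in a real project the document embeddings would live in one of the vector databases listed above rather than in memory:

from sentence_transformers import SentenceTransformer, util

# toy knowledge base, in practice stored and indexed in a vector database
documents = [
    "Our support hotline is available around the clock.",
    "Mindfulness exercises can help to reduce acute stress.",
    "Llama 2 is a family of large language models released by Meta."
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query, top_k=2):
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    return [documents[i] for i in scores.topk(top_k).indices.tolist()]

# the retrieved documents are prepended to the LLM prompt as grounding context
context = "\n".join(retrieve("How can I deal with stress?"))

The retrieved context is then injected into the prompt, so the model answers based on your own, trusted data instead of relying purely on what it memorized during training.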
Hackathon Guide: Building Your First Mental Health Chatbot
Disclaimer: the following chapter should help people who are interested in developing an AI-powered chatbot get started. It is not meant to be a sophisticated, production-ready mental health support solution.
To inspire you to get started with your own AI-driven project related to mental health, let’s fine-tune an LLM and create our own basic AI-powered mental health support chatbot step by step.
Since fine-tuning an LLM requires a capable environment, I am using a Jupyter notebook running in a Google Cloud Vertex AI Workbench instance. Vertex AI Workbench instances are Jupyter notebook-based development environments for the entire data science workflow. These instances come prepackaged with JupyterLab and a preinstalled suite of deep learning packages, including support for the TensorFlow and PyTorch frameworks. You can configure different types of instances based on your needs.
To finish the fine-tuning process in a reasonable amount of time and to have access to some modern features like FlashAttention (explained later), I used the following machine type:
- GPU type: NVIDIA A100 80GB
- Number of GPUs: 1
- 12 vCPUs
- 6 cores
- 170 GB memory
Running this instance costs around $4.19 per hour. Since instances are billed per second and only for what you use, there are no upfront costs. The fine-tuning process takes around 30 minutes, so the total cost is around $2.
You can also run the process on your local machine or in Google Colab, a web-based platform built around Jupyter notebooks. You access Colab through your web browser; no software installation is needed on your own computer.
The code you run in Colab actually executes on powerful machines in Google’s cloud, not your personal computer. This gives you access to advanced hardware like GPUs and TPUs, which are great for speeding up data analysis and machine learning tasks.
Colab provides a user-friendly environment with powerful computing resources in the cloud, all accessible through your web browser, and what is really cool: you can get started for free. The free tier already offers access to hardware accelerators; however, free Colab resources are not guaranteed or unlimited, and usage limits sometimes fluctuate. These interruptions might be frustrating, but that is the price of a sophisticated, free notebook platform.
Speaking of prices, you can of course upgrade to another plan, such as Pay As You Go or Colab Pro.
For this example, the free version with a T4 GPU would not offer enough resources for the fine-tuning process, which is why I chose a more powerful Vertex AI Workbench instance. However, Colab is a great way to get started with projects like this, so I still wanted to mention it.
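Whichever environment you end up with, it is worth running a quick sanity check in the notebook to confirm that PyTorch actually sees the GPU before starting the fine-tuning; the exact device name will of course depend on your instance:

import torch

print(torch.cuda.is_available())      # should print True on a GPU instance
print(torch.cuda.get_device_name(0))  # e.g. the A100 of the Workbench instance
print(torch.version.cuda)             # CUDA version PyTorch was built with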
Parameter-Efficient Fine-Tuning (PEFT) of Llama 2 with Mental Health Counseling Data
With an NVIDIA A100 80GB Tensor Core GPU, we have a really good basis for our fine-tuning process.
As explained earlier, fine-tuning LLMs is often costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods offer an efficient alternative by fine-tuning only a small number of model parameters.
In this example, we will use the meta-llama/Llama-2-7b-chat-hf model by Meta, hosted on Hugging Face. This model has 7 billion parameters and is optimized for dialogue. To fine-tune it, we will use the Amod/mental_health_counseling_conversations dataset, also available on Hugging Face, which contains a collection of questions and answers sourced from two online counseling and therapy platforms, covering a wide range of mental health topics.
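To get a feeling for the data, you can load the dataset and inspect a single entry; each sample has a Context field containing the question and a Response field containing the counselor’s answer:

from datasets import load_dataset

dataset = load_dataset("Amod/mental_health_counseling_conversations", split="train")

sample = dataset[0]
print(sample["Context"])   # question asked on the counseling platform
print(sample["Response"])  # answer written by a counselor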
The basic idea is: we load the model, tokenizer and dataset from Hugging Face. Then we create a LoraConfig with settings based on the previously mentioned Quantized LoRA (QLoRA) paper, prepare the model for training, configure a so-called SFTTrainer (Supervised Fine-Tuning Trainer) for the fine-tuning process, train the model, save it, and finally push the fine-tuned model back to Hugging Face so that we can use it in an application later.
As explained, I am running the process within a Jupyter notebook, so let’s break down each individual step of the fine-tuning procedure.
First, we install all required libraries, including PyTorch and the toolkits provided by Hugging Face. The environment this is running in allows us to use FlashAttention, based on the FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness paper, which requires CUDA 11, NVCC, and a Turing or Ampere GPU. This particular dependency has to be installed after torch, so we run it separately in a second step.
pip install torch torchvision datasets transformers tokenizers bitsandbytes peft accelerate trl
pip install flash-attn
Then, we import everything we need for the fine-tuning process:
import gc
import torch
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from trl import SFTTrainer
Next, we set up some variables to specify the model we are going to use, the dataset, and the Hugging Face User Access Token. This token is used to interact with the Hugging Face platform to download and publish models, datasets and more. To create a token, register for free at https://huggingface.co/, then open your account settings and select Access Tokens from the menu. For this process, we need a token with write access since we are going to publish the fine-tuned model to Hugging Face later.
If you want to try the fine-tuning yourself, just replace the placeholder in the following code with your own token.
# see: https://huggingface.co/docs/hub/security-tokens
# must be write token to push model later
hf_token = "your-token"

# https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
base_model = "meta-llama/Llama-2-7b-chat-hf"
# https://huggingface.co/datasets/Amod/mental_health_counseling_conversations
fine_tuning_dataset = "Amod/mental_health_counseling_conversations"
# name for output model
target_model = "vojay/Llama-2-7b-chat-hf-mental-health"
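As a side note: instead of hard-coding the token, you can also authenticate interactively with the notebook_login function we imported above, which opens an input widget in the notebook and stores the token for the current environment:

# interactive alternative to hard-coding hf_token
notebook_login()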
For the next part it is important to understand that prompts are usually composed of multiple elements following a specific template. This depends on the model, of course; the llama-2-chat model uses the following format, based on the Llama 2 paper, to define system and instruction prompts:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST] {{ model_response }} </s>
This format might look cryptic at first, but it becomes clearer when looking at the individual elements:
- <s>: beginning of sequence.
- </s>: end of sequence.
- <<SYS>>: beginning of system message.
- <</SYS>>: end of system message.
- [INST]: beginning of instructions.
- [/INST]: end of instructions.
- system_prompt: overall context for model responses.
- user_message: user instructions for generating output.
- model_response: expected model response, used for training only.
When we train the model, we must follow this format, so the next step is to define a proper template and functions to transform the sample data accordingly. Let’s start with the system or base prompt to create an overall context:
def get_base_prompt():
    return """
    You are a knowledgeable and supportive psychologist. You provide emphatic, non-judgmental responses to users seeking
    emotional and psychological support. Provide a safe space for users to share and reflect, focus on empathy, active
    listening and understanding.
    """
We will re-use this base prompt later to enrich user input before we send it to the LLM for evaluation. This is also a good starting point for your own projects, since improving the base prompt can make the LLM respond much better.
Now let’s define a function to format training data accordingly:
def format_prompt(base, context, response):
    return f"<s>[INST] <<SYS>>{base}<</SYS>>{context} [/INST] {response} </s>"
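To see what a single training sample looks like after formatting, you can run the function on a hypothetical question and answer pair (both strings are made up for illustration):

example = format_prompt(
    get_base_prompt(),
    "I feel anxious and can barely sleep. What can I do?",  # hypothetical Context
    "It sounds like you are going through a stressful time."  # hypothetical Response
)
print(example)  # <s>[INST] <<SYS>> ... <</SYS>> ... [/INST] ... </s>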
The next step is the fine-tuning itself, which is wrapped into a function, so that we first simply define the process and then execute it in a separate cell of the notebook:
def train_mental_health_model():
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        token=hf_token,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=False
        ),
        torch_dtype=torch.float16,  # reduce memory usage
        attn_implementation="flash_attention_2"  # optimize for tensor cores (NVIDIA A100)
    )

    # LoRA config based on QLoRA paper
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)

    args = TrainingArguments(
        output_dir=target_model,  # model output directory
        overwrite_output_dir=True,  # overwrite output if exists
        num_train_epochs=2,  # number of epochs to train
        per_device_train_batch_size=2,  # batch size per device during training
        gradient_checkpointing=True,  # save memory but causes slower training
        logging_steps=10,  # log every 10 steps
        learning_rate=1e-4,  # learning rate
        max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
        warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
        optim="paged_adamw_8bit",  # memory-efficient variant of AdamW optimizer
        lr_scheduler_type="constant",  # constant learning rate
        save_strategy="epoch",  # save at the end of each epoch
        evaluation_strategy="epoch",  # evaluation at the end of each epoch
        fp16=True,  # use fp16 16-bit precision training instead of 32-bit to save memory
        tf32=True  # optimize for tensor cores (NVIDIA A100)
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model, token=hf_token)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # limit samples to reduce memory usage
    dataset = load_dataset(fine_tuning_dataset, split="train")
    train_dataset = dataset.select(range(2000))
    eval_dataset = dataset.select(range(2000, 2500))

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        max_seq_length=1024,
        tokenizer=tokenizer,
        formatting_func=lambda entry: format_prompt(get_base_prompt(), entry["Context"], entry["Response"]),
        packing=True,
        args=args
    )

    gc.collect()
    torch.cuda.empty_cache()

    trainer.train()
    trainer.save_model()
    trainer.push_to_hub(target_model, token=hf_token)
I added comments to all training arguments to make the configuration transparent. However, the specifics depend on the environment you are running the training in and on the input model and dataset, so adjustments might be necessary.
Let’s have a closer look at how the process works in detail. With AutoModelForCausalLM.from_pretrained we load the model, and by setting the quantization_config we quantize it to 4-bit weights and activations, which offers benefits in terms of memory usage and performance. By setting attn_implementation to flash_attention_2, we enable the FlashAttention-2 implementation.
FlashAttention-2 is a faster and more efficient implementation of the standard attention mechanism that can significantly speed up inference by additionally parallelizing the attention computation over sequence length and partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them.
The LoraConfig configures the Low-Rank Adaptation (LoRA) process: lora_alpha controls the scaling factor for the weight matrices, lora_dropout sets the dropout probability of the LoRA layers, r controls the rank of the low-rank matrices, bias determines how bias terms are handled, and task_type reflects the task of the fine-tuned model.
Once the LoraConfig is set up, we create the PeftModel with the get_peft_model() function.
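If you want to verify how few parameters are actually trained, you can call print_trainable_parameters() on the wrapped model right after this step; the exact numbers depend on the LoRA configuration:

# only the LoRA adapter weights require gradients after get_peft_model()
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...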
With the prepared model, the next step is to prepare the training. To do so, we create a TrainingArguments object, which controls all major aspects of the training process, including:
output_dir=target_model, # model output directory
overwrite_output_dir=True, # overwrite output if exists
num_train_epochs=2, # number of epochs to train
per_device_train_batch_size=2, # batch size per device during training
gradient_checkpointing=True, # save memory but causes slower training
logging_steps=10, # log every 10 steps
learning_rate=1e-4, # learning rate
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
optim="paged_adamw_8bit", # memory-efficient variant of AdamW optimizer
lr_scheduler_type="constant", # constant learning rate
save_strategy="epoch", # save at the end of each epoch
evaluation_strategy="epoch", # evaluation at the end of each epoch
fp16=True, # use fp16 16-bit precision training instead of 32-bit to save memory
tf32=True # optimize for tensor cores (NVIDIA A100)
Afterwards, we create the tokenizer for the model with AutoTokenizer.from_pretrained.
The next step is to load the mental health dataset. Here we are limiting the sample size to reduce memory usage and speed up the training.
With all that, we can instantiate the SFTTrainer, then train, save and publish the fine-tuned model using trainer.push_to_hub.
In the next step, we call the train_mental_health_model() function and can simply watch while the magic happens:
train_mental_health_model()
I pushed the fine-tuned model to Hugging Face, so you can fetch it from there if you want to skip the fine-tuning process.
Keep in mind that this fine-tuned model is actually an adapter for the base model, which means that in order to use it, we need to load the base model and apply the fine-tuning adapter:
model_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_model_id = "vojay/Llama-2-7b-chat-hf-mental-health"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.load_adapter(adapter_model_id)
For a hackathon project, improving this fine-tuning process would be a great way to create a more sophisticated mental health support application.
Create a Chatbot with the Fine-tuned Model
Now that we have a fine-tuned model, let’s create a chatbot to make use of it. To keep it simple, we run a pragmatic CLI bot within the local environment.
Let’s start by having a closer look at how to create the project and how dependencies are managed in general. For this, we are using Poetry, a tool for dependency management and packaging in Python.
The three main tasks Poetry can help you with are: Build, Publish and Track. The idea is to have a deterministic way to manage dependencies, to share your project and to track dependency states.
Poetry also handles the creation of virtual environments for you. By default, those are created in a centralized folder within your system. However, if you prefer to have the virtual environment of a project in the project folder, like I do, it is a simple config change:
poetry config virtualenvs.in-project true
With poetry new you can then create a new Python project. It will create a virtual environment linked to your system’s default Python. If you combine this with pyenv, you get a flexible way to create projects using specific Python versions. Alternatively, you can also tell Poetry directly which Python version to use: poetry env use /full/path/to/python.
Once you have a new project, you can use poetry add to add dependencies to it.
Let’s start by creating the project for our bot and add all necessary dependencies:
poetry new mental-health-bot
cd mental-health-bot

poetry add huggingface_hub
poetry add adapters
poetry add transformers
poetry add peft
poetry add torch
With that, we can create the app.py main file with the code to run our bot. As before, if you want to run this on your own, please replace the Hugging Face token placeholder with your own token. This time, a read-only token is sufficient, as we simply fetch the base and fine-tuned models from Hugging Face without uploading anything.
Another thing to mention is that I am running this in the following environment:
- Apple MacBook Pro
- CPU: M1 Max
- Memory: 64 GB
- macOS: Sonoma 14.4.1
- Python 3.12
To increase performance, I am using the so-called Metal Performance Shaders (MPS) device for PyTorch to leverage the GPU on macOS devices:
device = torch.device("mps")
torch.set_default_device(device)
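If you are not on an Apple Silicon machine, or if your PyTorch build does not support MPS, a small fallback keeps the script portable; this is just a sketch, and the rest of the code works the same regardless of the device:

# prefer Apple's Metal backend, fall back to CUDA or CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

torch.set_default_device(device)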
Also, we will use the same base prompt as we used for training. Putting everything together, this is the pragmatic CLI version of our chatbot:
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("mps")
torch.set_default_device(device)
login(token="your-token")
title = "Mental Health Chatbot"
description = "This bot is using a fine-tuned version of meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_model_id = "vojay/Llama-2-7b-chat-hf-mental-health"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.load_adapter(adapter_model_id)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
def get_base_prompt():
    return """
    You are a knowledgeable and supportive psychologist. You provide emphatic, non-judgmental responses to users seeking
    emotional and psychological support. Provide a safe space for users to share and reflect, focus on empathy, active
    listening and understanding.
    """

def format_prompt(base, user_message):
    return f"<s>[INST] <<SYS>>{base}<</SYS>>{user_message} [/INST]"
def chat_with_llama(prompt):
    input_ids = tokenizer.encode(format_prompt(get_base_prompt(), prompt), return_tensors="pt")
    input_ids = input_ids.to(device)

    output = model.generate(
        input_ids,
        pad_token_id=tokenizer.eos_token_id,
        max_length=2000,
        do_sample=True,  # enable sampling so that temperature, top_k and top_p take effect
        temperature=0.9,
        top_k=50,
        top_p=0.9
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return decoded.split("[/INST]")[1].lstrip()
while True:
    prompt = input("You: ")
    response = chat_with_llama(prompt)
    print(f"Llama: {response}")
Time to give it a try! Let’s run it with the following chat input:
I'm going through some things with my feelings and myself. I barely sleep and I've been struggling with anxiety, and stress. Can you recommend any coping strategies to avoid medication?
This is the result: