[D] Training a Vision model on a Text-Only Dataset using Axolotl

By skyforbes, Nov 28, 2025

I'm planning to fine-tune Llama 3.2 11B Vision Instruct on a JSONL dataset of domain-specific question-answer pairs (purely text, no images). The goal is to improve its instruction-following behavior for specialized text tasks while retaining its ability to handle multimodal inputs such as OCR and image-based queries.

I am using Axolotl. Its examples include a sample .yaml file for this model:
https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-3-vision/lora-11b.yaml
```
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
# optionally might have model_type or tokenizer_type or processor_type
processor_type: AutoProcessor
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

chat_template: llama3_2_vision
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./outputs/out

adapter: lora
lora_model_dir:

sequence_len: 8192
pad_to_sequence_len: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
fp16:
tf32: true

gradient_checkpointing: true
logging_steps: 1
# flash_attention: true  # use for text-only mode
sdp_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
```
Based on that, I wrote a similar .yaml file:

```
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

# Vision-chat template handling
# skip_prepare_dataset: true
# remove_unused_columns: false
# sample_packing: false

chat_template: llama3_2_vision

datasets:
  - path: <path_to_dataset>
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant
train_on_inputs: false

output_dir: <path_to_output_directory>

# Training parameters
sequence_len: 8192
pad_to_sequence_len: false
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1

optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
weight_decay: 0.0
warmup_ratio: 0.1

# Precision & performance
bf16: true
fp16:
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true  # text-only mode
# sdp_attention: true

# Checkpointing
evals_per_epoch: 1
saves_per_epoch: 1
save_first_step: true
save_total_limit: 3

weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
```
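
To make the differences between the example config and my modified one easier to see, here is a minimal Python sketch that compares them key by key. The two dicts are hand-copied subsets of the configs in this post (an illustrative selection, not the full files), and `diff_configs` is a hypothetical helper, not part of Axolotl. It only lists the differences and makes no claim about which one causes the errors.

```python
# Hand-copied subsets of the two configs in this post (illustrative selection).
EXAMPLE = {
    "processor_type": "AutoProcessor",
    "skip_prepare_dataset": True,
    "remove_unused_columns": False,
    "sample_packing": False,
    "sdp_attention": True,
    "adapter": "lora",
}
MINE = {
    "processor_type": "AutoProcessor",
    "tokenizer_type": "AutoTokenizer",
    "flash_attention": True,
    "save_first_step": True,
}

UNSET = "<unset>"  # marker for a key that is absent or commented out


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(a) | set(b)):
        value_a = a.get(key, UNSET)
        value_b = b.get(key, UNSET)
        if value_a != value_b:
            changes[key] = (value_a, value_b)
    return changes


if __name__ == "__main__":
    for key, (example_value, my_value) in diff_configs(EXAMPLE, MINE).items():
        print(f"{key}: example={example_value!r} mine={my_value!r}")
```

Among other things, the comparison shows that my config switches from `sdp_attention` to `flash_attention`, leaves the three vision-template lines unset, and has no `adapter: lora` entry.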

But when I run
axolotl train config.yaml
with processor_type set, i.e. with
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer
I get the error
KeyError: 'Indexing with integers is not available when using Python based feature extractors'

But when I remove the processor_type field, i.e. use
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

or even
```
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>

# Vision-chat template handling
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false
```

I get the error
AttributeError: 'MllamaTextSelfAttention' object has no attribute 'is_causal'

What happened here?
How does one do this?
Will this fine-tuning lead to loss of Vision Capabilities of the model?
Is there a guide to writing config.yaml files for different models?

Python Version: 3.12
Axolotl Version: Latest
Dataset: a .jsonl file with records of the form
{ "messages": [ {"role": "system", "content": "<system_prompt>"}, {"role": "user", "content": "<question>"}, {"role": "assistant", "content": "<answer>"} ] }
which was previously used to fine-tune Llama 3.1 8B with the following config.yaml:

```
base_model: NousResearch/Meta-Llama-3.1-8B-Instruct
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

chat_template: llama3
datasets:
  - path: <path_to_dataset>
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant
train_on_inputs: false

output_dir: <path_to_output_directory>

sequence_len: 2048
sample_packing: true

gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 4

optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

bf16: auto
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
auto_resume_from_checkpoints: true
save_only_model: false

logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
saves_per_epoch: 1
save_total_limit: 3
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
```
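
To rule out the dataset itself as the source of the problem, the record format described above can be checked with a short script. This is a minimal sketch using only the standard library; `check_record` and `check_file` are hypothetical helper names, and the rule that `content` must be a plain string reflects the shape of my dataset, not a requirement I know Axolotl to impose.

```python
import json

# Roles used in the dataset described in this post.
ALLOWED_ROLES = ("system", "user", "assistant")


def check_record(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a plain string")
    return problems


def check_file(path):
    """Print every problem found in a JSONL file, prefixed with its line number."""
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if line.strip():
                for problem in check_record(line):
                    print(f"{path}:{lineno}: {problem}")
```

Calling `check_file` on the dataset path prints nothing when every record matches the expected shape.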

Thank you.
