Introduction: Why Fine-Tuning Matters
Large language models like Llama 3 and Mistral are incredibly powerful out of the box, but they are generalists. They know a little about everything and a lot about nothing specific to your business. If you need a model that understands your company's internal documentation, speaks in your brand voice, or handles a domain-specific task like classifying medical records or generating legal summaries, you need to fine-tune.
Full fine-tuning updates every parameter in the model. For a 7-billion-parameter model, that means storing and updating 7 billion floating-point numbers during training. This requires multiple high-end GPUs, hundreds of gigabytes of VRAM, and significant compute budgets. For most teams, this is simply not feasible.
Enter LoRA (Low-Rank Adaptation). LoRA freezes the original model weights and injects small trainable matrices into each layer. Instead of updating 7 billion parameters, you train only a few million to a few tens of millions, depending on the rank and which layers you adapt. The result: you can fine-tune a 7B model on a single consumer GPU in under an hour. Combined with QLoRA (quantized LoRA), which loads the base model in 4-bit precision, you can fine-tune on a GPU with as little as 16 GB of VRAM.
In this tutorial, you will fine-tune an open-source LLM on a custom dataset using LoRA and QLoRA. Every code block is complete and runnable. By the end, you will have a model that is specialized to your data and ready for inference.
What Is LoRA?
LoRA, introduced in the LoRA paper by Hu et al. (2021), is based on a simple but powerful insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix W of dimensions d x d, LoRA decomposes the update into two smaller matrices: A (r x d) and B (d x r), where r is much smaller than d (typically 8, 16, or 32).
The original weight matrix is frozen. During the forward pass, the output is computed as W*x + B*A*x. Only A and B are updated during training. This means the trainable parameter count drops by 100-1000x. After training, the LoRA matrices can be merged back into the original weights, producing a standard model with zero inference overhead.
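To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer. This is an illustration only, not the PEFT library's implementation (which we use later in this tutorial); the rank and alpha values are just example settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze the original weights W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in, random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero init, so training starts from W
        self.scale = alpha / r                               # the learned update is scaled by alpha / r

    def forward(self, x):
        # Output = W x + (alpha / r) * B A x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a projection layer, e.g. LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)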
QLoRA takes this further. It loads the base model in 4-bit NormalFloat (NF4) quantization, reducing memory usage by roughly 4x compared to float16. The LoRA adapters themselves remain in full precision (float16 or bfloat16), so training quality stays high. The combination of a 4-bit base model and low-rank adapters can squeeze the weights of a 7B model into roughly 6 GB of VRAM and a 13B model into roughly 10 GB, though in practice you should budget more for activations and optimizer state (see GPU Requirements below).
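To see where these memory numbers come from, here is a quick back-of-envelope calculation. It covers weights only; activations, adapter gradients, and optimizer state come on top, which is why the Prerequisites section recommends 16 GB. The adapter size used below is a hypothetical example.
params = 7e9                                 # 7B-parameter base model

fp16_weights_gb = params * 2 / 1e9           # 2 bytes per parameter in float16  -> ~14 GB
nf4_weights_gb  = params * 0.5 / 1e9         # 4 bits = 0.5 bytes per parameter  -> ~3.5 GB

adapter_params  = 42e6                       # hypothetical LoRA adapter size (tens of millions)
adapter_gb      = adapter_params * 2 / 1e9   # adapters stay in bfloat16         -> ~0.08 GB

print(f"fp16 base: {fp16_weights_gb:.1f} GB, NF4 base: {nf4_weights_gb:.1f} GB, adapters: {adapter_gb:.2f} GB")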
Prerequisites
Before you begin, make sure you have the following:
- Python 3.10 or later
- An NVIDIA GPU with at least 16 GB of VRAM (T4, RTX 3090, RTX 4090, or A100). You can also use a free Google Colab instance with a T4 GPU.
- A Hugging Face account with access to a gated model (Llama 3 or Mistral). Request access on the model card page.
- CUDA 12.1+ and cuDNN installed
- A custom dataset in JSON or CSV format (we will show you the expected format below)
GPU Requirements
QLoRA with a 7B model requires a minimum of 16 GB VRAM (NVIDIA T4 or equivalent). For best results with 13B models or larger batch sizes, an A100 (40 GB or 80 GB) is recommended. If you do not have local GPU access, use Google Colab (free T4) or a cloud provider like Lambda Labs, RunPod, or Vast.ai.
Step 1: Install Dependencies
We rely on four key libraries: transformers for model loading, peft for LoRA (see the PEFT library documentation), bitsandbytes for 4-bit quantization, and trl (see the TRL documentation) for the SFTTrainer, which simplifies supervised fine-tuning.
pip install --upgrade \
torch \
transformers \
datasets \
accelerate \
peft \
bitsandbytes \
trl \
scipy \
sentencepiece \
protobuf
If you are on Colab, these packages may already be partially installed. Running the command above ensures you have compatible versions. After installation, log in to Hugging Face to access gated models:
from huggingface_hub import login
# Paste your Hugging Face access token when prompted
login()
Step 2: Prepare Your Dataset
The SFTTrainer expects data in a conversational format or a simple text field. The most common approach for instruction-tuning is the Alpaca-style format with instruction, input, and output fields. Here is how to prepare and format your data:
import json
from datasets import Dataset
# Example: load your custom data from a JSON file
# Each entry should have "instruction", "input" (optional), and "output"
raw_data = [
{
"instruction": "Summarize the following customer support ticket.",
"input": "Customer reports that their order #4521 arrived damaged. The package was visibly crushed and two of the three items inside were broken. They are requesting a full refund and return shipping label.",
"output": "Order #4521 arrived with a crushed package and 2 of 3 items broken. Customer requests a full refund and return label."
},
{
"instruction": "Classify the sentiment of this product review.",
"input": "Absolutely love this keyboard! The mechanical switches feel amazing and the RGB lighting is gorgeous. Best purchase I've made all year.",
"output": "Positive"
},
{
"instruction": "Write a professional response to this customer complaint.",
"input": "I've been waiting 3 weeks for my order and nobody answers the phone!",
"output": "Dear Customer, I sincerely apologize for the delay and the difficulty reaching our support team. I have escalated your order for immediate processing and will personally follow up within 24 hours with a tracking number. As a gesture of goodwill, I am applying a 15% discount to your next order. Thank you for your patience."
}
]
def format_prompt(example):
    """Format each example into an Alpaca-style instruction prompt."""
    if example["input"]:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}
# Create a Hugging Face Dataset
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_prompt)
# Split into train and validation sets
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")
print(f"\nSample formatted prompt:\n{train_dataset[0]['text']}")Data Quality Is Everything
The quality of your fine-tuned model is directly proportional to the quality of your training data. A few hundred high-quality, well-formatted examples will outperform thousands of noisy or inconsistent ones. Spend time cleaning and validating your data before training.
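As a starting point, a quick sanity pass over raw_data before training can catch the most common problems. This is a minimal sketch with hypothetical checks; adapt the rules to your own schema.
# Minimal sketch: drop malformed, empty, and duplicate examples before training.
def validate(examples):
    seen, clean = set(), []
    for ex in examples:
        if "instruction" not in ex or "output" not in ex:
            continue                                   # skip malformed entries
        if not ex["instruction"].strip() or not ex["output"].strip():
            continue                                   # skip empty fields
        key = (ex["instruction"], ex.get("input", ""), ex["output"])
        if key in seen:
            continue                                   # skip exact duplicates
        seen.add(key)
        clean.append(ex)
    return clean

raw_data = validate(raw_data)
print(f"{len(raw_data)} examples kept after validation")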
Step 3: Load the Base Model with QLoRA
We will load the base model in 4-bit quantization using bitsandbytes. This reduces the model's memory footprint from roughly 14 GB (float16) to about 4 GB for a 7B model. The quantized model serves as the frozen backbone onto which we attach LoRA adapters.
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
# Choose your base model
model_name = "mistralai/Mistral-7B-v0.3"
# Alternative: "meta-llama/Meta-Llama-3-8B"
# Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 quantization
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 (use torch.float16 on pre-Ampere GPUs such as the T4)
bnb_4bit_use_double_quant=True, # Nested quantization for extra savings
)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto", # Automatically distribute across GPUs
trust_remote_code=True,
attn_implementation="flash_attention_2", # Use Flash Attention 2 if available
)
# Disable caching for training (saves memory)
model.config.use_cache = False
model.config.pretraining_tp = 1  # Llama-specific setting; ensures standard (non-sliced) linear layers are used
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
print(f"Model loaded: {model_name}")
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")The key settings here are load_in_4bit=True which enables QLoRA, and bnb_4bit_quant_type="nf4" which uses the NormalFloat4 data type. This was shown in the QLoRA paper to be optimal for quantized fine-tuning. The bnb_4bit_use_double_quant=True flag enables nested quantization, which quantizes the quantization constants themselves, saving an additional 0.4 bits per parameter.

Step 4: Configure LoRA Parameters
Now we configure the LoRA adapters. There are three critical hyperparameters to understand:
- r (rank): The rank of the low-rank update matrices. Higher values mean more trainable parameters and greater capacity to learn, but also more memory usage. Common values are 8, 16, 32, or 64; for most tasks, 16 is a strong default (see the short sketch after this list for how r translates into parameter count).
- lora_alpha: A scaling factor applied to the LoRA update. The update is scaled by lora_alpha / r, so raising alpha strengthens the adapter's contribution relative to the frozen weights. A common heuristic is to set lora_alpha = 2 * r, though many practitioners simply set it to 16 or 32.
- lora_dropout: Dropout applied to the LoRA layers for regularization. A value of 0.05 to 0.1 works well for most cases. Set to 0 if you have a large dataset and are not worried about overfitting.
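To make the first bullet concrete, here is a rough calculation of how r translates into trainable parameters for a single adapted layer. The layer shapes below are just illustrative examples.
# LoRA adds r * (d_in + d_out) trainable parameters per adapted linear layer,
# and the learned update is scaled by lora_alpha / r.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

print(lora_params(4096, 4096, r=16))   # 131072 for one 4096 x 4096 projection
print(lora_params(4096, 4096, r=64))   # 524288 -- 4x the rank, 4x the parameters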
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Prepare the quantized model for training
# This handles gradient checkpointing and layer normalization in float32
model = prepare_model_for_kbit_training(model)
# Define LoRA configuration
lora_config = LoraConfig(
r=16, # Rank of the update matrices
lora_alpha=32, # Scaling factor (alpha / r = effective LR multiplier)
lora_dropout=0.05, # Dropout for regularization
bias="none", # Don't train bias terms
task_type=TaskType.CAUSAL_LM, # Task type: causal language modeling
target_modules=[ # Which layers to apply LoRA to
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Print trainable parameter count
model.print_trainable_parameters()
# Example output: "trainable params: 13,631,488 || all params: 3,752,071,168 || trainable%: 0.3633"The target_modules parameter specifies which linear layers receive LoRA adapters. For Llama and Mistral architectures, targeting all attention projections (q_proj, k_proj, v_proj, o_proj) plus the MLP layers (gate_proj, up_proj, down_proj) gives the best results. If memory is tight, you can start with just the attention projections.
Step 5: Train with SFTTrainer
The SFTTrainer from the TRL library handles the training loop, data collation, and logging. It wraps the standard Hugging Face Trainer with additional features for supervised fine-tuning, including automatic prompt formatting and packing of short examples for efficiency.
from trl import SFTConfig, SFTTrainer
# Define training arguments (SFTConfig extends TrainingArguments with SFT-specific options)
training_args = SFTConfig(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
gradient_checkpointing=True, # Saves memory at cost of speed
optim="paged_adamw_32bit", # Paged optimizer for memory efficiency
learning_rate=2e-4,
lr_scheduler_type="cosine", # Cosine annealing schedule
warmup_ratio=0.03, # 3% warmup steps
weight_decay=0.001,
fp16=False,                        # Set fp16=True and bf16=False on pre-Ampere GPUs (e.g., T4)
bf16=True,                         # Use bfloat16 mixed precision (requires Ampere or newer)
max_grad_norm=0.3, # Gradient clipping
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=50,
save_total_limit=3, # Keep only last 3 checkpoints
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
report_to="tensorboard", # Log to TensorBoard
seed=42,
# SFT-specific options (recent TRL versions expect these in SFTConfig rather than as SFTTrainer arguments)
max_seq_length=1024,               # Maximum sequence length
packing=True,                      # Pack short examples together
dataset_text_field="text",         # Column name with formatted text
)
# Create the trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
processing_class=tokenizer,
peft_config=lora_config,
)
# Start training
print("Starting training...")
trainer.train()
# Save the final adapter
trainer.save_model("./final_adapter")
print("Training complete. Adapter saved to ./final_adapter")A few notes on the training configuration. The gradient_accumulation_steps=4 means that gradients are accumulated over 4 mini-batches before a weight update, giving an effective batch size of 16. This lets you simulate larger batches without needing extra memory. The paged_adamw_32bit optimizer uses CPU RAM as overflow when GPU memory runs low, preventing out-of-memory crashes. The packing=True option concatenates multiple short examples into a single sequence up to max_seq_length, which dramatically improves throughput when your examples are short.
Monitoring Training
Watch your training loss and eval loss closely. If training loss decreases but eval loss starts increasing, you are overfitting. Remedies include reducing the number of epochs, increasing dropout, reducing the rank r, or adding more training data. Launch TensorBoard with: tensorboard --logdir ./results/runs
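If you prefer to automate this, one option is to attach the EarlyStoppingCallback from transformers when constructing the trainer in Step 5. The sketch below reuses the same trainer setup and relies on load_best_model_at_end and metric_for_best_model already being set in training_args.
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=lora_config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no eval_loss improvement
)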
Step 6: Merge and Save the Final Model
After training, you have two options. You can keep the LoRA adapter separate (useful if you want to swap adapters dynamically) or merge it into the base model for a single standalone model. Merging is recommended for deployment because it eliminates the LoRA overhead during inference.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Reload the base model in float16 (not quantized) for merging
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
)
# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "./final_adapter")
# Merge LoRA weights into the base model
model = model.merge_and_unload()
# Save the merged model
output_dir = "./merged_model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Merged model saved to {output_dir}")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")Note that for merging, we reload the base model in float16 (not 4-bit). This is because the merge operation needs the full-precision weights. If you are running on a machine with limited RAM, you can do this step on a CPU-only machine or a high-RAM cloud instance since the merge itself does not require a GPU.
Step 7: Test Your Fine-Tuned Model
Now for the moment of truth. Load the merged model and run inference to verify that it has learned from your data:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
# Load the merged model
model_path = "./merged_model"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Create a text generation pipeline
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
# Test with a prompt in the same format used during training
test_prompt = """### Instruction:
Summarize the following customer support ticket.
### Input:
Customer called about their subscription renewal. They were charged $49.99 but expected the promotional rate of $29.99. They have been a member for 3 years and are considering cancellation if the price is not adjusted.
### Response:
"""
result = pipe(test_prompt)
generated_text = result[0]["generated_text"]
# Extract only the response portion
response = generated_text.split("### Response:\n")[-1].strip()
print(f"Model response:\n{response}")If the model produces relevant, well-formatted responses that reflect the style of your training data, your fine-tuning was successful. If the output is incoherent or generic, consider increasing the number of training examples, training for more epochs, or increasing the LoRA rank.
Pushing Your Model to Hugging Face Hub
Once you are happy with your model, you can share it on the Hugging Face Hub so others can use it or so you can deploy it from anywhere:
# Push the merged model to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model", private=True)
tokenizer.push_to_hub("your-username/my-fine-tuned-model", private=True)
# Alternatively, push only the adapter (much smaller upload)
# trainer.model.push_to_hub("your-username/my-lora-adapter", private=True)
Tips for Better Results
Fine-tuning is as much art as science. Here are practical tips drawn from real-world experience:
- Start with a small dataset (100-500 examples) to validate your pipeline end to end. Scale up only after confirming the model learns the desired behavior.
- Use consistent formatting across all training examples. If you use the Alpaca template, use it for every single example without exception.
- Set max_seq_length to match the longest example in your dataset. Anything longer gets truncated, which can corrupt training examples.
- Learning rate matters a lot. If your model outputs gibberish, your learning rate is probably too high. Start with 2e-4 for QLoRA and reduce if needed.
- For classification or extraction tasks where outputs are short, 1-2 epochs is usually sufficient. For generation tasks where the model needs to learn a specific style, 3-5 epochs may be needed.
- Compare your fine-tuned model against the base model on a held-out test set. This gives you a clear measure of improvement and helps justify the effort.
- If you are fine-tuning for chat, use the model's native chat template rather than the Alpaca format. Check the model card for the correct format (a short sketch using apply_chat_template follows this list).
- Consider using the DPO (Direct Preference Optimization) trainer from TRL as a second stage after SFT to align the model with human preferences.
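For the chat-template tip above, here is a minimal sketch that formats one of the Step 2 examples with the tokenizer's built-in template instead of the Alpaca layout. It assumes the tokenizer defines a chat template, which the instruct/chat variants of Llama 3 and Mistral do.
# Sketch: format a training example with the model's native chat template.
example = raw_data[0]
messages = [
    {"role": "user", "content": f"{example['instruction']}\n\n{example['input']}"},
    {"role": "assistant", "content": example["output"]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)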
Related Reading
Continue learning with these related articles:
- How transformers work under the hood
- When to use RAG vs fine-tuning
- Deploy your fine-tuned model with Docker
Key Takeaways
- LoRA reduces trainable parameters by 100-1000x by injecting low-rank adapter matrices into frozen model layers. This makes fine-tuning accessible on consumer hardware.
- QLoRA adds 4-bit quantization to the base model, cutting VRAM requirements by 4x while maintaining training quality through full-precision adapters.
- Data quality trumps data quantity. A few hundred clean, well-formatted examples will produce better results than thousands of noisy ones.
- The full pipeline from data preparation to deployment can run on a single T4 GPU in under an hour. Use Google Colab if you do not have local GPU access.
- After training, merge the adapter into the base model for zero-overhead inference, or keep adapters separate for flexibility in swapping between different fine-tuned behaviors.
- Monitor eval loss during training to catch overfitting early. Use gradient checkpointing and paged optimizers to maximize what you can fit in limited VRAM.
Fine-tuning open-source LLMs with LoRA has fundamentally changed what is possible for small teams and individual developers. You no longer need a GPU cluster or a six-figure compute budget to build a model that is genuinely specialized for your use case. With the tools and techniques covered in this tutorial, you can go from raw data to a deployed custom model in an afternoon.



