Introduction: Why Fine-Tuning Matters
Large language models like Llama 3 and Mistral are incredibly powerful out of the box, but they are generalists. They know a little about everything and a lot about nothing specific to your business. If you need a model that understands your company's internal documentation, speaks in your brand voice, or handles a domain-specific task like classifying medical records or generating legal summaries, you need to fine-tune.
Full fine-tuning updates every parameter in the model. For a 7-billion-parameter model, that means storing and updating 7 billion floating-point numbers during training. This requires multiple high-end GPUs, hundreds of gigabytes of VRAM, and significant compute budgets. For most teams, this is simply not feasible.
Enter LoRA (Low-Rank Adaptation). LoRA freezes the original model weights and injects small trainable matrices into each layer. Instead of updating 7 billion parameters, you train only a few million to a few tens of millions, depending on the rank and which layers you adapt. The result: you can fine-tune a 7B model on a single consumer GPU in under an hour. Combined with QLoRA (quantized LoRA), which loads the base model in 4-bit precision, you can fine-tune on a GPU with as little as 16 GB of VRAM.
In this tutorial, you will fine-tune an open-source LLM on a custom dataset using LoRA and QLoRA. Every code block is complete and runnable. By the end, you will have a model that is specialized to your data and ready for inference.
What Is LoRA?
LoRA, introduced in the LoRA paper by Hu et al. (2021), is based on a simple but powerful insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix W of dimensions d x d, LoRA decomposes the update into two smaller matrices: A (r x d) and B (d x r), where r is much smaller than d (typically 8, 16, or 32).
The original weight matrix is frozen. During the forward pass, the output is computed as W*x + B*A*x. Only A and B are updated during training. This means the trainable parameter count drops by 100-1000x. After training, the LoRA matrices can be merged back into the original weights, producing a standard model with zero inference overhead.
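To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer. This is an illustration only, not the PEFT library's implementation (which we use later in this tutorial); the rank and alpha values are just example settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze the original weights W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in, random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero init, so training starts from W
        self.scale = alpha / r                               # the learned update is scaled by alpha / r

    def forward(self, x):
        # Output = W x + (alpha / r) * B A x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a projection layer, e.g. LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)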
QLoRA takes this further. It loads the base model in 4-bit NormalFloat (NF4) quantization, reducing memory usage by roughly 4x compared to float16. The LoRA adapters themselves remain in full precision (float16 or bfloat16), so training quality stays high. The combination of a 4-bit base model and low-rank adapters can squeeze the weights of a 7B model into roughly 6 GB of VRAM and a 13B model into roughly 10 GB, though in practice you should budget more for activations and optimizer state (see GPU Requirements below).
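To see where these memory numbers come from, here is a quick back-of-envelope calculation. It covers weights only; activations, adapter gradients, and optimizer state come on top, which is why the Prerequisites section recommends 16 GB. The adapter size used below is a hypothetical example.
params = 7e9                                 # 7B-parameter base model

fp16_weights_gb = params * 2 / 1e9           # 2 bytes per parameter in float16  -> ~14 GB
nf4_weights_gb  = params * 0.5 / 1e9         # 4 bits = 0.5 bytes per parameter  -> ~3.5 GB

adapter_params  = 42e6                       # hypothetical LoRA adapter size (tens of millions)
adapter_gb      = adapter_params * 2 / 1e9   # adapters stay in bfloat16         -> ~0.08 GB

print(f"fp16 base: {fp16_weights_gb:.1f} GB, NF4 base: {nf4_weights_gb:.1f} GB, adapters: {adapter_gb:.2f} GB")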
Prerequisites
Before you begin, make sure you have the following:
- Python 3.10 or later
- An NVIDIA GPU with at least 16 GB of VRAM (T4, RTX 3090, RTX 4090, or A100). You can also use a free Google Colab instance with a T4 GPU.
- A Hugging Face account with access to a gated model (Llama 3 or Mistral). Request access on the model card page.
- CUDA 12.1+ and cuDNN installed
- A custom dataset in JSON or CSV format (we will show you the expected format below)
GPU Requirements
QLoRA with a 7B model requires a minimum of 16 GB VRAM (NVIDIA T4 or equivalent). For best results with 13B models or larger batch sizes, an A100 (40 GB or 80 GB) is recommended. If you do not have local GPU access, use Google Colab (free T4) or a cloud provider like Lambda Labs, RunPod, or Vast.ai.
Step 1: Install Dependencies
We rely on four key libraries: transformers for model loading, peft for LoRA (see the PEFT library documentation), bitsandbytes for 4-bit quantization, and trl (see the TRL documentation) for the SFTTrainer, which simplifies supervised fine-tuning.
pip install --upgrade \
torch \
transformers \
datasets \
accelerate \
peft \
bitsandbytes \
trl \
scipy \
sentencepiece \
protobuf
If you are on Colab, these packages may already be partially installed. Running the command above ensures you have compatible versions. After installation, log in to Hugging Face to access gated models:
from huggingface_hub import login
# Paste your Hugging Face access token when prompted
login()
Step 2: Prepare Your Dataset
The SFTTrainer expects data in a conversational format or a simple text field. The most common approach for instruction-tuning is the Alpaca-style format with instruction, input, and output fields. Here is how to prepare and format your data:
import json
from datasets import Dataset
# Example: load your custom data from a JSON file
# Each entry should have "instruction", "input" (optional), and "output"
raw_data = [
{
"instruction": "Summarize the following customer support ticket.",
"input": "Customer reports that their order #4521 arrived damaged. The package was visibly crushed and two of the three items inside were broken. They are requesting a full refund and return shipping label.",
"output": "Order #4521 arrived with a crushed package and 2 of 3 items broken. Customer requests a full refund and return label."
},
{
"instruction": "Classify the sentiment of this product review.",
"input": "Absolutely love this keyboard! The mechanical switches feel amazing and the RGB lighting is gorgeous. Best purchase I've made all year.",
"output": "Positive"
},
{
"instruction": "Write a professional response to this customer complaint.",
"input": "I've been waiting 3 weeks for my order and nobody answers the phone!",
"output": "Dear Customer, I sincerely apologize for the delay and the difficulty reaching our support team. I have escalated your order for immediate processing and will personally follow up within 24 hours with a tracking number. As a gesture of goodwill, I am applying a 15% discount to your next order. Thank you for your patience."
}
]
def format_prompt(example):
    """Format each example into an Alpaca-style instruction prompt."""
    if example["input"]:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}
# Create a Hugging Face Dataset
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_prompt)
# Split into train and validation sets
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")
print(f"\nSample formatted prompt:\n{train_dataset[0]['text']}")Data Quality Is Everything
The quality of your fine-tuned model is directly proportional to the quality of your training data. A few hundred high-quality, well-formatted examples will outperform thousands of noisy or inconsistent ones. Spend time cleaning and validating your data before training.
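As a starting point, a quick sanity pass over raw_data before training can catch the most common problems. This is a minimal sketch with hypothetical checks; adapt the rules to your own schema.
# Minimal sketch: drop malformed, empty, and duplicate examples before training.
def validate(examples):
    seen, clean = set(), []
    for ex in examples:
        if "instruction" not in ex or "output" not in ex:
            continue                                   # skip malformed entries
        if not ex["instruction"].strip() or not ex["output"].strip():
            continue                                   # skip empty fields
        key = (ex["instruction"], ex.get("input", ""), ex["output"])
        if key in seen:
            continue                                   # skip exact duplicates
        seen.add(key)
        clean.append(ex)
    return clean

raw_data = validate(raw_data)
print(f"{len(raw_data)} examples kept after validation")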
Step 3: Load the Base Model with QLoRA
We will load the base model in 4-bit quantization using bitsandbytes. This reduces the model's memory footprint from roughly 14 GB (float16) to about 4 GB for a 7B model. The quantized model serves as the frozen backbone onto which we attach LoRA adapters.
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
# Choose your base model
model_name = "mistralai/Mistral-7B-v0.3"
# Alternative: "meta-llama/Meta-Llama-3-8B"
# Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 quantization
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 (use torch.float16 on pre-Ampere GPUs such as the T4)
bnb_4bit_use_double_quant=True, # Nested quantization for extra savings
)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto", # Automatically distribute across GPUs
trust_remote_code=True,
attn_implementation="flash_attention_2", # Use Flash Attention 2 if available
)
# Disable caching for training (saves memory)
model.config.use_cache = False
model.config.pretraining_tp = 1  # Llama-specific setting; ensures standard (non-sliced) linear layers are used
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
print(f"Model loaded: {model_name}")
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")The key settings here are load_in_4bit=True which enables QLoRA, and bnb_4bit_quant_type="nf4" which uses the NormalFloat4 data type. This was shown in the QLoRA paper to be optimal for quantized fine-tuning. The bnb_4bit_use_double_quant=True flag enables nested quantization, which quantizes the quantization constants themselves, saving an additional 0.4 bits per parameter.

Step 4: Configure LoRA Parameters
Now we configure the LoRA adapters. There are three critical hyperparameters to understand:
- r (rank): The rank of the low-rank update matrices. Higher values mean more trainable parameters and greater capacity to learn, but also more memory usage. Common values are 8, 16, 32, or 64; for most tasks, 16 is a strong default (see the short sketch after this list for how r translates into parameter count).
- lora_alpha: A scaling factor applied to the LoRA update. The update is scaled by lora_alpha / r, so raising alpha strengthens the adapter's contribution relative to the frozen weights. A common heuristic is to set lora_alpha = 2 * r, though many practitioners simply set it to 16 or 32.
- lora_dropout: Dropout applied to the LoRA layers for regularization. A value of 0.05 to 0.1 works well for most cases. Set to 0 if you have a large dataset and are not worried about overfitting.
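To make the first bullet concrete, here is a rough calculation of how r translates into trainable parameters for a single adapted layer. The layer shapes below are just illustrative examples.
# LoRA adds r * (d_in + d_out) trainable parameters per adapted linear layer,
# and the learned update is scaled by lora_alpha / r.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

print(lora_params(4096, 4096, r=16))   # 131072 for one 4096 x 4096 projection
print(lora_params(4096, 4096, r=64))   # 524288 -- 4x the rank, 4x the parameters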
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Prepare the quantized model for training
# This handles gradient checkpointing and layer normalization in float32
model = prepare_model_for_kbit_training(model)
# Define LoRA configuration
lora_config = LoraConfig(
r=16, # Rank of the update matrices
lora_alpha=32, # Scaling factor (alpha / r = effective LR multiplier)
lora_dropout=0.05, # Dropout for regularization
bias="none", # Don't train bias terms
task_type=TaskType.CAUSAL_LM, # Task type: causal language modeling
target_modules=[ # Which layers to apply LoRA to
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Print trainable parameter count
model.print_trainable_parameters()
# Example output: "trainable params: 13,631,488 || all params: 3,752,071,168 || trainable%: 0.3633"The target_modules parameter specifies which linear layers receive LoRA adapters. For Llama and Mistral architectures, targeting all attention projections (q_proj, k_proj, v_proj, o_proj) plus the MLP layers (gate_proj, up_proj, down_proj) gives the best results. If memory is tight, you can start with just the attention projections.
Step 5: Train with SFTTrainer
The SFTTrainer from the TRL library handles the training loop, data collation, and logging. It wraps the standard Hugging Face Trainer with additional features for supervised fine-tuning, including automatic prompt formatting and packing of short examples for efficiency.
from trl import SFTConfig, SFTTrainer
# Define training arguments (SFTConfig extends TrainingArguments with SFT-specific options)
training_args = SFTConfig(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
gradient_checkpointing=True, # Saves memory at cost of speed
optim="paged_adamw_32bit", # Paged optimizer for memory efficiency
learning_rate=2e-4,
lr_scheduler_type="cosine", # Cosine annealing schedule
warmup_ratio=0.03, # 3% warmup steps
weight_decay=0.001,
fp16=False,                        # Set fp16=True and bf16=False on pre-Ampere GPUs (e.g., T4)
bf16=True,                         # Use bfloat16 mixed precision (requires Ampere or newer)
max_grad_norm=0.3, # Gradient clipping
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=50,
save_total_limit=3, # Keep only last 3 checkpoints
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
report_to="tensorboard", # Log to TensorBoard
seed=42,
# SFT-specific options (recent TRL versions expect these in SFTConfig rather than as SFTTrainer arguments)
max_seq_length=1024,               # Maximum sequence length
packing=True,                      # Pack short examples together
dataset_text_field="text",         # Column name with formatted text
)
# Create the trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
processing_class=tokenizer,
peft_config=lora_config,
)
# Start training
print("Starting training...")
trainer.train()
# Save the final adapter
trainer.save_model("./final_adapter")
print("Training complete. Adapter saved to ./final_adapter")A few notes on the training configuration. The gradient_accumulation_steps=4 means that gradients are accumulated over 4 mini-batches before a weight update, giving an effective batch size of 16. This lets you simulate larger batches without needing extra memory. The paged_adamw_32bit optimizer uses CPU RAM as overflow when GPU memory runs low, preventing out-of-memory crashes. The packing=True option concatenates multiple short examples into a single sequence up to max_seq_length, which dramatically improves throughput when your examples are short.
Monitoring Training
Watch your training loss and eval loss closely. If training loss decreases but eval loss starts increasing, you are overfitting. Remedies include reducing the number of epochs, increasing dropout, reducing the rank r, or adding more training data. Launch TensorBoard with: tensorboard --logdir ./results/runs
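If you prefer to automate this, one option is to attach the EarlyStoppingCallback from transformers when constructing the trainer in Step 5. The sketch below reuses the same trainer setup and relies on load_best_model_at_end and metric_for_best_model already being set in training_args.
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=lora_config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no eval_loss improvement
)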
Step 6: Merge and Save the Final Model
After training, you have two options. You can keep the LoRA adapter separate (useful if you want to swap adapters dynamically) or merge it into the base model for a single standalone model. Merging is recommended for deployment because it eliminates the LoRA overhead during inference.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Reload the base model in float16 (not quantized) for merging
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
)
# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "./final_adapter")
# Merge LoRA weights into the base model
model = model.merge_and_unload()
# Save the merged model
output_dir = "./merged_model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Merged model saved to {output_dir}")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")Note that for merging, we reload the base model in float16 (not 4-bit). This is because the merge operation needs the full-precision weights. If you are running on a machine with limited RAM, you can do this step on a CPU-only machine or a high-RAM cloud instance since the merge itself does not require a GPU.
Step 7: Test Your Fine-Tuned Model
Now for the moment of truth. Load the merged model and run inference to verify that it has learned from your data:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
# Load the merged model
model_path = "./merged_model"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Create a text generation pipeline
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
# Test with a prompt in the same format used during training
test_prompt = """### Instruction:
Summarize the following customer support ticket.
### Input:
Customer called about their subscription renewal. They were charged $49.99 but expected the promotional rate of $29.99. They have been a member for 3 years and are considering cancellation if the price is not adjusted.
### Response:
"""
result = pipe(test_prompt)
generated_text = result[0]["generated_text"]
# Extract only the response portion
response = generated_text.split("### Response:\n")[-1].strip()
print(f"Model response:\n{response}")If the model produces relevant, well-formatted responses that reflect the style of your training data, your fine-tuning was successful. If the output is incoherent or generic, consider increasing the number of training examples, training for more epochs, or increasing the LoRA rank.
Pushing Your Model to Hugging Face Hub
Once you are happy with your model, you can share it on the Hugging Face Hub so others can use it or so you can deploy it from anywhere:
# Push the merged model to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model", private=True)
tokenizer.push_to_hub("your-username/my-fine-tuned-model", private=True)
# Alternatively, push only the adapter (much smaller upload)
# trainer.model.push_to_hub("your-username/my-lora-adapter", private=True)
Tips for Better Results
Fine-tuning is as much art as science. Here are practical tips drawn from real-world experience:
- Start with a small dataset (100-500 examples) to validate your pipeline end to end. Scale up only after confirming the model learns the desired behavior.
- Use consistent formatting across all training examples. If you use the Alpaca template, use it for every single example without exception.
- Set max_seq_length to match the longest example in your dataset. Anything longer gets truncated, which can corrupt training examples.
- Learning rate matters a lot. If your model outputs gibberish, your learning rate is probably too high. Start with 2e-4 for QLoRA and reduce if needed.
- For classification or extraction tasks where outputs are short, 1-2 epochs is usually sufficient. For generation tasks where the model needs to learn a specific style, 3-5 epochs may be needed.
- Compare your fine-tuned model against the base model on a held-out test set. This gives you a clear measure of improvement and helps justify the effort.
- If you are fine-tuning for chat, use the model's native chat template rather than the Alpaca format. Check the model card for the correct format (a short sketch using apply_chat_template follows this list).
- Consider using the DPO (Direct Preference Optimization) trainer from TRL as a second stage after SFT to align the model with human preferences.
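For the chat-template tip above, here is a minimal sketch that formats one of the Step 2 examples with the tokenizer's built-in template instead of the Alpaca layout. It assumes the tokenizer defines a chat template, which the instruct/chat variants of Llama 3 and Mistral do.
# Sketch: format a training example with the model's native chat template.
example = raw_data[0]
messages = [
    {"role": "user", "content": f"{example['instruction']}\n\n{example['input']}"},
    {"role": "assistant", "content": example["output"]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)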
Related Reading
Continue learning with these related articles:
- How transformers work under the hood
- When to use RAG vs fine-tuning
- Deploy your fine-tuned model with Docker
Key Takeaways
- LoRA reduces trainable parameters by 100-1000x by injecting low-rank adapter matrices into frozen model layers. This makes fine-tuning accessible on consumer hardware.
- QLoRA adds 4-bit quantization to the base model, cutting VRAM requirements by 4x while maintaining training quality through full-precision adapters.
- Data quality trumps data quantity. A few hundred clean, well-formatted examples will produce better results than thousands of noisy ones.
- The full pipeline from data preparation to deployment can run on a single T4 GPU in under an hour. Use Google Colab if you do not have local GPU access.
- After training, merge the adapter into the base model for zero-overhead inference, or keep adapters separate for flexibility in swapping between different fine-tuned behaviors.
- Monitor eval loss during training to catch overfitting early. Use gradient checkpointing and paged optimizers to maximize what you can fit in limited VRAM.
Fine-tuning open-source LLMs with LoRA has fundamentally changed what is possible for small teams and individual developers. You no longer need a GPU cluster or a six-figure compute budget to build a model that is genuinely specialized for your use case. With the tools and techniques covered in this tutorial, you can go from raw data to a deployed custom model in an afternoon.



