Unlocking the Power of Custom LLMs with LoRA (Low-Rank Adaptation)

Introduction

Ever felt like taming a giant language model is a bit like wrestling an octopus?

Large Language Models (LLMs) represent a breakthrough in AI, but their training can be resource-intensive.

Enter LoRA (Low-Rank Adaptation) - your secret sauce for finetuning these colossal linguistic beasts!

Finetuning large language models (LLMs) using low-rank adaptation (LoRA) can unlock improved performance, but how do we extract the maximum benefit?

I've been researching this technique recently and found some interesting articles along the way.

In this post, I'll share the insights from that research. They can save you a lot of time and help you achieve better outcomes.

Understanding LoRA

The Essence of LoRA

LoRA has rapidly become an essential technique for anyone interested in adapting open-source LLMs.

LoRA revolutionizes weight updates in LLMs. Traditionally, updating the weight matrix, ΔW, for large models demands substantial compute and memory resources. LoRA simplifies this by decomposing ΔW into smaller matrices, drastically reducing memory requirements.

By efficiently updating just a subset of model weights, LoRA sidesteps the GPU memory limitations that often hinder finetuning models with billions of parameters.
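
To make this concrete, here is a minimal PyTorch sketch of the decomposition idea. The hidden size and rank below are illustrative values I picked, not numbers from any benchmark.

```python
# LoRA in a nutshell: instead of learning a full d x d update ΔW,
# learn two small factors A (d x r) and B (r x d) with r << d.
import torch

d, r = 4096, 8                        # hidden size and LoRA rank (illustrative)
W = torch.randn(d, d)                 # frozen pretrained weight
A = torch.randn(d, r) * 0.01          # trainable low-rank factor
B = torch.zeros(r, d)                 # trainable low-rank factor, initialized to zero

delta_W = A @ B                       # the implied weight update
W_adapted = W + delta_W               # effective weight used in the forward pass

full_params = d * d                   # parameters of a full ΔW
lora_params = d * r + r * d           # parameters LoRA actually trains
print(f"full ΔW: {full_params:,} params, LoRA: {lora_params:,} params "
      f"({lora_params / full_params:.2%} of the full update)")
```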

Below are the core insights I found most interesting. You can find more details in this article from Sebastian Raschka.

Core Insights

Consistency in Randomness

LLM training involves inherent randomness when running on GPUs. Surprisingly, though, LoRA benchmark results proved remarkably consistent across multiple runs.

QLoRA: A Balance Between Memory and Time

QLoRA, an extension of LoRA, further economizes memory usage at the cost of increased training time, offering a practical solution for GPU memory limitations.

Experiments found:

  • 33% less memory versus default LoRA

  • But 39% higher runtime from added quantization

So QLoRA presents a worthwhile tradeoff for memory-constrained situations. Performance remained stable despite the lower precision.
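
For reference, here is a hedged sketch of what a QLoRA-style setup looks like with the Hugging Face transformers, peft, and bitsandbytes stack. The model id and hyperparameters are placeholder assumptions, not the exact configuration behind the numbers above.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit; the LoRA adapters stay in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the LoRA matrices are trainable
```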

The Role of Learning Rate Schedulers

Integrating cosine annealing schedulers enhances SGD performance in LoRA training, while its impact on Adam and AdamW optimizers is minimal.
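
As a rough illustration, attaching a cosine annealing schedule to SGD in PyTorch looks like this; the stand-in model, learning rate, and step count are my own placeholder values.

```python
import torch

model = torch.nn.Linear(512, 512)    # stand-in for the trainable LoRA parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
num_steps = 1000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                 # decay the learning rate along a cosine curve
```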

Optimizer Choice

With moderate LoRA rank settings, switching optimizers showed negligible impact.

Specifically, AdamW, SGD + scheduler, or AdamW + scheduler performed virtually the same.

While Adam and AdamW are memory-intensive, their usage in LoRA-trained LLMs doesn't significantly impact memory demands due to the small proportion of trainable parameters.

SGD on its own (without a scheduler) was noticeably worse, and the memory it saves offers little advantage given how few parameters LoRA actually trains.

Multi-Epoch Training

Repeated iterations over a static dataset in multi-epoch training can actually degrade performance.

Overfitting seems the probable cause, though more research is needed to confirm it.

Expanding LoRA Across Layers

Applying LoRA to additional layers increases trainable parameters and memory requirements but can significantly enhance model performance.

Limiting LoRA to certain matrices leaves performance gains on the table. Experiments found:

  • Broadly enabling LoRA increased trainable parameters 5x

  • But it also lifted model performance substantially

  • Further exploration would optimize the exact layer application
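
To show what "broadly enabling LoRA" can look like in practice, here is a sketch using peft's target_modules option; the module names follow common Llama-style naming and are an assumption about the target architecture.

```python
from peft import LoraConfig

# Narrow: adapt only the attention query/value projections (a common default).
narrow = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# Broad: also adapt the remaining attention and MLP projections.
# More trainable parameters and memory, but typically better quality.
broad = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```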

The R and Alpha Balance

Adjusting LoRA's rank and alpha parameters is crucial. A common heuristic is setting alpha at twice the rank's value, although experimenting with different ratios can yield better outcomes.

The LoRA rank determines trainable parameters, while alpha scales update influence.

Best practice:

  • Set alpha to about twice the rank value

  • Though higher ranks can use smaller ratios, e.g. 0.5x

  • Tuning both hyperparameters remains critical for peak performance
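
A small sketch of the heuristic, assuming the peft LoraConfig API: the LoRA update is scaled by alpha / r, so setting alpha to twice the rank keeps that scaling constant as the rank grows.

```python
from peft import LoraConfig

for rank in (8, 16, 64, 256):
    cfg = LoraConfig(r=rank, lora_alpha=2 * rank)         # alpha = 2 x rank heuristic
    print(rank, cfg.lora_alpha, cfg.lora_alpha / cfg.r)   # scaling factor stays at 2.0
```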

Feasibility of Single GPU Training

LoRA enables the finetuning of 7 billion parameter models on a single GPU, demonstrating its efficiency and practicality.

Specifically:

  • Using QLoRA rank 256 and alpha 512

  • Reached best accuracy on benchmarks

  • Trained on 50k examples in just 3 hours

The Significance of Datasets

Dataset quality is paramount. For instance, a 65B Llama model finetuned on the LIMA dataset outperforms the same model finetuned on the Alpaca dataset.

LoRA for Domain Adaptation

How well LoRA handles domain adaptation is still an open question, but it is a practical option for further pretraining on domain-specific datasets.

Selecting the Best Rank

Determining the optimal rank, r, is a hyperparameter exploration that depends on the LLM and dataset diversity.

The ideal rank likely grows with:

  • Model size

  • Data diversity

  • Desired breadth of capabilities

But even small ranks can capture narrow domains, like basic arithmetic. Sweeping rank settings remains key.

Overfitting Mitigation Strategies

Controlling the rank and dataset size, along with adjustments in weight decay and dropout rates, can help mitigate overfitting.

Start by reducing rank, which limits trainable parameters. If that fails:

  • Gather more data

  • Increase weight decay in optimizers

  • Raise the LoRA layers' dropout rate

There's still more to explore around dropout tuning in particular.
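
As a minimal sketch of those last two knobs (assuming the peft LoraConfig API, with placeholder values and a dummy parameter standing in for the real LoRA weights):

```python
import torch
from peft import LoraConfig

# Lower rank = fewer trainable parameters; lora_dropout regularizes the adapter layers.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1)

# Weight decay is set on the optimizer; AdamW applies it in decoupled form.
dummy_lora_params = [torch.nn.Parameter(torch.randn(8, 512))]  # stand-in for the real LoRA parameters
optimizer = torch.optim.AdamW(dummy_lora_params, lr=1e-4, weight_decay=0.01)
```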

Exploring Alternative Optimizers

Adam dominates today, but emerging options like Sophia normalize gradients by curvature rather than variance.

Early results suggest:

  • 2x faster training

  • Better performance after fine-tuning

Comparing optimizers specialized for large models poses an interesting question for future work.

Comparison with Full Finetuning and RLHF

Full finetuning is more resource-intensive and may not yield superior results compared to LoRA-based approaches.

Memory remains the key differentiator limiting wider adoption. Only LoRA and QLoRA have achieved true single GPU training for models over 1B parameters so far.

Combining LoRA Weights

Multiple sets of LoRA weights can be combined, offering flexibility for real-world applications.

Once merged, the weights don't need to be re-applied separately at inference time.
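
Here is a hedged sketch of how the merge works with the peft library; the model id and adapter path are placeholders, not artifacts from the article.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")          # placeholder adapter path

merged = model.merge_and_unload()        # folds the low-rank update into the base weights
merged.save_pretrained("merged-model")   # serve the merged checkpoint like a normal model
```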

Layer-wise Optimal Rank Adaptation

Exploring different ranks for various layers adds complexity to hyperparameter optimization but could enhance performance.

Adapting the rank and other hyperparameters on a per-layer basis makes theoretical sense, but it results in exponentially more tuning scenarios.

Unless your computational budget is unlimited, stick to a single consistent rank across layers to keep the optimization challenge manageable.

Conclusion

In summary, LoRA offers a powerful, efficient approach for customizing LLMs.

By understanding and leveraging its principles, developers can significantly enhance model performance while managing resource constraints.

If you like this article, share it with others ♻️

That would help a lot ❤️

And feel free to follow me for more like this.