🚀 LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

Aditi Jha, Sam Havens, Jeremy Dohmann, Alex Trott, Jacob Portes

Princeton University, MosaicML x Databricks
aditijha@princeton.edu, jacob.portes@databricks.com
GitHub | Paper (arXiv:2311.13133)

LIMIT schematic figure
How should LLMs be finetuned and evaluated for general-purpose instruction following? (A) We finetune the open-source LLMs MPT-7B and MPT-30B on datasets of varying sizes: Instruct-v1 and Instruct-v3, which contain 56.2k-59.3k instruction samples, and the LIMA dataset, which contains 1,000 samples. (B) We then evaluate the finetuned models using two paradigms: (1) traditional NLP perplexity-based evaluation on benchmarks such as MMLU and BIG-bench, and (2) model-based evaluation (via GPT-4) of open-ended generation.
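To make the first evaluation paradigm concrete: perplexity-based benchmarks such as MMLU score a model by the likelihood it assigns to each candidate answer, rather than by grading generated free text. Below is a minimal sketch using Hugging Face transformers; the checkpoint, prompt format, and example question are illustrative assumptions, not the paper's actual harness (the paper uses the MosaicML Eval Gauntlet):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper evaluates finetuned MPT-7B and MPT-30B.
NAME = "mosaicml/mpt-7b"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME, trust_remote_code=True)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Total log-probability the model assigns to `choice` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Row i of log_probs is the distribution over token i+1 (next-token shift).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_ids = full_ids[0, prompt_len:]
    answer_rows = log_probs[prompt_len - 1:]
    return answer_rows.gather(1, answer_ids.unsqueeze(1)).sum().item()

# Toy multiple-choice item in an MMLU-like format (not from the benchmark).
prompt = "Question: What is the capital of France?\nAnswer:"
choices = ["Paris", "Berlin", "Madrid", "Rome"]
print(max(choices, key=lambda c: choice_logprob(prompt, c)))
```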

Abstract

Large Language Models are traditionally finetuned on large instruction datasets. However, recent studies suggest that small, high-quality datasets can suffice for general-purpose instruction following. This lack of consensus around finetuning best practices is due in part to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small number of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation. We finetune the open-source MPT-7B and MPT-30B models on instruction finetuning datasets ranging in size from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.
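The "mixing" in the last sentence amounts to concatenating a small subsample of an Instruct dataset with the 1,000 LIMA samples before finetuning. A minimal sketch of that step follows; the file names, field layout, and seed are assumptions for illustration, since the actual datasets ship in their own formats:

```python
import json
import random

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical local copies of the finetuning data.
instruct = load_jsonl("instruct_v3.jsonl")  # ~56k textbook-style samples
lima = load_jsonl("lima_train.jsonl")       # 1,000 open-ended QA samples

# Subsample 1k Instruct examples and mix with all of LIMA,
# mirroring the "combined" finetuning condition described below.
random.seed(17)
combined = random.sample(instruct, k=1000) + lima
random.shuffle(combined)

with open("combined_finetune.jsonl", "w") as f:
    for example in combined:
        f.write(json.dumps(example) + "\n")
```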

Main Result

main result (Fig 4)
Models finetuned on the LIMA training set plus 1k-5k samples from the Instruct training sets perform well across both evaluation paradigms. (A) Accuracy of finetuned models on each category of the MosaicML Eval Gauntlet, along with their average scores. MPT-7B and MPT-30B finetuned on a subset of the Instruct datasets (5k samples from Instruct-v1 for 7B, 1k samples from Instruct-v3 for 30B) combined with the LIMA dataset perform very close to MPT-7B and MPT-30B finetuned on all of Instruct, respectively. (B) Model-based evaluation on the LIMA test set using GPT-4. (Top) MPT-7B finetuned on the combined dataset is preferred over MPT-7B finetuned on LIMA alone by a large margin. (Bottom) MPT-30B finetuned on the combined dataset is preferred 46.7% of the time over MPT-30B finetuned on LIMA. In both cases, the preference rate for models finetuned on the combined dataset is higher than for those finetuned on all of the Instruct datasets.
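For the model-based paradigm in panel (B), a judge model is shown two responses to the same prompt and asked which it prefers; the preference rate is the fraction of test prompts on which one model's response wins. A minimal sketch of one such pairwise comparison using the OpenAI Python client (the judge prompt wording here is an assumption; the paper's exact evaluation prompt may differ):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Which response is better? Reply with exactly "A" or "B"."""

def judge(instruction: str, response_a: str, response_b: str) -> str:
    """Ask GPT-4 which of two model outputs it prefers ("A" or "B")."""
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                instruction=instruction,
                response_a=response_a,
                response_b=response_b,
            ),
        }],
    )
    return reply.choices[0].message.content.strip()
```

In practice the A/B order would be randomized per example to control for the judge's position bias, and a tie option may be allowed as a third outcome.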

Citation

@article{jha2023limit,
  title={LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms},
  author={Aditi Jha and Sam Havens and Jeremy Dohmann and Alex Trott and Jacob Portes},
  journal={arXiv preprint arXiv:2311.13133},
  year={2023},
}