A research group led by Professor Choi Jung-wook of the Department of Electronic Engineering, Hanyang University, together with a KT research team, has developed two new algorithms that substantially improve the efficiency of large language models (LLMs). The study focused on maximizing both qualitative and quantitative performance while maintaining the efficiency of high-performance LLMs.
Large language models achieve outstanding performance by greatly increasing the number of weights in the deep neural network, but because they typically represent weights and activations as 16-bit floating-point values, they require massive storage and computational resources. Various quantization methods have been developed to address this, lowering storage and computation costs by representing weights and activations with fewer bits.
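To illustrate the general idea of weight quantization (not the specific schemes studied in the papers), the sketch below maps a 16-bit weight matrix to 4-bit integers with one scale per output channel; the function names and bit-width are illustrative assumptions.

```python
import torch

def quantize_weight_int4(w: torch.Tensor):
    """Symmetric per-channel 4-bit quantization of a weight matrix.
    A generic illustration, not the scheme used in the papers."""
    qmax = 7  # symmetric signed 4-bit range [-7, 7]
    # One scale per output channel (row of the weight matrix).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale  # int8 container holds the 4-bit values

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096, dtype=torch.float16)  # a 16-bit weight matrix
q, s = quantize_weight_int4(w.float())
w_hat = dequantize(q, s)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```

Storing 4-bit values instead of 16-bit ones cuts weight memory by roughly a factor of four, which is why such methods are attractive despite the rounding error they introduce.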
However, Professor Choi Jung-wook's team found that while conventional quantization methods preserve the quantitative performance of conversational LLMs, qualitative performance such as conversational ability degrades sharply. The first of the two algorithms the team developed tackles this issue.
Professor Choi's research team proposed a new method called Quantization-aware Direct Preference Optimization (QDPO) to solve the problem. The team found that the main cause of the degraded conversational ability of quantized conversational LLMs is token-flipping, a phenomenon in which the quantized model breaks the flow of a conversation by mispredicting certain words. Building on this observation, the team developed an efficient optimization method that aligns the low-bit-precision model with the high-bit-precision model.
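As a rough sketch of what token-flipping looks like in practice, the snippet below compares the greedy next-token predictions of a full-precision model and a 4-bit copy of it and reports the positions where they disagree. The checkpoint name and the bitsandbytes 4-bit loading are assumptions made for illustration, not the setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
fp16_model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto")
quant_model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto")  # assumes bitsandbytes is installed

@torch.no_grad()
def flipped_positions(text: str) -> list:
    """Positions where the 4-bit model's greedy next-token prediction
    disagrees with ('flips' against) the 16-bit model's prediction."""
    ids = tok(text, return_tensors="pt").input_ids
    fp_pred = fp16_model(ids.to(fp16_model.device)).logits.argmax(dim=-1).cpu()
    q_pred = quant_model(ids.to(quant_model.device)).logits.argmax(dim=-1).cpu()
    return (fp_pred != q_pred).nonzero(as_tuple=True)[1].tolist()

print(flipped_positions("Tell me about the history of Seoul."))
```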
QDPO maintains or improves conversational ability by automatically generating optimization data without additional labeling, using a 16-bit full-precision model together with a lower-bit-precision model. As a result, the team found that a 4-bit lower-bit-precision model trained with QDPO maintains quantitative performance comparable to the conventional approach while performing markedly better on a qualitative benchmark that employs the latest models such as GPT-4.
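One way to picture how such automatically generated preference pairs could be used is through the standard direct preference optimization (DPO) loss, with responses from the 16-bit model treated as "chosen" and responses from the low-bit model as "rejected". This is a minimal sketch of the generic DPO objective based on the article's description, not QDPO's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Generic DPO loss over sequence log-probabilities.
    In the setting described above, 'chosen' responses would come from
    the 16-bit model and 'rejected' ones from the low-bit model, so no
    human labels are needed (an assumption based on the article)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.0]), torch.tensor([-14.8]))
print(loss.item())
```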
In addition, the team developed a method called Rank-Adaptive Low-Rank Adaptation (RA-LoRA) to address the second issue: performance degradation in low-bit quantized inference. Conventional LoRA methods adjust only a fraction of the parameters to lower memory usage, which cannot fully correct quantization errors. RA-LoRA, in contrast, maintains optimal performance by dynamically adjusting each adapter's rank through rank subspace analysis.
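For reference, a conventional quantization-plus-LoRA layer keeps the quantized weight frozen and learns a small low-rank update on top of it. The sketch below assumes a fixed rank for every layer, which is precisely the limitation RA-LoRA targets; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class LoRAQuantLinear(nn.Module):
    """A frozen dequantized weight plus a trainable low-rank update B @ A.
    A generic LoRA-on-quantized-weights layer; the fixed rank `r` is what
    RA-LoRA replaces with a per-layer, adaptively chosen rank."""
    def __init__(self, w_quantized: torch.Tensor, r: int = 8, alpha: float = 16.0):
        super().__init__()
        out_f, in_f = w_quantized.shape
        self.register_buffer("w_q", w_quantized)          # frozen quantized weight
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))  # zero-init: no update at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.w_q.t()
        update = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return base + update

layer = LoRAQuantLinear(torch.randn(256, 128), r=8)
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 256])
```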
Rank subspace analysis involves selecting ranks suited to each layer's characteristics and its input data. Thanks to this, RA-LoRA can maintain high accuracy with minimal parameters, especially in 2-bit quantized inference. According to the findings, RA-LoRA outperformed conventional methods in 2-bit parameter-efficient fine-tuning of DeBERTa-V3 and LLaMA-2 models. In particular, it proved superior to LoRA combined with conventional quantization.
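One plausible way to read "rank subspace analysis" is to examine the singular-value spectrum of each layer's quantization error and keep only as many ranks as are needed to capture most of its energy. The heuristic below is an illustrative guess along those lines, not the criterion actually used in RA-LoRA.

```python
import torch

def suggest_rank(w_fp: torch.Tensor, w_deq: torch.Tensor,
                 energy: float = 0.90, max_rank: int = 64) -> int:
    """Pick the smallest rank whose leading singular values of the
    quantization error (w_fp - w_deq) capture `energy` of its total
    spectral energy. Illustrative heuristic only."""
    err = (w_fp - w_deq).float()
    s = torch.linalg.svdvals(err)
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    rank = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    return min(rank, max_rank)

w = torch.randn(512, 512)
w_deq = w + 0.02 * torch.randn(512, 512)  # stand-in for a dequantized weight
print(suggest_rank(w, w_deq))
```

Layers whose quantization error concentrates in a few directions would receive a small rank, while harder layers would receive a larger one, keeping the total adapter parameter count low.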
The first paper, Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment, was co-authored by Lee Jang-hwan and Park Seong-min, PhD students in the Department of Electronic Engineering at Hanyang University, with Professor Choi Jung-wook and the KT research team serving as corresponding authors. The second paper, RA-LoRA: Rank-Adaptive Parameter-Efficient Fine-Tuning for Accurate 2-bit Quantized Large Language Models, was authored by Kim Min-soo, a PhD student in the Department of Electronic Engineering at Hanyang University, with Professor Sung Wong-yong as a participating author and Professor Choi Jung-wook as the corresponding author.
The papers were presented in the Main track and the Findings track, respectively, at ACL 2024: The 62nd Annual Meeting of the Association for Computational Linguistics, a world-renowned natural language processing conference held from August 11th to 16th.
Click to see the paper:
https://aclanthology.org/2024.acl-long.612.pdf
