[Benchmark] Elice's Koreanized Llama 3.1 8B Instruct
Elice
4/8/2025
In this report, we share our journey of mitigating a common issue in open-source language models: language mixing in responses. By fine-tuning Llama 3.1 8B Instruct, we not only resolved the language mixing issue but also made the model's responses more precise and succinct. We describe our methodology, implementation, and experimental results below.
1. Introduction
Open-source large language models (LLMs) often suffer from language mixing in their responses. The problem is more pronounced in smaller models, whose limited capacity can lead to unstable language handling. In our experiments with Llama 3.1 8B Instruct, we observed significant language mixing in responses to Korean queries, which prompted us to fine-tune the model for better native-language performance while preserving its multilingual capabilities.
2. Problem Statement
During our initial testing of 1,000 Korean-language queries, approximately 27% of responses (about 270) exhibited language mixing. This inconsistency detracted from the user experience, especially for native Korean speakers who expect clear, monolingual responses.
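How language mixing is counted matters. Our actual checks are more involved, but a minimal script-based heuristic like the following conveys the idea; the patterns, threshold, and example responses here are illustrative assumptions, not our production classifier:

```python
import re

# Hangul syllables block (U+AC00..U+D7A3).
HANGUL = re.compile(r"[\uac00-\ud7a3]")
# Crude "foreign run" pattern: five or more consecutive Latin-script words.
# A real check should whitelist code snippets, URLs, and common loanwords.
LATIN_RUN = re.compile(r"(?:[A-Za-z]+\s+){4,}[A-Za-z]+")

def is_mixed(response: str) -> bool:
    # Flag a response that contains Korean AND a long Latin-script run.
    return bool(HANGUL.search(response)) and bool(LATIN_RUN.search(response))

responses = [
    "서울은 대한민국의 수도입니다.",                     # pure Korean -> not flagged
    "서울은 the capital city of South Korea 입니다.",  # mixed -> flagged
]
print(sum(is_mixed(r) for r in responses) / len(responses))  # -> 0.5
```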
3. Mitigation Strategy
3.1 Fine-Tuning Dataset Composition
To address the language mixing issue, we designed a two-pronged fine-tuning strategy:
- 80% Korean Instruct Dataset: A carefully curated dataset of Korean instructions to strengthen native language performance.
- 20% Diverse Language Data: A mix of primarily English content plus other languages, to ensure the model retains its multilingual abilities (one way to assemble this mix is sketched below).
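As a sketch of how such a mixture can be built with the Hugging Face datasets library, assuming two pre-built JSONL corpora (the file names are placeholders for your own curated data):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder file names -- substitute your own curated corpora.
korean_ds = load_dataset("json", data_files="korean_instruct.jsonl", split="train")
multi_ds = load_dataset("json", data_files="multilingual_mix.jsonl", split="train")

# Sample ~80% Korean / ~20% multilingual, matching the ratio described above.
mixed_ds = interleave_datasets(
    [korean_ds, multi_ds],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",  # keep drawing until both sets are consumed
)
```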
3.2 Overcoming Catastrophic Forgetting
A major challenge in fine-tuning is catastrophic forgetting: training on a small, specialized dataset can cause the model to overfit and lose abilities it previously had, such as code generation. We observed this phenomenon in our experiments, where fine-tuned checkpoints occasionally failed to generate code.
Key Factors for Success:
- Low Learning Rate: We maintained an extremely low learning rate (approximately 1e-7) to minimize overfitting and preserve the model's original capabilities (see the configuration sketch after this list).
- Dataset Diversity: Including a variety of languages in the fine-tuning set helped balance specialization in Korean with overall performance across multiple tasks.
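A minimal sketch of training arguments reflecting these factors follows. Apart from the roughly 1e-7 learning rate described above, the values are illustrative assumptions rather than our exact configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama31-8b-ko-instruct",
    learning_rate=1e-7,               # extremely low LR to limit forgetting
    num_train_epochs=1,               # few passes over the mixed dataset
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size of 32 per device
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,                        # A100 GPUs support bfloat16
    logging_steps=10,
)
```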
4. Implementation Details
Our fine-tuning process was implemented using the Huggingface Transformers library. We utilized Nvidia A100 GPUs on an on-demand instance provided by the Elice Cloud platform.
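Continuing the dataset and hyperparameter sketches above, a minimal Transformers training loop might look like the following. The preprocessing details (a pre-formatted "text" field with the chat template already applied, the 2,048-token cutoff) are assumptions for illustration:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

# Gated model on the Hugging Face Hub; accept the license before downloading.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def tokenize(batch):
    # Assumes each record carries a pre-formatted "text" field.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# mixed_ds and training_args come from the sketches in Section 3.
tokenized = mixed_ds.map(tokenize, batched=True, remove_columns=mixed_ds.column_names)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels (causal LM).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```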
Tip: Elice Cloud offers USD 5 of free credit, equivalent to roughly 3.5 hours of A100 time, which is sufficient to reproduce this experiment.
5. Experimental Results
After fine-tuning, we re-tested the model on the same set of 1,000 Korean queries with the following outcomes:
- Language Mixing Rate: Reduced from approximately 27% to under 1%.
- Benchmark Performance: The fine-tuned model scored nearly identically to the base model on popular benchmarks such as MMLU, HumanEval, MBPP, and GSM8K. There was a minor dip on some tasks, but the differences were negligible in practical applications.
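One way to reproduce this kind of benchmark comparison is EleutherAI's lm-evaluation-harness (pip install lm-eval). The snippet below is a sketch with a placeholder model path; code benchmarks such as HumanEval and MBPP additionally require explicitly allowing execution of generated code:

```python
import lm_eval

# Placeholder path -- point at your fine-tuned checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/finetuned-model,dtype=bfloat16",
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task scores for the base-vs-tuned comparison
```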
In addition to resolving the language mixing issue, our model now returns more precise and succinct answers, enhancing the overall user experience.
6. Real-World Examples
Beyond mitigating language mixing, the fine-tuned model delivers more accurate and concise responses. The table below compares responses from the original model with those from our fine-tuned version.
7. Discussion
Our experiments demonstrate that a well-planned fine-tuning strategy, combining a diverse, carefully balanced dataset with a very low learning rate, can resolve language mixing without sacrificing the model's overall performance. Moreover, the fine-tuned model's responses are not only linguistically consistent but also more precise and succinct, making it more useful in practical applications.
8. Conclusion
This technical report has detailed our approach to fine-tuning Llama 3.1 8B Instruct to address the pervasive issue of language mixing in open-source LLMs. By employing an 80/20 mix of Korean and other-language data, carefully controlling the learning rate, and leveraging a robust fine-tuning framework, we achieved a model that performs reliably in Korean while maintaining broad benchmark performance.
If you are facing similar challenges, consider applying these insights to optimize your own fine-tuning process.
For more technical insights and updates, stay tuned to our tech blog.
- #benchmark
- #Llama3.1