Elice Inc. Releases 190-Billion-Token Korean AI Education Dataset on Hugging Face


Caption: Demo of Elice Inc.’s Korean FineWeb-Edu training dataset released on Hugging Face (Korean FineWeb-Edu Demo)

High-quality Korean educational data to support LLM training for academic and educational domains


Elice Inc., an AI full-stack company providing AI infrastructure, cloud, and industry-specific solutions, has released two Korean educational datasets on the global open-source platform Hugging Face. By offering high-quality data optimized for training Korean AI models, Elice Inc. aims to support AI research and development in Korea and abroad, enabling broad use by researchers, developers, and enterprises.

The two newly released datasets are designed to improve the Korean-language performance of large language models (LLMs) in academic and educational domains: “Korean FineWeb-Edu Demo” and “Korean Web Text Education Dataset (Korean-webtext-edu).”

The Korean FineWeb-Edu Demo is a sample dataset built from 5% of the korean-translated-fineweb-edu-dedup corpus, which is a Korean translation of FineWeb-Edu, an English educational web text corpus. It is designed for training Korean LLMs in academic and educational domains and is provided as a demo to validate data characteristics and potential applications before large-scale training.

The source dataset, korean-translated-fineweb-edu-dedup, is a large-scale text corpus of approximately 190 billion (190B) tokens, equivalent to tens of millions of pages. When used together with multilingual data, it reaches a scale suitable for training foundation models. While the newly released Korean FineWeb-Edu Demo contains only a 5% sample of this corpus, it still ranks among the largest high-quality open-source Korean datasets currently available.
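As a rough sense of scale, the size of the demo can be estimated from the two figures above. The sketch below assumes the 5% sample is proportional in tokens, which the release does not state (the sample could equally be drawn by document count); the constant and function names are illustrative only.

```python
# Back-of-the-envelope estimate of the demo's size.
# Assumption: the 5% sample is proportional in tokens, not in documents.

SOURCE_TOKENS = 190_000_000_000  # approximate size of the source corpus (190B)
SAMPLE_PERCENT = 5               # share of the corpus included in the demo


def sample_tokens(total_tokens: int, percent: int) -> int:
    """Return the token count of a sample that is `percent` of the total."""
    return total_tokens * percent // 100


demo_tokens = sample_tokens(SOURCE_TOKENS, SAMPLE_PERCENT)
print(f"Estimated demo size: {demo_tokens / 1e9:.1f}B tokens")
```

Under that assumption the demo would hold roughly 9.5 billion tokens.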

The Korean Web Text Education Dataset (Korean-webtext-edu), released alongside the demo, is built by filtering large-scale Korean web text to retain only content that passes an educational value threshold. Its construction involved evaluating factual accuracy, contextual consistency, and educational suitability, making it well-suited for training Korean AI models.
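The threshold-based filtering described above can be illustrated with a minimal sketch. It assumes each document already carries a numeric educational-value score; the 0-to-5 scale and the cutoff of 3 mirror the original English FineWeb-Edu setup and are assumptions here, since the release does not give the scoring scale or threshold used for Korean-webtext-edu. The `ScoredDoc` type, the function name, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    edu_score: float  # hypothetical educational-value score, assumed 0-5


def filter_by_edu_score(docs: list[ScoredDoc], threshold: float = 3.0) -> list[ScoredDoc]:
    """Keep only documents whose score meets or exceeds the threshold."""
    return [doc for doc in docs if doc.edu_score >= threshold]


# Made-up examples for illustration; not taken from the released datasets.
docs = [
    # "Photosynthesis is the process of converting light energy into chemical energy."
    ScoredDoc("광합성은 빛 에너지를 화학 에너지로 바꾸는 과정이다.", 4.2),
    # "Today-only special! Buy now."
    ScoredDoc("오늘만 특가! 지금 바로 구매하세요.", 0.6),
    # "The Pythagorean theorem describes the relation between the sides of a right triangle."
    ScoredDoc("피타고라스 정리는 직각삼각형의 세 변 사이의 관계를 설명한다.", 3.0),
]

kept = filter_by_edu_score(docs)
print(f"Kept {len(kept)} of {len(docs)} documents")
```

In this toy run the advertisement is dropped and the two explanatory passages are retained. A real pipeline would obtain the scores from a trained classifier or an LLM-based rater and, per the release, would also weigh factual accuracy and contextual consistency.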

Large-scale open-source Korean datasets to lower barriers to AI research and expand adoption


This dataset release is grounded in Elice Inc.’s accumulated experience across AI infrastructure, model training, and applications in education and industry. Through these datasets, Elice Inc. seeks to lower the barriers to entry in Korean AI research while supporting broader adoption of Korean AI models in education, research, and the public sector. The company also plans to leverage its strengths in AI infrastructure, cloud, and data engineering to accelerate the development of Korean-specialized AI services and solutions.

Suin Kim, Chief Research Officer (CRO) at Elice Inc., who led the dataset development, said, “Data accessibility and quality are core drivers of progress in AI technology. By applying criteria validated in real-world model training and service environments, Elice Inc. has built high-quality datasets that researchers, developers, and enterprises can easily use.”

Kim added, “We will continue to contribute to the growth of Korean AI research and the broader industrial ecosystem by advancing our capabilities across data, models, and infrastructure.”
