
Customer

NC AI

Introduced services

NC AI’s NVIDIA B200 Proof of Concept | Computing Infrastructure Validation for an Independent AI Foundation Company


With the spread of generative AI, large-scale model research environments have become far more complex than before. To ensure stable research today, it’s no longer enough to simply adopt more powerful GPUs; you also need to co-design the network architecture, storage layout, distributed training strategy, and data flows.


NC AI, the team behind the multimodal AI model ‘VARCO’, is one of the five finalists in the Ministry of Science and ICT’s ‘Independent AI Foundation Model (National AI Champion)’ initiative. This national strategic project aims to develop Korean LLMs and multimodal models with domestic capabilities, while driving both large-scale model research and real-world adoption.


Against this backdrop, NC AI conducted a PoC to evaluate the real-world performance of NVIDIA’s next-generation B200 GPUs on its research workloads, and to assess the stability and scalability of Elice Cloud as large-scale model training infrastructure. From initial environment setup through training runs and stability validation, Elice Cloud worked side by side with NC AI to verify the practical viability of a B200-based research environment.


This experiment was not simply about comparing GPU performance. Its goal was to determine how much the existing training pipeline would need to change when moving from an H100-based setup to B200, whether distributed training, data loading, and checkpointing would operate reliably, and what configurations and guidelines would be required in the initial operations phase. In short, it was a fast-track validation of how the specifications and behavior change when switching to B200.


B200 Cluster Configured to Match a Real-World Research Environment


Given the limited PoC timeframe, the B200 cluster was configured slightly smaller than our full-scale production setup, but still capable of running actual model experiments. This allowed us to validate key aspects that must be checked when introducing a new GPU architecture, including distributed training operations, data processing flows, and the interplay between storage and networking.

The environment was configured as follows:

  • 2 nodes with 8× NVIDIA B200 GPUs each
  • InfiniBand-based high-bandwidth network
  • Approximately 30 TB of local NVMe storage
  • Docker-based runtime environment
  • Distributed training using PyTorch, Megatron-LM, and NCCL
  • Integration with both internal and external object storage
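As a rough sketch of how the distributed-training stack above is typically launched, the snippet below assembles a torchrun command for a 2-node, 8-GPU-per-node topology. The entry-point script, master address, and NCCL setting are illustrative placeholders, not the PoC's actual configuration:

```python
# Hypothetical sketch: assemble a torchrun launch for a 2-node cluster with
# 8 GPUs per node, matching the topology described above. "pretrain.py", the
# master address, and the NCCL debug level are placeholders.

def torchrun_launch(nnodes=2, gpus_per_node=8, node_rank=0,
                    master_addr="10.0.0.1", master_port=29500):
    env = {
        "NCCL_DEBUG": "INFO",   # verbose communicator logs for stability checks
    }
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "pretrain.py",          # placeholder PyTorch/Megatron-LM entry point
    ]
    return env, cmd

# One such command is issued per node, with node_rank 0..nnodes-1.
env, cmd = torchrun_launch(node_rank=0)
```

Each node runs the same command with its own `node_rank`; NCCL then handles the inter-node collectives over the InfiniBand fabric.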

To minimize the impact of environmental differences on the results, we aligned the PoC setup as closely as possible with NC AI’s existing training methodology.


Validation Approach Aligned with the Actual Research Workflow

The PoC was designed to mirror the exact workflow researchers follow when using a new GPU environment.

  • Driver and library installation
  • Container environment setup and package validation
  • NVMe storage mounting and I/O performance checks
  • InfiniBand bandwidth verification and communication tuning
  • Internal and external S3 data loading and throughput validation
  • Distributed training runs and log-based stability checks
  • FP8 training and evaluation of the latest attention mechanisms
  • Checkpoint saving and restart verification
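The final item, checkpoint saving and restart, can be illustrated with a minimal atomic-write pattern. This is a stdlib stand-in that saves a plain dict; in the actual pipeline, torch.save/torch.load (or Megatron-LM's checkpoint utilities) would carry the model and optimizer state:

```python
# Sketch of a save-then-restart check. Writing to a temp file and renaming
# means a crash mid-write never leaves a torn checkpoint behind.
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)   # atomic rename on POSIX filesystems

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, step=100, state={"lr": 0.0003})
restored = load_checkpoint(ckpt)   # a restart would resume from restored["step"]
```

The restart check then verifies that training resumed from the restored step produces logs consistent with the original run.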

Throughout this process, the most important criteria were repeatability and reproducibility — ensuring that the same runs could be executed again with consistent results. To that end, we recorded everything from the initial setup steps through training logs and rerun validation.
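As a minimal illustration of the repeatability criterion, the sketch below fixes a seed, reruns a stand-in "training step", and fingerprints the run configuration plus results so a rerun can be checked against the recorded log. (Real runs would additionally seed torch, NumPy, and CUDA kernels; this is only the shape of the check.)

```python
import hashlib
import json
import random

def run_step(seed):
    """Stand-in for a seeded training step; returns deterministic 'losses'."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(3)]

def fingerprint(config, losses):
    """Hash of config + results; matching hashes indicate a faithful rerun."""
    payload = json.dumps({"config": config, "losses": losses}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

cfg = {"seed": 42, "nodes": 2, "gpus_per_node": 8}
first, rerun = run_step(cfg["seed"]), run_step(cfg["seed"])
assert first == rerun           # same seed, same results
record = fingerprint(cfg, first)  # stored alongside the training logs
```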


A Co-Engineering Approach to Boost Reproducibility Across Setup, Execution, and Logging

▲ Example PoC log snapshot


With a new architecture, there are always aspects of the environment that are hard to understand from documentation alone. To address this, the PoC was carried out in a co-engineering manner: at every stage — setup, execution, and validation — we shared information in real time and immediately reflected any findings back into the environment.


During this process, the team tuned timeout settings for large-scale data uploads, established initial configuration guidelines for the container environment, and adjusted key parameters required for stable InfiniBand connectivity.
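To give a flavor of the kinds of knobs involved, the sketch below groups the two tuning areas mentioned above: NCCL settings for InfiniBand stability and client-side timeouts for large uploads. The variable names are real NCCL and boto3-style settings, but every value here is an illustrative placeholder, not the parameters actually chosen during this PoC:

```python
# Illustrative only: example tuning knobs, not the PoC's actual values.
import os

# NCCL settings relevant to stable InfiniBand connectivity.
nccl_env = {
    "NCCL_IB_TIMEOUT": "22",       # longer IB completion timeout for large jobs
    "NCCL_IB_RETRY_CNT": "7",      # more retransmit attempts on transient faults
    "NCCL_SOCKET_IFNAME": "eth0",  # bootstrap interface (placeholder name)
    "NCCL_DEBUG": "WARN",          # surface communication errors in the logs
}
os.environ.update(nccl_env)

# Client-side timeouts so multi-GB uploads to object storage are not cut off
# mid-transfer (shape follows botocore's Config options).
s3_timeouts = {
    "connect_timeout": 60,
    "read_timeout": 600,
    "retries": {"max_attempts": 5},
}
```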


As a result, we were able to identify early tuning points in both the distributed training and storage handling stages, and secure a stable training environment for reruns. Based on this experience, Elice documented an operational guide so that the same practices can be applied to future large-scale training environments.


Verifying the Potential of Next-Generation GPU Migration


Through this PoC, we confirmed that the B200 environment can operate stably under real research workloads. While we did not observe immediate performance gains compared to the H100 at the outset, we regarded this as a typical optimization phase that accompanies the introduction of a new architecture.


What matters more than the raw numbers is securing a stable foundation for migration. We established the initial environment configuration, defined reproducible procedures, and identified the checkpoints required for real-world operations. This, in turn, clarified the direction for future tuning and laid the groundwork for repeatable experiments and broader rollout.


In other words, this PoC was not just a performance test; it was an opportunity for NC AI to gain first-hand experience with a next-generation GPU architecture and explore operational best practices tailored to its research environment.


Although we did not see immediate efficiency improvements, the deeper understanding of the new GPU architecture and the initial setup experience provided crucial insights that will reduce trial-and-error during future migrations. Based on this PoC, NC AI now has a high level of confidence in the potential of the B200 and is actively considering its adoption in future research environments.


AI Infrastructure Transformation with Elice Cloud

Through this PoC, NC AI verified the conditions and operational standards under which a B200-based environment can run stably on real research workloads. The team also identified areas that require initial tuning and, based on those findings, defined a clear direction for further optimization.


Introducing next-generation GPUs is not just about validating hardware performance; it also requires evolving the broader research environment and operational framework. Backed by Asia’s first water-cooled NVIDIA B200 data center infrastructure and extensive hands-on PoC experience with enterprise customers, Elice helps research organizations adopt and scale into new GPU environments with confidence and stability.
