Kingsoft Cloud and Solidigm™ Co-Design an Innovative Object Storage Solution for AI Workloads

Kingsoft Cloud is a multibillion-yuan independent cloud service provider in China.[1] The company provides a highly secure, reliable distributed cloud storage service that delivers large storage capacity at a low cost.

The AI revolution has pushed the boundaries of storage architecture and created new requirements. For years, Kingsoft has been a leader in the industry, developing a comprehensive suite of software and cloud services, including Kingsoft Cloud for cloud storage and WPS Office for office software. Kingsoft Cloud chose Solidigm SSDs for its latest object storage solution, KS3 Extreme Speed. The bandwidth of KS3 Extreme Speed scales dynamically with data volume: the larger the SSD capacity, the more bandwidth the system can offer.

To keep up with today’s workloads, Kingsoft customers like WPS Office need faster access to their applications. To address this, Kingsoft expanded its storage architecture in both performance and capacity. By replacing HDDs with Solidigm SSDs, Kingsoft improved bandwidth by more than 100x, to over 1 terabit per second (Tbps) per petabyte.[2] This is a significant benefit for workloads such as Artificial Intelligence Generated Content (AIGC), animation rendering, and high-performance computing (HPC).

Solidigm offers a broad portfolio of SSDs to help us optimize architecture for demanding applications like AI. We can now provide the right balance of performance, cost and efficiency.
Hongxing Gan, Senior Expert of Kingsoft Object Storage Solutions
Kingsoft Cloud KS3 Extreme Speed vs standard object storage and PL1 and PL2.

Figure 1. Evolution of Kingsoft Cloud's storage architecture

Benefits of Kingsoft Cloud KS3 Extreme Speed

  • KS3 Extreme Speed offers three performance levels based on storage capacity. PL1 provides 200 gigabits per second (Gbps) per petabyte, PL2 provides 500 Gbps per petabyte, and PL3 offers the highest performance at 1 Tbps per petabyte.
  • KS3 Extreme Speed has a redesigned garbage collection mechanism that enables zero-cost space reclamation, improving the performance and longevity of the SSD.
  • KS3 has made significant improvements to thread scheduling, making the process much faster and more efficient. The optimized internal scheduling module prevents long-tail tasks from blocking requests, which shortens response times.
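
Because each performance level is rated per petabyte, the bandwidth available to a deployment can be estimated from its capacity. The following is a minimal Python sketch of that arithmetic. The function name and the strictly linear scaling (no minimum allocation, no upper cap) are assumptions made for illustration; they are not a documented Kingsoft Cloud API or pricing rule.

```python
# Per-petabyte bandwidth ratings quoted above, in gigabits per second
# (1 Tbps is written as 1000 Gbps).
TIER_GBPS_PER_PB = {"PL1": 200, "PL2": 500, "PL3": 1000}


def aggregate_bandwidth_gbps(tier: str, capacity_pb: float) -> float:
    """Estimate total bandwidth in Gbps for a performance level and capacity.

    Assumption (not from the source): bandwidth scales strictly linearly
    with capacity, with no minimum allocation or upper cap.
    """
    if tier not in TIER_GBPS_PER_PB:
        raise ValueError(f"unknown performance level: {tier}")
    if capacity_pb < 0:
        raise ValueError("capacity must be non-negative")
    return TIER_GBPS_PER_PB[tier] * capacity_pb


if __name__ == "__main__":
    # Print the estimated bandwidth of each level for a 2 PB deployment.
    for tier in TIER_GBPS_PER_PB:
        print(tier, aggregate_bandwidth_gbps(tier, 2.0), "Gbps at 2 PB")
```

Under these assumptions, doubling the stored capacity doubles the estimated bandwidth at any level, which is the scaling behavior the performance levels describe.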

Figure 1 compares Kingsoft’s previous architecture with its new architecture. In the old design, a file system cache was deployed in front of the S3 service because the S3 service alone could not support the high throughput needed for intensive applications such as AI. Kingsoft needed a new, more efficient architecture that removed these bottlenecks. With the new all-flash design, Kingsoft clients can directly connect object storage to S3 because the object lifetime is set inside S3. This new design offers a better balance of capacity, performance, and cost.

Kingsoft Cloud S3 vs Kingsoft Cloud KS3 Extreme Speed server design.

Figure 2. Kingsoft Cloud S3 vs KS3 Extreme Speed

Business challenge

Today’s AI workloads use larger data sets and create larger models. To make AI simple to deploy and manage, Kingsoft has created an out-of-the-box solution to address a variety of AI workloads.

In certain AI scenarios, high I/O throughput is crucial for training large models. Faster storage is critical to training AI models efficiently, as these systems require high input/output operations per second (IOPS) to process vast amounts of data and perform various calculations in real time.

Take a large 175-billion-parameter model as an example, with an assumed training data volume of 40 TB. Using standard object storage with a throughput capacity of 20 Gbps per petabyte, loading all of the training data would take a minimum of 535 minutes.

With KS3 Extreme Speed Object Storage, which has a throughput capacity of 1 Tbps per petabyte, all of the data could be loaded in as little as 11 minutes,[3] a 48.6x improvement. This is just one example. Other requirements that KS3 Extreme Speed addresses include:

  • Demand for high-performance elastic scaling: Data centers must satisfy the high IOPS requirements of training, deep learning, and other applications that have a large number of small files requiring low-latency data access. This places a combination of demands on the overall storage system, including high IOPS, high concurrency, high reliability, high flexibility, and scalability, all of which are needed to solve the complexity and performance issues related to the rapid growth of data.
  • Data lifecycle management requirements: Using a typical AI training workflow as an example, the data collection, data cleaning, and tagging processes require the processing of a huge amount of unstructured data such as images or text. These types of data require a large amount of storage space and high-concurrency sequential read and write access, which can become costly.
  • No speed reduction even with faults: With KS3 Extreme Speed, organizations can better deal with challenges of system operation under a single machine failure. This is because it contains four major hardware fault troubleshooting systems that can reduce hardware failure damage while allowing the systems to run just as fast as before, even if a failure occurs. 
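
Returning to the training-data loading example earlier in this section, the 48.6x figure can be checked directly from the two cited load times. The short Python sketch below does that, and also back-calculates the effective throughput each load time implies for 40 TB of data. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb = 10^9 bits), which the source does not specify, and the function name is illustrative only.

```python
DATA_TB = 40          # training data volume from the example
SLOW_MINUTES = 535    # cited load time on standard object storage
FAST_MINUTES = 11     # cited load time on KS3 Extreme Speed


def implied_throughput_gbps(data_tb: float, minutes: float) -> float:
    """Effective throughput in Gbps needed to move data_tb in the given time.

    Assumption: decimal units, 1 TB = 10**12 bytes and 1 Gb = 10**9 bits.
    """
    bits = data_tb * 1e12 * 8
    return bits / (minutes * 60) / 1e9


# Ratio of the two cited load times.
speedup = SLOW_MINUTES / FAST_MINUTES

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x")
    print(f"standard: {implied_throughput_gbps(DATA_TB, SLOW_MINUTES):.1f} Gbps")
    print(f"extreme:  {implied_throughput_gbps(DATA_TB, FAST_MINUTES):.1f} Gbps")
```

The source does not state the provisioned capacity or the overheads behind the cited times, so the implied throughput values are only a rough cross-check and are not expected to equal the nominal per-petabyte ratings.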

How Solidigm SSDs provided the right storage

Data pressure brought by emerging services such as AI makes it imperative for Kingsoft Cloud’s hardware to remain up to date. Kingsoft Cloud initially found it feasible to improve storage I/O performance by replacing SATA SSDs and SATA HDDs, but further scrutiny determined that this was not the most cost-effective or efficient approach. Instead, by fully transitioning to TLC NVMe SSDs, Kingsoft could meet its I/O performance requirements.

However, after additional research with the Solidigm team, Kingsoft found an even better storage solution in QLC SSDs. With 33% more bits per cell than TLC, Solidigm QLC SSDs enable 3x8 storage consolidation, leading to lower total operational costs. Solidigm offers QLC SSDs ranging from 7.68 TB to 60.72 TB, with the same endurance and performance as TLC SSDs.

“We had multiple rounds of in-depth communication with Solidigm to understand each other's system characteristics, which gave us a better understanding of the value of all-flash storage. We can now reduce our write amplification factor (WAF) and improve overall throughput and stability,” says Hongxing Gan.

The collaboration between Kingsoft Cloud and Solidigm produced meaningful results. Both Solidigm TLC and QLC SSDs have been shown to improve the capabilities of Kingsoft’s object storage services and help reduce its operational costs. Solidigm also takes quality and reliability to the next level, with a customer care team that provides Kingsoft with more effective support overall.

“Kingsoft Cloud will continue to strengthen its technical and product capabilities based on all-flash media, combined with the development of Solidigm QLC technology, focusing on cost to create high-performance, cost-effective object storage products and deliver greater value to users in various civil sectors,” says Hongxing Gan.

About the Authors 

Jeniece Wnorowski, Product Marketing Manager at Solidigm, has over 14 years of experience in data center storage solutions. Jeniece got her start in technical marketing at Intel Corporation, then joined Solidigm where she continues to evangelize data center SSD innovations with a variety of companies and partners. Outside of work, Jeniece enjoys spending time with her kids, training for jiu jitsu, and exploring the outdoors.

Wayne Gao is a Principal Engineer and storage solution architect who worked on CSAL from PF through its Alibaba commercial release. A former member of the DellEMC ECS all-flash object storage team, Wayne has over 20 years of storage development experience, 4 US patent filings/grants, and 1 published EuroSys paper.

Notes

[1] https://www.macrotrends.net/stocks/charts/KC/kingsoft-cloud-holdings/total-assets

[2] https://mp.weixin.qq.com/

[3] https://mp.weixin.qq.com/