AI is data Pac-Man. Winning requires a flashy new storage strategy

On Nov 23, 2019

When it comes to data, AI is like Pac-Man. Hard disk drives, NAS, conventional data center and cloud-based storage schemes can’t sate AI’s voracious appetite for speed and capacity, especially for real-time workloads. Playing the game today requires a fundamental rethinking of storage as a foundation of machine learning, deep learning, image processing, and neural network success.

“AI and Big Data are dominating every aspect of decision-making and operations,” says Jeff Denworth, vice president of products and co-founder at Vast Data, a provider of all-flash storage and services. “The need for vast amounts of fast data are rendering the traditional storage pyramid obsolete. Applying new thinking to many of the toughest problems helps simplify the storage and access of huge reserves of data, in real time, leading to insights that were not possible before.”

AI driving storage surge

Storage is being reshaped by a variety of new technologies and architectures that can provide the high bandwidth, massive capacity, fast I/O, low latency, and flexible scalability needed by various types of AI. Key among them are solid state disk (SSD), flash drives and caching software, NVMe, DAOS, storage-class memory (SCM), and hybrids such as Intel Optane media that close the gap between storage and memory.

Advances like these that keep up with 5G, IoT, streaming analytics, and other speed and data demons of the AI era are fueling a global surge in storage demand.

McKinsey says the combined storage needed by AI applications worldwide will grow tenfold by 2025, from 80 exabytes per year to 845 exabytes. (Exabyte = 1,048,576 terabytes.) That represents market segment growth of 25-30% a year. Healthcare, with 54% adoption of AI forecast by 2023, will be a major driver, as will AI and DL training in many industries.

“An optimized AI and ML workflow requires the right balance of compute, memory, and storage,” says Patrick Moorhead, founder of Moor Insights & Strategy. “There has been a lot of talk about optimized ML compute, but not storage.” That is changing rapidly.

“Feed me – now!” Capacity and bandwidth are key

The reasons are pretty simple: AI applications consume and generate mind-boggling amounts of data – up to hundreds of petabytes or more per project.

For example, Intel research shows:

A smart hospital will generate 3,000 GB/day
A self-driving car will generate over 4,000 GB/day
A connected airplane will generate 5,000 GB/day
A connected factory will generate 1 million GB/day
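
To put those daily totals in storage terms, they can be converted into the average rate a system would have to absorb around the clock. The sketch below is illustrative only: it assumes decimal units (1 GB = 1,000 MB) and perfectly even arrival, and the function name is hypothetical. Real workloads are bursty, so peak rates would be higher.

```python
# Daily data volumes quoted above (GB/day). The self-driving car figure is
# "over 4,000", so 4,000 is treated here as a lower bound.
DAILY_GB = {
    "smart hospital": 3_000,
    "self-driving car": 4_000,
    "connected airplane": 5_000,
    "connected factory": 1_000_000,
}

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sustained_mb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in MB/s.

    Assumes decimal units (1 GB = 1,000 MB) and data arriving evenly
    over 24 hours; both are simplifying assumptions.
    """
    return gb_per_day * 1_000 / SECONDS_PER_DAY


if __name__ == "__main__":
    for source, gb in DAILY_GB.items():
        print(f"{source:20s} {sustained_mb_per_s(gb):10.1f} MB/s")
```

Under these assumptions, the connected factory works out to more than 11 GB of new data every second, sustained.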

Consider: Simple facial recognition to identify a man or woman needs roughly 100 million images. Total storage for the required 8-bit files tops 4.5 PB.
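
As a rough sanity check, the two figures above imply an average file size per image. This is a back-of-the-envelope sketch that assumes decimal petabytes (1 PB = 10^15 bytes); the function name is hypothetical.

```python
BYTES_PER_PB = 10**15  # decimal petabyte (assumption)


def avg_bytes_per_image(total_pb: float, image_count: int) -> float:
    """Average storage per image implied by a total size and an image count."""
    return total_pb * BYTES_PER_PB / image_count


if __name__ == "__main__":
    per_image = avg_bytes_per_image(4.5, 100_000_000)
    print(f"{per_image / 10**6:.0f} MB per image")
```

With the numbers quoted, that comes to roughly 45 MB per image.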

But it’s not just sheer volume. This massive data generally depends on real-time analysis to make it valuable. Unfortunately, feeding GPUs and other data-hungry AI compute nodes far outstrips the ability and economics of hard drives.

By one calculation, at a 64KB block size, it would take approximately 5,000 HDDs to deliver the random-read IOPS needed to saturate a GPU server running at 20GB/s. (In contrast, an NVMe flash drive can deliver up to 1000x the performance for this workload.)
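
The arithmetic behind that estimate can be sketched as follows. The throughput, block size, and drive count come from the paragraph above; the per-drive IOPS figure is back-solved from them, not taken from any drive datasheet, and the function names are hypothetical. Decimal units are assumed (1 GB = 10^9 bytes, 1 KB = 10^3 bytes).

```python
import math


def required_iops(throughput_bytes_per_s: float, block_bytes: int) -> float:
    """Random-read operations per second needed to sustain a given throughput."""
    return throughput_bytes_per_s / block_bytes


def drives_needed(target_iops: float, iops_per_drive: float) -> int:
    """Minimum number of drives whose combined IOPS meets the target."""
    return math.ceil(target_iops / iops_per_drive)


GPU_SERVER_BYTES_PER_S = 20 * 10**9  # 20 GB/s, from the article
BLOCK_BYTES = 64 * 10**3             # 64 KB, from the article (decimal assumed)
ARTICLE_HDD_COUNT = 5_000            # from the article

target = required_iops(GPU_SERVER_BYTES_PER_S, BLOCK_BYTES)
# Per-drive random-read IOPS implied by the article's own numbers.
implied_iops_per_hdd = target / ARTICLE_HDD_COUNT

if __name__ == "__main__":
    print(f"IOPS needed:           {target:,.0f}")
    print(f"Implied IOPS per HDD:  {implied_iops_per_hdd:.1f}")
    print(f"HDDs at that rate:     {drives_needed(target, implied_iops_per_hdd):,}")
```

In other words, the 5,000-drive figure corresponds to about 312,500 random reads per second, or roughly 60 per hard drive.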

Another wrinkle: AI workloads often originate at the edge or network spokes, beyond centralized data centers. This adds an extra architectural challenge for organizations, which must juggle the building of on-premises capacity with temporary cloud-bursting or permanent cloud infrastructure. And wherever they run, “AI workloads present fluctuating access patterns, variable read/write mixes and changing block sizes that all need high throughput and extremely low latency,” says Roger Corell, storage marketing manager at Intel.

All these demands show a clear need for a new approach to massive, scalable storage for AI.

Conventional solutions: Not serious contenders

Unfortunately, AI has exposed big gaps in the storage and memory hierarchy.

Traditional NAND SSDs may be strained to meet these requirements across the full data pipeline, Corell says. Moorhead notes that most file systems are not optimized for high-performance storage technologies like NVMe Flash, for example. They also don’t provide adequate data protection or support temporary cloud data movement, he notes.

Other approaches suffer the technology shortcomings of HDDs, and add some of their own:

Off-the-shelf NAS can work as a fast and easy solution for AI. But decreasing performance and increasing cost at scale make it a poor choice for large projects. In any case, most NAS appliances cannot scale beyond a few PB.

Cloud service providers (CSPs) may not offer the performance and configuration flexibility needed for specialized AI workloads. Other potential obstacles: limited network and storage bandwidth, which affects latency and throughput, and “noisy” hosting neighbors that slow application performance.

4 key technologies

So that’s what doesn’t work. What does?

Four underlying technologies also play key roles in modern, flash-based storage robust enough to handle the demands of AI.

SSDs, broadly speaking, offer the speed and low latency needed for many AI applications. New systems such as Intel Optane DC SSDs remove AI performance bottlenecks by delivering high throughput at low queue depth, a necessity for caching temporary data in Big Data workloads. Improved read latency enables faster and more consistent time to data analytics insight.

QLC 3D NAND technology stores four bits in every cell, 33% more than the three bits of TLC flash, making it the most economical way to build flash storage for AI. Slower write performance and lower endurance are offset by new data schemes that handle larger block sizes much faster. And combining the density of QLC with new, space-efficient form factors such as EDSFF can enable up to 3X more drives in the same rack space.
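
The 33% figure is consistent with comparing QLC (four bits per cell) against TLC (three bits per cell). The short sketch below works through that arithmetic; the helper names are hypothetical. It also shows how the number of voltage levels a cell must distinguish doubles with each added bit, which is generally cited as one reason denser cells write more slowly and wear out sooner.

```python
# Bits stored per cell for the common NAND flash cell types.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_levels(bits: int) -> int:
    """Distinct charge levels a cell must resolve to store `bits` bits."""
    return 2 ** bits


def capacity_gain(new: str, old: str) -> float:
    """Fractional increase in bits per cell when moving from `old` to `new`."""
    return (BITS_PER_CELL[new] - BITS_PER_CELL[old]) / BITS_PER_CELL[old]


if __name__ == "__main__":
    print(f"QLC vs TLC capacity gain: {capacity_gain('QLC', 'TLC'):.0%}")
    for cell, bits in BITS_PER_CELL.items():
        print(f"{cell}: {bits} bit(s), {voltage_levels(bits)} levels")
```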

NVMe over Fabrics enables new types of storage scaling and disaggregation. CPUs don’t all need to coordinate; each can operate independently, using multiple SSDs connected via data center Ethernet or InfiniBand, with latency comparable to direct-attached storage (DAS).

Storage class memory (SCM) boosts flash storage performance. It also acts as a shock absorber by placing data in flash in a way that doesn’t wear down the drive.

Enabling new AI businesses and applications

AI requires not only a reset in thinking of how storage is used, but about how it’s built and deployed. Advanced storage technologies are making possible new applications as well as new AI-based services and businesses.

Take Vast Data. The New York company provides next-generation storage appliances and services. All-flash design makes storage fast and scalable enough for ML and demanding apps like HPC, financial services, and life sciences. Vast’s “universal storage architecture” is essentially one big pool of flash for analytics and archiving. The advantage, explains Denworth, the co-founder: “Everything is always available, so a company can easily unleash new insights and ask questions of long-term data.”

Offering such AI-ready storage (“Tier 1 performance with Tier 5 economies”, as the company likes to say) was not possible before 2018. That kind of performance is simply not technologically or economically feasible with conventional HDD and cloud services.

Combining NVMe, SCM, and QLC with global compression allows for a radically new, “disaggregated” approach to data and storage, Denworth explains. Essentially, every server owns and shares every piece of media and data.

Zebra: Radiology analysis as a service

Another good example of a business based on AI and modern storage is Zebra Medical Vision. Started in 2014 by a group of MIT grads, the Israeli company offers automated, real-time medical image diagnosis as a service. For $1 or less per scan, radiologists and other healthcare providers get accurate help detecting and analyzing cancers and other medical conditions from CT scans and X-rays.

Innovative storage technology lets Zebra, a Vast customer, deliver on its motto. The company’s AI1 solution uses millions of imaging and correlated clinical records to create high-performance algorithms, which can identify problems and high-risk patients, prioritize urgent cases, and manage costs.

Zebra says its business is only possible with AI enabled by new storage technologies. Real-time diagnosis and analysis, says Eyal, CTO at Zebra Medical, demands “performance superior to what is possible with traditional NAS.” He adds that having “a simple, scalable appliance that requires no effort to deploy and manage” lets the company focus on rapid growth, not AI infrastructure.

Strategy: Different stages, different needs

Just as there’s no single type of AI, there’s no single “one-size-fits-all” strategy for AI storage. Like all good games, it’s a moving target. So smart planning requires disciplined analysis done in the context of wider AI infrastructure design.

For starters, AI’s storage needs change throughout the life cycle. During training, systems must store massive volumes of data as they refine algorithms. Flash and NVMe are well suited to this I/O intensive stage. During inference, only data that’s needed for future training must be stored. In contrast, deep learning systems need constant access to data to retrain themselves.

In some cases, outputs from AI systems might be small enough to be handled by tiered modern enterprise storage systems. In most cases, however, front-end AI functions will need updated, flexible storage.

The resources below can help you choose the best technology combination for your AI applications.

Build vs. buy

Should you buy, build, repurpose, or outsource storage resources to implement image recognition, natural language processing, or predictive maintenance workloads?

As always, the “right combination” depends on your application and situation. This chart provides some basic guidelines for making your best sourcing choice.

Gobble or get gobbled

Sexier AI algorithms and chips may get more attention. But smart players recognize the key role played by storage and storage infrastructure. Gobble or get gobbled.