Nvidia Open-Sources Its Deep Learning Inference Compiler for “NVDLA”

On Oct 7, 2019

Most of the computing effort in deep learning inference goes into mathematical operations that fall mostly into four groups: convolutions, activations, pooling, and normalization.

All four share a few characteristics that make them well suited to special-purpose hardware implementation: their memory access patterns are extremely predictable, and they are readily parallelized.
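As a rough illustration of those four operation groups, here is a minimal pure-Python sketch using one-dimensional data. The function names (conv1d, relu, max_pool, normalize) are made up for this example and are not part of any NVDLA software; real accelerators work on multi-dimensional tensors, but the regular loop structure is the same.

```python
import math

def conv1d(signal, kernel):
    """Valid 1-D convolution in the cross-correlation form used in deep learning."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Activation: clamp negative values to zero."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size):
    """Pooling: keep the maximum of each non-overlapping window."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def normalize(xs, eps=1e-5):
    """Normalization: shift to zero mean and scale to roughly unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def tiny_pipeline(signal, kernel):
    """Chain the four groups the way a small network layer stack would."""
    return normalize(max_pool(relu(conv1d(signal, kernel)), 2))
```

Every output element depends only on a fixed, known window of inputs, which is why the memory accesses are predictable and each element can be computed in parallel.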

Designing new custom hardware accelerators for deep learning is clearly popular, but achieving state-of-the-art performance and efficiency with a new design is a complex and challenging problem.

To help developers advance the adoption of efficient AI inferencing in custom hardware designs, Nvidia open-sourced the hardware design of the NVIDIA Deep Learning Accelerator (NVDLA) in 2017.

The NVIDIA Deep Learning Accelerator is both scalable and highly configurable. Its modular design maintains flexibility and simplifies integration, and it promotes a standardized, open architecture to address the computational demands of inference.

The same NVIDIA Deep Learning Accelerator (NVDLA) is shipped in the NVIDIA Jetson AGX Xavier Developer Kit, where it provides best-in-class peak efficiency of 7.9 TOPS/W for AI.

With the open-source release of the NVIDIA Deep Learning Accelerator (NVDLA) optimizing compiler on GitHub, system architects and software teams now have a starting point with the complete source for the world’s first fully open software and hardware inference platform.

NVDLA hardware provides a simple, flexible, robust inference acceleration solution. It supports a wide range of performance levels and readily scales for applications ranging from smaller, cost-sensitive Internet of Things (IoT) devices to larger performance-oriented IoT devices.

NVDLA is provided as a set of IP-core models based on open industry standards: the Verilog model is a synthesis and simulation model in RTL form, and the TLM SystemC simulation model can be used for software development, system integration, and testing.

The NVDLA software ecosystem includes an on-device software stack (part of the open-source release), a full training infrastructure to build new models that incorporate Deep Learning, and parsers that convert existing models to a form that is usable by the on-device software.

The NVDLA core hardware has six specialized hardware units which can be scheduled either concurrently or in a pipelined configuration. It also has both small and large hardware profiles.

The large profile includes advanced features such as an on-chip SRAM interface and the ability to attach a microcontroller. Further details about NVDLA’s profiles are available in the NVDLA documentation.

The hardware architecture is modular, and it is designed to be scalable from small embedded IoT designs to large data center-class chips using arrays of NVDLA units.

The compiler can be tuned based on various chosen factors: the NVDLA hardware configuration, the system’s CPU and memory controller configurations, and the application’s custom neural network use cases if desired.

Compiler optimizations such as layer fusion and pipeline scheduling work well for larger NVDLA designs, providing up to a 3x performance benefit across a wide range of neural network architectures.
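To make the idea of layer fusion concrete, here is a minimal sketch of one widely used form: folding a batch-normalization step into the linear layer that precedes it, so that two passes over the data become one. This is an illustration only. The function names are hypothetical, and the source does not say which fusions the NVDLA compiler actually performs.

```python
import math

def linear(x, w, b):
    """A single-output linear layer: dot product plus bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fuse_linear_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the layer's weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_w = [wi * scale for wi in w]
    fused_b = (b - mean) * scale + beta
    return fused_w, fused_b

# Usage: the fused layer gives the same result as running the two layers in sequence.
x, w, b = [1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.0
unfused = batch_norm(linear(x, w, b), gamma, beta, mean, var)
fused_w, fused_b = fuse_linear_bn(w, b, gamma, beta, mean, var)
fused = linear(x, fused_w, fused_b)
```

Because the fused layer never writes the intermediate result out and reads it back, it saves both computation and memory traffic.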

This optimization flexibility is key to achieving high power efficiency across both large network models such as ResNet-50 and small network models such as MobileNet.

For smaller NVDLA designs, compiler optimizations such as memory tiling are critical for power efficiency. Memory tiling enables a design to balance on-chip buffer usage between weight and activation data, and so minimizes off-chip memory traffic and power consumption.
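The trade-off behind memory tiling can be sketched with a toy cost model. The model below is an assumption made for illustration, not NVDLA's actual scheduling: weights are loaded tile by tile, and the activations must be re-read from off-chip memory once per weight tile unless they fit entirely in their share of the buffer. The names offchip_traffic and best_split are hypothetical.

```python
import math

def offchip_traffic(weight_bytes, act_bytes, weight_buf, act_buf):
    """Estimate bytes read from off-chip memory for one layer (toy model)."""
    weight_tiles = math.ceil(weight_bytes / weight_buf)
    # Activations are loaded once if they fit on-chip, else once per weight tile.
    act_loads = 1 if act_bytes <= act_buf else weight_tiles
    return weight_bytes + act_bytes * act_loads

def best_split(weight_bytes, act_bytes, buffer_bytes, step=1024):
    """Try each weight/activation split of the buffer; return the cheapest.

    Returns a tuple (traffic, weight_buf, act_buf). Ties go to the smallest
    weight buffer.
    """
    best = None
    for weight_buf in range(step, buffer_bytes, step):
        act_buf = buffer_bytes - weight_buf
        traffic = offchip_traffic(weight_bytes, act_bytes, weight_buf, act_buf)
        if best is None or traffic < best[0]:
            best = (traffic, weight_buf, act_buf)
    return best

# Usage: 16 KiB of weights, 64 KiB of activations, 32 KiB on-chip buffer.
traffic, weight_buf, act_buf = best_split(16 * 1024, 64 * 1024, 32 * 1024)
```

In this example the activations can never fit on-chip, so the cheapest split gives the weights just enough room to stay resident in one tile. A real compiler would weigh many more factors, but the goal of minimizing off-chip traffic is the same.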

Furthermore, users are free to create fully customized layers tuned for their own specialized use cases or experiment with the latest cutting-edge algorithms published in research.

Users can gauge the expected performance of the default NVDLA large profile model based on the performance numbers below. Measurements were captured using one of the two NVDLA cores on a Jetson AGX Xavier Developer Kit.

The open-source NVDLA project is managed as an open, directed community. NVIDIA welcomes contributions to NVDLA and will maintain an open process for developers who wish to submit changes back.

Contributors are expected to agree to a Contributor License Agreement, ensuring that any IP rights from a contributor are granted to all NVDLA users.

After the initial release, development will take place in the open. NVDLA software, hardware, and documentation will be made available through GitHub.

NVDLA hardware and software are available under the NVIDIA Open NVDLA License, which is a permissive license that includes a FRAND-RF patent grant.

Additionally, for users who build “NVDLA-compatible” implementations which interact well with the greater NVDLA ecosystem, NVIDIA may grant the right to use the “NVDLA” name, or other NVIDIA trademarks.

(This licensing description is meant to be informative, not normative; where this information conflicts with the NVDLA license, the NVDLA license supersedes.)

“We are incredibly excited to see NVIDIA leading efforts developing the open-source Machine Learning ecosystem,” said Yunsup Lee, CTO/co-founder of SiFive and co-inventor of RISC-V.

SiFive first demonstrated NVDLA running on the SiFive Freedom platform a year ago, and the new performance-optimized open-source NVDLA compiler further enables SiFive to create domain-specific optimized SoC designs ready for the modern compute needs of AI in the IoT Edge.