New AI technique speeds up language models on edge devices

On May 30, 2020

Google’s Transformer is widely used in natural language processing (and even some computer vision) tasks because of its cutting-edge performance. Nevertheless, Transformers remain challenging to deploy on edge devices because of their computation cost; on a Raspberry Pi, translating a sentence of only 30 words requires 13 gigaflops (13 billion floating-point operations) and takes 20 seconds. This limits the architecture’s usefulness for developers and companies integrating language AI with mobile apps and services.

The researchers’ solution, Hardware-Aware Transformers (HAT), employs neural architecture search (NAS), a method for automating AI model design. HAT searches for edge device-optimized Transformers by first training a Transformer “supernet” (the SuperTransformer) containing many sub-Transformers. These sub-Transformers are trained simultaneously, so that a sub-Transformer’s performance within the supernet serves as a relative approximation of how that architecture would perform if trained from scratch. In the last step, HAT conducts an evolutionary search to find the best sub-Transformer, given a hardware latency constraint.
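For readers who want a feel for that last step, here is a minimal Python sketch of an evolutionary search under a latency limit. It is a toy illustration, not HAT's code: the search space, the `estimate_latency_ms` and `estimate_quality` functions, and every parameter value are hypothetical stand-ins for the real latency feedback and the supernet-based evaluation.

```python
import random

# Hypothetical search space, chosen for illustration only.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4, 5, 6],
    "embed_dim": [256, 512, 640],
    "ffn_dim": [512, 1024, 2048],
}


def estimate_latency_ms(arch):
    """Toy stand-in for latency feedback from the target device."""
    return arch["layers"] * (arch["embed_dim"] + arch["ffn_dim"]) / 100.0


def estimate_quality(arch):
    """Toy stand-in for scoring a sub-network with supernet weights."""
    return (arch["layers"] * 2.0
            + arch["embed_dim"] / 256.0
            + arch["ffn_dim"] / 1024.0)


def random_arch(rng):
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}


def mutate(arch, rng, prob=0.3):
    """Resample each choice independently with probability `prob`."""
    return {key: rng.choice(SEARCH_SPACE[key]) if rng.random() < prob else value
            for key, value in arch.items()}


def crossover(a, b, rng):
    """Pick each choice from one of the two parents at random."""
    return {key: rng.choice([a[key], b[key]]) for key in a}


def evolutionary_search(latency_limit_ms, generations=20,
                        population_size=30, num_parents=10, seed=0):
    rng = random.Random(seed)

    def feasible(arch):
        return estimate_latency_ms(arch) <= latency_limit_ms

    # Seed the population with random candidates that meet the constraint.
    population = []
    for _ in range(10_000):
        candidate = random_arch(rng)
        if feasible(candidate):
            population.append(candidate)
        if len(population) == population_size:
            break
    if not population:
        raise ValueError("no architecture satisfies the latency constraint")

    for _ in range(generations):
        # Keep the best candidates, then refill with feasible offspring.
        parents = sorted(population, key=estimate_quality, reverse=True)[:num_parents]
        children = []
        while len(children) < population_size - len(parents):
            if rng.random() < 0.5:
                child = mutate(rng.choice(parents), rng)
            else:
                child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if feasible(child):
                children.append(child)
        population = parents + children

    return max(population, key=estimate_quality)


if __name__ == "__main__":
    best = evolutionary_search(latency_limit_ms=40.0)
    print(best, estimate_latency_ms(best))
```

Candidates that exceed the latency limit are simply discarded, so every architecture the search returns respects the constraint; tightening `latency_limit_ms` pushes the result toward smaller models.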

To test HAT’s efficiency, the coauthors conducted experiments on four machine translation tasks with training sets ranging from 160,000 to 43 million sentence pairs. They ran each model on a Raspberry Pi 4, an Intel Xeon E5-2640, and an Nvidia Titan Xp graphics card, measuring the latency 300 times, discarding the fastest and slowest 10% of measurements, and averaging the remaining 80%.
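The measurement procedure described above amounts to a trimmed mean. A minimal Python sketch follows, assuming latency is timed with `time.perf_counter`; the function names `trimmed_mean` and `measure_latency` are illustrative, not taken from the researchers' code.

```python
import time
from statistics import mean


def trimmed_mean(samples, trim_percent=10):
    """Average the samples after dropping the lowest and highest trim_percent."""
    ordered = sorted(samples)
    k = len(ordered) * trim_percent // 100  # number of samples cut from each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


def measure_latency(fn, runs=300, trim_percent=10):
    """Time fn() `runs` times and return the trimmed-mean latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return trimmed_mean(samples, trim_percent)
```

With 300 runs and a 10% trim, the 30 fastest and 30 slowest timings are dropped and the remaining 240 are averaged, which keeps one-off stalls or unusually fast runs from skewing the result.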

According to the team, the models identified through HAT not only achieved lower latency across all hardware than a conventionally trained Transformer, but scored higher on the popular BLEU language benchmark after 184 to 200 hours of training on a single Nvidia V100 graphics card. Compared to Google’s recently proposed Evolved Transformer, one model was 3.6 times smaller with a whopping 12,041 times lower search cost and no performance loss.

“To enable low-latency inference on resource-constrained hardware platforms, we propose to design [HAT] with neural architecture search,” the coauthors wrote, noting that HAT is available in open source on GitHub. “We hope HAT can open up an avenue towards efficient Transformer deployments for real-world applications.”