
Facebook’s AutoScale decides whether AI inference is performed on your phone or in the cloud

In a technical paper published this week on Arxiv.org, researchers from Facebook and Arizona State University opened the hood on AutoScale, which shares a name with Facebook’s energy-sensitive load balancer. AutoScale, which could theoretically be used by any company if the code were made publicly available, uses AI to enable energy-efficient inference on smartphones and other edge devices.

Many AI models run on smartphones (in Facebook’s case, the models behind 3D photos and other such features), but without fine-tuning they can reduce battery life and performance. Deciding whether AI should run on the device, in the cloud, or in a private cloud is therefore important not only for end users but also for the companies that develop the AI. Data centers are expensive and require an internet connection, so having AutoScale automate deployment decisions could result in significant cost savings.

With each inference execution, AutoScale observes the current execution state, including the architectural characteristics of the algorithm and runtime variances (such as Wi-Fi, Bluetooth, and LTE signal strength; processor utilization, voltage, and frequency scaling; and memory usage). It then uses a lookup table to select the hardware (processors, graphics cards, or co-processors) expected to maximize energy efficiency while meeting quality-of-service and inference targets. (The table holds the accumulated rewards from previous selections, the values that steer AutoScale’s underlying models toward their goals.) Next, AutoScale executes the inference on the selected hardware target while observing the outcome, including energy, latency, and inference accuracy. Based on this, the system calculates a reward indicating how much the hardware selection improved efficiency, and then updates the table.
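The observe, select, execute, reward, and update cycle described above can be sketched as a tabular Q-learning loop, one common way to keep a lookup table of accumulated rewards. This is a minimal sketch under stated assumptions: the action names in `ACTIONS`, the state bucketing in `discretize`, the reward shaping in `reward`, and the hyperparameters `alpha`, `gamma`, and `epsilon` are all hypothetical illustrations, not AutoScale’s published code, and the paper’s exact update rule may differ.

```python
import random
from collections import defaultdict

# Hypothetical execution targets (illustrative names, not from the paper).
ACTIONS = ["cpu", "gpu", "dsp", "connected_device", "cloud"]


class LookupTableScheduler:
    """Tabular Q-learning sketch: maps a discretized state to a value per action."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate (hypothetical value)
        self.gamma = gamma      # discount factor (hypothetical value)
        self.epsilon = epsilon  # exploration probability (hypothetical value)
        # Unseen states start with a value of 0.0 for every action.
        self.q = defaultdict(lambda: {a: 0.0 for a in self.actions})
        self.rng = random.Random(seed)

    def select(self, state):
        # Epsilon-greedy: occasionally explore, otherwise pick the highest-valued action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        row = self.q[state]
        return max(row, key=row.get)

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update toward reward + discounted best next-state value.
        best_next = max(self.q[next_state].values())
        old = self.q[state][action]
        self.q[state][action] = old + self.alpha * (reward + self.gamma * best_next - old)


def discretize(features):
    # Hypothetical bucketing of continuous runtime features into a table key.
    return (
        round(features["cpu_util"], 1),
        "strong" if features["wifi_dbm"] > -60 else "weak",
        features["model"],
    )


def reward(energy_mj, latency_ms, accuracy, latency_target_ms=50.0, accuracy_target=0.9):
    # Hypothetical reward: lower energy is better; missing a QoS or accuracy target is penalized.
    if latency_ms > latency_target_ms or accuracy < accuracy_target:
        return -100.0
    return -energy_mj
```

In use, each inference would call `select`, run the model on the chosen target, compute `reward` from the measured energy, latency, and accuracy, and then call `update`. The epsilon-greedy choice keeps a small amount of exploration so the table can keep adapting as runtime conditions shift.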

As the researchers explain, AutoScale uses reinforcement learning to learn a policy for choosing the best action in a given state, based on the accumulated rewards. For a processor, for example, the system calculates the reward with a utilization-based model that assumes that (1) processor cores consume a variable amount of energy; (2) cores spend time in busy and idle states; and (3) energy consumption differs between these states. By contrast, when inference is scaled out to a connected system such as a data center, AutoScale calculates the reward with a signal strength-based model that accounts for transmission latency and the power consumed by the network.
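The two reward models can be illustrated with simple energy formulas. In this minimal sketch, processor energy is the per-core sum of busy power times busy time plus idle power times idle time, and offload energy is radio transmit power times transmission time, plus the phone’s idle power while it waits for the server. The `CoreSample` fields, the linear signal-strength mappings in `tx_power_w` and `tx_latency_s`, the idle-wait term, and every numeric constant are hypothetical assumptions for illustration, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class CoreSample:
    busy_s: float        # time the core spent busy during the inference
    idle_s: float        # time the core spent idle
    busy_power_w: float  # hypothetical busy power at the current frequency
    idle_power_w: float  # hypothetical idle power


def processor_energy_j(cores):
    # Utilization-based model: per-core busy energy plus idle energy, summed over cores.
    return sum(c.busy_power_w * c.busy_s + c.idle_power_w * c.idle_s for c in cores)


def _weakness(signal_dbm):
    # Hypothetical: 0.0 at -50 dBm (strong) rising linearly to 1.0 at -90 dBm (weak), clamped.
    return min(max((-50.0 - signal_dbm) / 40.0, 0.0), 1.0)


def tx_power_w(signal_dbm):
    # Hypothetical: the radio draws 0.5 W at strong signal, up to 2.0 W at weak signal.
    return 0.5 + _weakness(signal_dbm) * 1.5


def tx_latency_s(payload_bytes, signal_dbm):
    # Hypothetical throughput: 40 Mbit/s at strong signal, down to 5 Mbit/s at weak signal.
    mbps = 40.0 - _weakness(signal_dbm) * 35.0
    return payload_bytes * 8 / (mbps * 1e6)


def offload_energy_j(payload_bytes, signal_dbm, server_latency_s, idle_power_w=0.3):
    # Signal strength-based model: transmit energy plus idle energy while awaiting the result.
    t_tx = tx_latency_s(payload_bytes, signal_dbm)
    return tx_power_w(signal_dbm) * t_tx + idle_power_w * server_latency_s
```

Under these assumptions, a weaker signal raises both transmit power and transmission time, so offloading becomes more expensive as signal strength drops; a scheduler could compare this figure against `processor_energy_j` for the on-device options when computing rewards.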


To validate AutoScale, the paper’s co-authors ran experiments on three smartphones, each measured with a power meter: the Xiaomi Mi 8 Pro, the Samsung Galaxy S10e, and the Motorola Moto X Force. To simulate cloud inference, they connected the handsets to a server via Wi-Fi, and they simulated local execution with a Samsung Galaxy Tab S6 tablet connected to the phones via Wi-Fi Direct (a wireless peer-to-peer network).

After training AutoScale by running inference 100 times (resulting in 64,000 training samples) and compiling and generating 10 executables with popular AI models, including Google’s MobileBERT (a machine translator) and Inception (an image classifier), the team ran tests in a static setting (with consistent processor utilization, memory usage, and signal strength) and a dynamic setting (with a web browser and a music player running in the background, plus signal interference). Three scenarios were devised for each:

  • A non-streaming computer vision scenario, in which a model ran inference on a photo taken with the phone’s camera.
  • A streaming computer vision scenario, in which a model ran inference on real-time video from the camera.
  • A translation scenario, in which a model translated a sentence typed on the keyboard.

The team reports that AutoScale outperformed the baseline in all scenarios while maintaining low latency (under 50 milliseconds in the non-streaming computer vision scenario and under 100 milliseconds in the translation scenario) and high performance (about 30 frames per second in the streaming computer vision scenario). Overall, this amounted to a 1.6- to 9.8-fold improvement in energy efficiency, with 97.9% prediction accuracy and real-time performance.

In addition, AutoScale required only 0.4 MB of memory, which corresponds to 0.01% of the 3 GB of RAM in a typical midrange smartphone. “We show that AutoScale is a viable solution and will pave the way forward by enabling future work to improve energy efficiency for DNN edge inference in a variety of realistic execution environments,” the co-authors wrote.
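The 0.01% figure can be checked with a quick calculation, assuming binary units (1 GB = 1,024 MB); decimal units (1 GB = 1,000 MB) give essentially the same result:

$$\frac{0.4\ \text{MB}}{3\ \text{GB}} = \frac{0.4\ \text{MB}}{3072\ \text{MB}} \approx 1.3 \times 10^{-4} \approx 0.013\%$$

which rounds to the 0.01% cited in the paper.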
