Bandits, WebAssembly, and IoT. An uncommon combination allows… | by Maurits Kaptein | Nov, 2020


A part of the Bootstrap Thompson Sampling algorithm (image by author).

An uncommon combination allows efficient sequential learning on the edge


The multi-armed bandit (MAB) problem is a relatively simple (to describe, that is, not to solve) formalization of a sequential learning problem that has many applications. In its canonical form, a gambler faces a slot machine (colloquially called a one-armed bandit) with k > 1 arms (hence the term multi-armed bandit). Given a (finite) amount of coins, the gambler sequentially chooses which arm to play — she inserts a coin and pulls the arm — after each play observing some reward (e.g., a win or a loss — typically not winning coins that can be re-used). As the gambler one by one squanders her coins, she hopes to attain as high a cumulative reward as possible. This is tricky since each arm is assumed to have a different (static) probability of paying out. A successful strategy for playing the machine’s arms needs to, on the one hand, explore the options (i.e., the gambler needs to play the arms whose expected rewards she is uncertain about to learn more) and, on the other hand, exploit the available knowledge (i.e., play the arm with the highest expected pay-off as often as possible).
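The setup above can be sketched as a tiny simulation. Everything in this snippet — the class name, the payout probabilities, the seed — is my own illustration, not part of the demo discussed later:

```python
import random

class BernoulliBandit:
    """A k-armed slot machine: arm i pays out 1 with a fixed
    (static) probability probs[i], and 0 otherwise."""

    def __init__(self, probs, seed=42):
        self.probs = probs
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Insert a coin, pull `arm`, and observe a win (1) or loss (0)."""
        return 1 if self.rng.random() < self.probs[arm] else 0

# The gambler does NOT know these payout probabilities (arm 2 is best):
bandit = BernoulliBandit([0.2, 0.5, 0.7])
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # mean reward; should hover around 0.7
```

A policy's job is to find the best arm without knowing `probs`, spending as few coins as possible on the inferior arms along the way.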

The simple gambling setup above matches, sometimes somewhat loosely, many applied problems. Consider the following alternatives for the terms in bold above:

  • Gambler, medical doctor, online marketeer, e-commerce manager, …

Because of this generality, the bandit problem formalization has been used to study many applied problems. Furthermore, effective strategies (called policies in the bandit literature; formally, mappings from the historical data to the next arm) for addressing bandit problems are used in many applications.

Before discussing how we implemented effective bandit policies on edge devices using WebAssembly, it is good to explore a few possible bandit policies:

  1. Just choose one: One simple, but not very effective, strategy is to just choose the arm you think is best and play it all the time. This policy does not have a formal name in the bandit literature, but it is common in practice: online marketeers, for example, often simply create a single ad based on their own expert knowledge without doing any experimentation.

While both policies 4 and 5 above can be shown to be asymptotically optimal — depending a bit on the exact bandit setup and the bounds involved — even policies 2 and 3 can be extremely useful and generally outperform policy 1 by a wide margin. For a comprehensive overview and analysis of bandit algorithms, see this amazing book.
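To make one such exploration/exploitation trade-off concrete, here is a minimal sketch of e-greedy (the policy used in the demo later in this post): with probability epsilon pick a random arm, otherwise play the arm with the highest observed mean reward. The function signature is my own illustration:

```python
import random

def epsilon_greedy(counts, sums, epsilon=0.1, rng=random):
    """Pick the next arm given, per arm, the number of plays (`counts`)
    and the summed rewards observed so far (`sums`)."""
    k = len(counts)
    if rng.random() < epsilon:
        return rng.randrange(k)  # explore: a uniformly random arm
    # exploit: the best observed mean; untried arms are tried first
    means = [sums[i] / counts[i] if counts[i] else float("inf") for i in range(k)]
    return max(range(k), key=means.__getitem__)

# With epsilon=0 the choice is purely greedy:
print(epsilon_greedy([5, 5, 5], [1.0, 3.0, 4.0], epsilon=0.0))  # prints 2
```

Even this tiny amount of forced exploration is what separates e-greedy from the "just choose one" baseline: it keeps gathering evidence about arms that currently look inferior.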

Bandit policies are often implemented on servers: the gambler effectively lives in a single “location”. For example, the sequential selection of ads to maximize CTR is often centrally coordinated. However, there are various applications in which one would like to deploy a bandit policy on an edge device: for example, the problem of selecting the fastest route through a network can often be formalized as a bandit problem.

To implement bandit policies in a software system, one effectively needs to implement two functions:

  1. An action assignment function: When called, this function returns the next arm to pull.

  2. An update function: When called with the arm played and the observed reward, this function updates the stored state of the policy.

This is detailed quite clearly in this recent Journal of Statistical Software paper.
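To make the two-function split concrete, below is a sketch of Bootstrap Thompson Sampling (the algorithm pictured at the top of this post) organized around exactly these two calls. The class and parameter names are my own, and the online-bootstrap details follow the common formulation of the algorithm rather than any specific implementation:

```python
import random

class BootstrapThompson:
    """Bootstrap Thompson Sampling expressed as the two functions above:
    keep J bootstrap replicates of each arm's mean-reward estimate;
    `assign` draws one replicate at random and plays its best arm;
    `update` lets each replicate see the new observation with
    probability 1/2 (an online bootstrap)."""

    def __init__(self, k, J=100, seed=None):
        self.rng = random.Random(seed)
        # Per replicate, per arm: a pseudo-count of plays and the summed
        # rewards. Random initial sums make the replicates disagree,
        # which drives the initial exploration.
        self.counts = [[1] * k for _ in range(J)]
        self.sums = [[self.rng.random() for _ in range(k)] for _ in range(J)]

    def assign(self):
        """Action assignment: argmax of one randomly drawn replicate."""
        j = self.rng.randrange(len(self.counts))
        means = [s / c for s, c in zip(self.sums[j], self.counts[j])]
        return max(range(len(means)), key=means.__getitem__)

    def update(self, arm, reward):
        """Update: each replicate sees the observation with probability 1/2."""
        for j in range(len(self.counts)):
            if self.rng.random() < 0.5:
                self.counts[j][arm] += 1
                self.sums[j][arm] += reward
```

The appeal of this split is that the state (here `counts` and `sums`) is small and explicit, so the two functions can live anywhere the state can be stored — including on an edge device.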

Perhaps surprisingly, both functions are easy to implement — even in a modular way — on edge devices. Using WebAssembly, both steps can be implemented efficiently and deployed effectively on virtually any edge device. We recently used the Scailable platform to provide a small demo for one of our clients: we implemented e-greedy in the standard exposed predict function of a Scailable binary (see this tutorial), and we used orthogonal persistence to implement the update function such that the state of the machine could be updated between calls; see this Towards Data Science post for details. Iteratively calling the predict and update functions allowed us to run a bandit policy on an ESP32 MCU. Pretty neat!
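The predict/update cycle can be pictured as follows: the policy state is serialized between calls, much as orthogonal persistence preserves a WebAssembly module's memory between invocations. Everything below — the JSON state format, the function signatures — is an illustrative stand-in I made up for this sketch, not the actual Scailable interface:

```python
import json
import random

def predict(state_json, epsilon=0.1, rng=random):
    """Stand-in for the exposed predict function: e-greedy over the
    persisted per-arm counts and reward sums."""
    state = json.loads(state_json)
    k = len(state["counts"])
    if rng.random() < epsilon or sum(state["counts"]) == 0:
        return rng.randrange(k)  # explore (or no data yet)
    means = [s / c if c else 0.0 for s, c in zip(state["sums"], state["counts"])]
    return max(range(k), key=means.__getitem__)

def update(state_json, arm, reward):
    """Stand-in for the update function: fold the observed reward into
    the state and return it for re-persisting."""
    state = json.loads(state_json)
    state["counts"][arm] += 1
    state["sums"][arm] += reward
    return json.dumps(state)

# One predict/update cycle, with the state surviving "between calls":
state = json.dumps({"counts": [0, 0], "sums": [0.0, 0.0]})
arm = predict(state)
state = update(state, arm, reward=1)
```

Because each call receives the state and hands it back, the policy itself stays stateless between invocations — which is what makes the pattern a natural fit for a small binary running on an MCU.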

The bandit problem formalization provides an extremely useful model for many real-world sequential learning problems. Furthermore, the idea of splitting up a bandit policy into two simple function calls is both simple and powerful and can easily be extended to implement contextual bandit policies and batched policies. Finally, doing so in WebAssembly binaries allows for the efficient deployment and monitoring of bandit policies on various edge devices. This is pretty cool since, although rudimentary, we can now have each little chip learn from its direct interactions with its environment!

It’s good to note my own involvement here: I am a professor of Data Science at the Jheronimus Academy of Data Science and one of the cofounders of Scailable. Thus, no doubt, I have a vested interest in Scailable; I have an interest in making it grow such that we can finally bring AI to production and deliver on its promises. The opinions expressed here are my own.


