4/11 Deep Sets Review

Last time, we started discussing learning on sequences. More precisely, we denote by \(\mathscr X_0\) the elementwise feature space, which can be either:

  1. a discrete set: \(\mathscr X_0\cong[n]\), sometimes called a vocabulary, or
  2. a subset of a Euclidean space: \(\mathscr X_0\subseteq\mathbf R^d\).

Then the feature space is the set \(L\mathscr X_0\) of finite sequences of entries in \(\mathscr X_0\). We denote by \(L_m\mathscr X_0\subseteq L\mathscr X_0\) the subset of finite sequences of \(m\) elements.

The permutation group \(S_m\) acts on the subset \(L_m\mathscr X_0\) by permuting the entries of a sequence.
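
To make the action explicit, here is one common way to write it. The index convention (using \(\sigma^{-1}\) so that this is a left action) and the symbol \(F\) for a generic function on \(L_m\mathscr X_0\) are assumptions, not notation fixed in the lecture:

$$ \sigma\cdot(\mathbf x_1,\dotsc,\mathbf x_m)=\left(\mathbf x_{\sigma^{-1}(1)},\dotsc,\mathbf x_{\sigma^{-1}(m)}\right),\qquad\sigma\in S_m. $$

Under this convention, a function \(F\) on \(L_m\mathscr X_0\) is invariant when \(F(\sigma\cdot X)=F(X)\) for every \(\sigma\in S_m\) and every \(X\in L_m\mathscr X_0\).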

A Deep Set is a model architecture that is invariant under this permutation action. Given a sequence \(X=(\mathbf x_1,\dotsc,\mathbf x_m)\in L_m\mathscr X_0\), the model output is defined by

$$ f_\theta(X)=h_\xi\left(p\left(g_\phi(\mathbf x_1),\dotsc,g_\phi(\mathbf x_m)\right)\right), $$ where:

  1. The elementwise map \(g_\phi\) is:
    1. a lookup \(i\mapsto\mathbf v_i\) into a collection of word vectors \(\{\mathbf v_i\in\mathbf R^\ell:i\in\mathscr X_0\}\), if the elementwise feature space \(\mathscr X_0\) is discrete, and
    2. an MLP \(\mathscr X_0\to\mathbf R^\ell\) if \(\mathscr X_0\) is continuous,
  2. The function \(\mathbf R^{\ell\times m}\xrightarrow p\mathbf R^\ell\) is a permutation-invariant pooling function applied across the \(m\) columns, usually the coordinatewise mean, max, min or sum, and
  3. The invariant map \(h_\xi\) is an MLP \(\mathbf R^\ell\to\mathbf R^k\).
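
The three pieces above can be sketched in a few lines of NumPy. This is a minimal illustration with random, untrained weights, not a reference implementation. The helper names (`init_mlp`, `mlp`, `deep_set`), the hidden width of 16, the ReLU activation, and the toy sizes are all assumptions made for the example. Note that the code stores the embedded sequence as an \(m\times\ell\) array (one row per element) rather than \(\ell\times m\), and indexes the vocabulary from 0.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply an MLP along the last axis of x, with ReLU between layers."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Permutation-invariant pooling choices p, reducing over the m elements.
POOLS = {"sum": np.sum, "mean": np.mean, "max": np.max, "min": np.min}

def deep_set(g, xi, X, pool="sum"):
    """Compute h_xi(p(g(x_1), ..., g(x_m))) for a sequence X of m elements."""
    G = g(X)                          # shape (m, ell): elementwise map
    pooled = POOLS[pool](G, axis=0)   # shape (ell,): pooling over elements
    return mlp(xi, pooled)            # shape (k,): invariant map h_xi

rng = np.random.default_rng(0)
d, ell, k, m, n = 3, 8, 2, 5, 10      # toy sizes (assumed)

xi = init_mlp(rng, [ell, 16, k])      # parameters of h_xi

# Continuous case: g_phi is an MLP from R^d to R^ell.
phi = init_mlp(rng, [d, 16, ell])
g_cont = lambda X: mlp(phi, X)
X = rng.standard_normal((m, d))
perm = rng.permutation(m)
out = deep_set(g_cont, xi, X)
out_perm = deep_set(g_cont, xi, X[perm])   # same sequence, reordered

# Discrete case: g_phi is a table of n word vectors, indexed by token id.
E = rng.standard_normal((n, ell))
g_disc = lambda tokens: E[tokens]
tokens = np.array([4, 0, 2, 2, 7])
out_tok = deep_set(g_disc, xi, tokens, pool="mean")

print(np.allclose(out, out_perm))     # invariance check, up to float error
```

Because the pooling step reduces over the element axis, reordering the rows of `X` only reorders the rows of `G`, which the pooling ignores. The comparison uses `np.allclose` rather than exact equality because floating-point sums can differ slightly with the order of summation.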