4/16 Permutation Invariant and Equivariant Linear Layers

Permutation Equivariant Linear Layers

Note that in a deep set, the features of the set elements interact only once, through the pooling map. We can enable richer interaction between the set element features if we use an MLP with permutation-equivariant layers.

If the incoming data has sequence length $s$, then a hidden layer in our model is of the form $$ \mathbf R^{s\times d^{(\ell)}}\xrightarrow{A^{(\ell)}}\mathbf R^{s\times d^{(\ell+1)}}\xrightarrow a\mathbf R^{s\times d^{(\ell+1)}}, $$ where, as before:

We denote by $A^{(\ell)}$ an affine map $A^{(\ell)}(X)=W^{(\ell)}(X)+B^{(\ell)}$, and
We denote by $a$ the activation functions, acting componentwise.

The matrix spaces have an $S_s$-action by permuting the rows.

Theorem 1 [1, Example II.2, modified]. The map $A^{(\ell)}$ is $S_s$-equivariant if and only if the following two conditions hold:

We have

$$ W^{(\ell)}(X)=XW^{(\ell)}_1 + \frac{1}{s}(\boldsymbol1\boldsymbol1^T)XW^{(\ell)}_2 $$ for matrices $W^{(\ell)}_1,W^{(\ell)}_2\in\mathbf R^{d^{(\ell)}\times d^{(\ell+1)}}$, where $\boldsymbol1\in\mathbf R^s$ is the all-1 vector.
The bias matrix $B^{(\ell)}$ is constant along the sequence dimension, that is, $B^{(\ell)}=\boldsymbol1(\mathbf b^{(\ell)})^T$ for a vector $\mathbf b^{(\ell)}\in\mathbf R^{d^{(\ell + 1)}}$.

Note that besides the mean operation, the map does not depend on the number of elements $s$. Thus, we can use the same map for sequences of different size.
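The layer of Theorem 1 can be sketched in a few lines of NumPy. This is only an illustration under assumed names: `equivariant_affine`, `check_equivariance` and the random weights are not from the lecture. The product $\frac1s(\boldsymbol1\boldsymbol1^T)X$ is computed as the row mean of $X$, broadcast back along the sequence dimension.

```python
import numpy as np

def equivariant_affine(X, W1, W2, b):
    """Compute X W1 + (1/s) 1 1^T X W2 + 1 b^T for X of shape (s, d_in)."""
    # (1/s) 1^T X is the mean over the sequence dimension, shape (1, d_in).
    mean_row = X.mean(axis=0, keepdims=True)
    # Broadcasting repeats the pooled term and the bias along all s rows.
    return X @ W1 + mean_row @ W2 + b

# Hypothetical sizes and random parameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out = 3, 4
W1 = rng.normal(size=(d_in, d_out))
W2 = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

def check_equivariance(s):
    """Compare 'permute rows, then apply' with 'apply, then permute rows'."""
    X = rng.normal(size=(s, d_in))
    perm = rng.permutation(s)
    permute_first = equivariant_affine(X[perm], W1, W2, b)
    permute_last = equivariant_affine(X, W1, W2, b)[perm]
    return np.allclose(permute_first, permute_last)

# The same parameters are reused for several sequence lengths.
results = [check_equivariance(s) for s in (2, 5, 9)]
```

Since the parameters have shapes that do not involve $s$, the same `W1`, `W2`, `b` are applied to sequences of length 2, 5 and 9 in the check above.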

Question

How does the existence of such a construction, with multiple interactions between set element features, not contradict Theorem 1 of the previous lecture?

Invariant and Equivariant Graph Networks

Note that so far, we considered the simplest way a permutation group $S_s$ can act on a tensor: by reordering the entries along a dimension. This way, we can view the feature tensor as a set of feature vectors, that is, as a collection in which the order of the subtensors along the sequence dimension does not matter.

Now, we shall generalize the parameter sharing approach to graphs and hypergraphs. Note that a graph on $n$ vertices can be given by its adjacency matrix $A\in\mathbf R^{n\times n}$. If the graph also has edge features, it can be given by a 3-tensor $A\in\mathbf R^{n\times n\times d}$. Similarly, a labeled $r$-hypergraph can be given by an $(r+1)$-tensor $A\in\mathbf R^{n^r\times d}$.

For example, a point cloud $X=\{x_1,\dotsc,x_n\}$ in a metric space can be given by its distance matrix $A_{ij}=d(x_i,x_j)$.

Note that the description by an adjacency matrix $A$ is redundant: it depends on an ordering of the vertices, whereas the graph itself does not. This is expressed by the following $S_n$-action: for $\pi\in S_n$, permuting the element order, we get the adjacency matrix $(A^\pi)_{ij}=A_{\pi(i),\pi(j)}$.
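As a small illustration of this action, the following sketch relabels the vertices of a path graph. The helper `permute_adjacency` and the example graph are assumptions made for the demonstration, and vertices are indexed from 0 rather than 1.

```python
import numpy as np

def permute_adjacency(A, pi):
    """Return A^pi with (A^pi)[i, j] = A[pi[i], pi[j]]."""
    return A[np.ix_(pi, pi)]

# Adjacency matrix of the path graph 0 - 1 - 2 - 3 (hypothetical example).
A = np.zeros((4, 4), dtype=int)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1

pi = np.array([2, 0, 3, 1])
A_pi = permute_adjacency(A, pi)

# The two matrices differ entrywise, although they describe the same graph.
same_matrix = np.array_equal(A, A_pi)
# Quantities that depend only on the graph agree for both matrices.
same_edge_count = A.sum() == A_pi.sum()
same_degrees = sorted(A.sum(axis=1)) == sorted(A_pi.sum(axis=1))
```

Here `same_matrix` is false while the edge count and the sorted degree sequence coincide, which is the redundancy that an invariant model should ignore.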

In what follows, we will recount the main result of [2], which gives an orthogonal basis of the vector space of invariant linear maps $\mathbf R^{n^r}\to\mathbf R$. This includes the equivariant case, as the space of equivariant linear maps $\mathbf R^{n^r}\to\mathbf R^{n^s}$ is isomorphic to the space of invariant linear maps $\mathbf R^{n^{r+s}}\to\mathbf R$. Moreover, the space of linear maps $\mathbf R^{n^r}\to\mathbf R$ is isomorphic to the space $\mathbf R^{n^r}$ of tensors, and under this isomorphism, invariant maps correspond to invariant tensors.

Notation 2. For multi-indices $\mathbf a, \mathbf b\in[n]^\ell$, we write $\mathbf a\sim\mathbf b$ if the following condition holds:

\[ a_i=a_j\text{ if and only if }b_i=b_j\text{ for }i,j\in[\ell]. \]

Proposition 3. The relation $\sim$ is an equivalence relation on the set $[n]^\ell$ of multi-indices. Moreover, the map

\[ ([n]^\ell/\sim)\xrightarrow{[\mathbf a]\mapsto\{\{j\in[\ell]:a_i=a_j\}:i\in[\ell]\}}\{\text{Partitions of }[\ell]\} \]

is well defined and injective. It is surjective if and only if we have $n\ge\ell$.
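The map of Proposition 3 can be made concrete with a short sketch. The function names `equality_pattern`, `equivalent` and `count_classes` are hypothetical, and both index positions and index values are counted from 0.

```python
from itertools import product

def equality_pattern(a):
    """Partition of the positions {0, ..., len(a)-1} by equal entries of a."""
    blocks = {}
    for position, value in enumerate(a):
        blocks.setdefault(value, []).append(position)
    return frozenset(frozenset(block) for block in blocks.values())

def equivalent(a, b):
    """a ~ b holds exactly when both multi-indices induce the same partition."""
    return equality_pattern(a) == equality_pattern(b)

def count_classes(n, ell):
    """Number of equivalence classes of multi-indices in [n]^ell."""
    return len({equality_pattern(a) for a in product(range(n), repeat=ell)})
```

For instance, `equivalent((1, 1, 2), (3, 3, 0))` holds, while `(1, 1, 2)` and `(1, 2, 2)` are not equivalent. With $n=2$ and $\ell=3$, the partition into three singletons is not reached, so `count_classes(2, 3)` is 4 rather than the 5 partitions of a three-element set.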

Notation 4. Given an equivalence class $\gamma\in[n]^\ell/\sim$, let the tensor $B^\gamma\in\mathbf R^{n^\ell}$ be defined by

\[ B^\gamma_\mathbf a=\begin{cases} 1 & \mathbf a\in\gamma, \\ 0 & \text{else,} \end{cases}\quad\text{ for }\mathbf a\in[n]^\ell. \]

Theorem 5 [2, Proposition 2]. The maps $\mathbf R^{n^\ell}\to\mathbf R$ given by the tensors $B^\gamma$, for $\gamma\in[n]^\ell/\sim$, form an orthogonal basis of the space of invariant linear maps. In particular, if we have $n\ge\ell$, then the dimension of this space is the Bell number $b(\ell)$, the number of partitions of the set $[\ell]$.
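For small $n$ and $\ell$, the tensors $B^\gamma$ can be enumerated by brute force. The sketch below uses assumed helper names (`basis_tensors`, `is_invariant`) and checks only the count, the pairwise orthogonality and the invariance of the tensors. It does not verify that they span the whole invariant space.

```python
import numpy as np
from itertools import product, permutations

def basis_tensors(n, ell):
    """Return the 0/1 indicator tensors B^gamma, one per equality pattern."""
    tensors = {}
    for a in product(range(n), repeat=ell):
        # Group the positions of a by their value to get the class of a.
        blocks = {}
        for position, value in enumerate(a):
            blocks.setdefault(value, []).append(position)
        gamma = frozenset(frozenset(block) for block in blocks.values())
        if gamma not in tensors:
            tensors[gamma] = np.zeros((n,) * ell)
        tensors[gamma][a] = 1.0
    return list(tensors.values())

def is_invariant(B):
    """Check B[pi(a_1), ..., pi(a_ell)] == B[a_1, ..., a_ell] for all pi."""
    n, ell = B.shape[0], B.ndim
    return all(
        np.array_equal(B, B[np.ix_(*([list(pi)] * ell))])
        for pi in permutations(range(n))
    )

n, ell = 3, 3
basis = basis_tensors(n, ell)
# Gram matrix of the tensors under the entrywise inner product.
gram = np.array([[np.sum(B1 * B2) for B2 in basis] for B1 in basis])
```

For $n=\ell=3$ this yields $b(3)=5$ tensors. The Gram matrix is diagonal, with the class sizes $|\gamma|$ on the diagonal, so the basis is orthogonal but not normalized.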

References

[1] Siamak Ravanbakhsh, Jeff Schneider and Barnabás Póczos: Equivariance Through Parameter-Sharing. Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, vol. 70, pp. 2892--2901.

[2] Haggai Maron, Heli Ben-Hamu, Nadav Shamir and Yaron Lipman: Invariant and Equivariant Graph Networks. International Conference on Learning Representations (ICLR), 2019.