# SoK: Attestation Aggregation Algorithms

## Table of Contents

**If you want to suggest sth, please, feel free to contribute to the ethresear.ch topic.**

## 1. Background

Let’s start from some relevant pre-requisite information:

- BLS signature scheme on which the whole aggregation process is based: it is precisely because we can produce compressed BLS signatures, aggregation of non-overlapping attestations is possible.
- There is an amazing note on Ethereum 2.0 attestation aggregation strategies which has the best coverage of current efforts on the problem of attestation aggregation. Strategies listed there are not essentially algorithms to merge/pack attestations, but more of a higher level approaches (which include changing of a way how validators’ overlay network is constructed and used).
- There’s also a 1-pager design of Better Aggregation Inclusion by Preston and Terence.

## 2. Objective

In order to increase profitability, attestations must be aggregated in a way to cover as many individual attestors as possible.

In terms of expected input/output here is what we have:

- Within the scope of this doc, an attestation is a bitlist of length equal to number of validators in a given committee, where each bit represents a single validator and there may be one or more of bits set per attestation.
- Array of attestations is represented by incidence matrix, where rows are attestation instances, and each column represents a validator inclusion status across all known attestations.
- Given a list of attestations, we are expected to aggregate all the non-overlapping ones, ideally having an aggregation which includes all the possible attesting participants.

```
1// Sample attestations:
2A0 = [
3 [0 0 0 1 0], // A
4 [0 0 1 0 0], // B
5 [0 1 0 1 0] // C
6]
7
8// List can be transformed into:
9A1 = [
10 [0 0 1 1 0], // A + B
11 [0 1 0 1 0] // C
12]
13
14// Or, even better:
15A2 = [
16 [0 0 0 1 0], // A
17 [0 1 1 1 0] // B + C
18]
```

At the moment, Prysm uses $O(mn^2)$ iterations to search for possible aggregation pairs: the $O(n^2)$ is for pairing attestations and another $O(m)$ to check for overlaps within each pair ($m$ is number of validators in a given committee).

It should be noted that, considering the fact that we are dealing with NP-Hard problem here, the runtime is not that bad, at all!

When it comes to performance of such a traversal (that’s how optimal the resultant aggregation is), then it solely depends on order in which attestations are presented: once a pair of non-overlapping attestations is located they are immediately merged, without any decision-making whether there’s a more profitable aggregation or not.

Here is a bit contrived example of such a suboptimal case:

```
1// Given the following attestations:
2A0 = [
3 [0 0 1 0 0],
4 [0 0 0 1 0],
5 [0 0 0 0 1],
6 [1 1 0 0 1],
7]
8
9// Optimal solution (merge: rows 1, 2 and 4; discard: row 3):
10A_OPT = [
11 [1 1 1 1 1],
12]
13
14// Solution produced by the current solver:
15A1 = [
16 [0 0 1 1 1],
17 [1 1 0 0 1],
18]
```

So, the main objective is to **find an approximation algorithm that will result in a solution
which is more close to the optimal one**.

In our problem, solutions are scored by their cardinality: the more participants we have within a single aggregated item the better, with the maximum possible size equal to all the validators in a given committee.

When it comes to run-time efficiency, the aim is to make sure that aggregation doesn’t become a bottleneck in node’s processing (it should be able to process thousands of attestations in sub-second runs). This will be tackled via benchmarking alternative approaches.

## 3. Formal Problem Statement

### 3.1. Definition

**Def (Attestation Aggregation): $AA(U, S, k) \to S'$**

Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let $S = {S_1, …, S_m | S_i \subseteq 2^U}$ be a collection of its subsets, where $\bigcup_{i = 1}^m S_i = U$ i.e. all $u \in U$ are present in one of elements of $S$.

Then, **Attestation Aggregation** (AA) is the problem of finding $S^\prime \subseteq S$ that
covers at least $k \in [1..n]$ elements from $U$, and sets in $S'$ are disjoint:
$$
|\bigcup\limits_{T \in S’}T| \ge k
$$
and
$$
S^\prime_i \cap S^\prime_j = \emptyset, \forall i, j \in [1..m], i \ne j
$$

Ideally, we want $\bigcup\limits_{T \in S’}T$ to have maximum cardinality, that’s $k = |U|$, and all $u \in U$ are covered by $S'$: $$ |\bigcup_{T \in S’}T| = |U| $$

Since BLS doesn’t allow merging overlapping signatures, there’s that additional constraint of making sure that all elements of $S'$ are pairwise disjoint.

To summarize: given a family of sets $S$, we need to find a subfamily of disjoint sets $S'$, which have the same (or close to same) union as the original family.

The problem is NP-Complete and only allows for logarithmic-factor polynomial-time approximation.

### 3.2. Comparison to Known Problems

#### 3.2.1. Attestation Aggregation (AA) vs Minimum Set Cover (SC)

In the MSC we have the very same input set system $(U, S)$, but our target $S'$ is a bit different: we want to find a full cover of $U$ with minimal $|S'|$.

With AA, partial (if still maximal) coverage is enough, there’s no constrains on cardinality of $S'$, and all elements of $S'$ are pairwise disjoint.

#### 3.2.2. Attestation Aggregation (AA) vs Exact Cover (EC)

Again, we start from the same set system $(U, S)$, and the EC matches the ideal case of our problem when there exists an optimal solution within a given input $S$. So, if input list of attestations form (by itself or as any combination of its subsets) a full partition of $U$, the resultant $S'$ for both EC and AA coincide.

There is on important difference in AA: it allows for partial covers.

#### 3.2.3. Attestation Aggregation (AA) vs Maximum Coverage (MC)

In the MC problem, we want to find up to $k$ subsets that cover $U$ maximally: $$ |S'| \le k \land \mathop{argmax}\limits_{S’} |\bigcup\limits_{T \in S’}T| $$

Important thing to note is that in its conventional form MC doesn’t require elements of $S'$ to be disjoint, which is a problem for our case – as overlapping attestations cannot be aggregated.

So, the important differences of AA include: no constraints on cardinality of $S'$, requirement of pairwise disjoint elements in $S'$.

MC can still be utilized for our purposes: since there exists an approximation algorithm with $\alpha \approx 0.6$ (pretty impressive) we can rely on it to build partial solution by gradually increasing $k$ (see the Possible Solutions section below).

### 3.3. Sample problem

Since $S’ \subseteq S$, when implementing the relation in code, the optimal solution will be represented as a list of indexes corresponding to selected items from $S$ (to avoid unnecessary copying):

$$ I = {i | i \in [1..m] \land S_i \in S^\prime} $$

To better illustrate the formal definition, here is a sample problem:

```
1// We're given list of attestations, in matrix incidence form.
2// Those attestations are essentially instances of S_i:
3S = [
4 [0 0 1 0 1 0], // participating bits are at columns: 3, 5
5 [0 1 0 0 0 0],
6 [0 0 0 0 1 0],
7 [1 0 0 0 0 0],
8 [0 0 1 0 0 1],
9]
10
11// Union of all *column indexes* where participation bit is set to 1,
12// for at least one attestation, represents the universe set, U:
13U = [1, 2, 3, 5] // w/i given list of attestations, there's no bit at column 4
14
15// We want to come up with a subcollection of attestations,
16// which are disjoint and represent maximum number of the validators
17// specified in initial list of attestations, S':
18I = [2, 3, 4, 5] // merge rows 2, 3, 4, 5
```

## 4. Possible Solutions

So, our problem is closely related to set cover kind of problems to which there exist several possible approaches, none of which enjoys having a deterministically optimal solution.

Several closely related NP/NP-hard problems (and their variants) have been considered:

The Exact Cover and Maximum Coverage problems seem to be the most relevant to what we have at hand, and with some twisting can be utilized to solve AA.

### 4.1. Set Cover

The Set Cover problem is one of Karp’s 21 NP-Complete problems.

It seems natural to start from the base covering problem because it serves as a foundation for other problems, it has a greedy algorithm solver with $ln(n)$ approximation to optimal, and with some effort we can even make that greedy solver run in a linear time!

**Def (Minimum Set Cover): $MSC(U, S) \to S'$**

Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let $S = {S_1, …, S_m | S_i \subseteq 2^U}$ be a collection of its subsets, where $\bigcup_{i = 1}^m S_i = U$.

Then, **Minimum Set Cover** (MSC) is the problem of covering
$U$ with a subset $S^\prime \subseteq S$ s.t $|S'|$ is minimal.

Framed like that, this problem doesn’t abstract attestation aggregation completely. While MSC produces a cover of $U$, $S'$ may contain subsets with overlapping elements from $U$, and as such can’t be used as input to aggregation function. So, we need to add an extra constraint – making sure that all elements in $S'$ are pairwise disjoint.

#### Relevant Works:

- Approximation Algorithms: Covering Problems by Chandra Chekuri
- Greedy Approximation: Set Cover short intro by Dave Mount
- github.com/guangtunbenzhu/SetCoverPy interesting implementation using a combination of greedy approach and Lagrangian relaxation.

**Def (Minimum Membership Set Cover): $MMSC(U, S, k) \to S'$**

The same set system as in MSC, with additional requirement on how many times each $u \in U$ can occur in elements of $S'$ i.e. $\mathop{max}\limits_{u \in U} |{T \in S'| u \in T}| \le k$, for a nonnegative $k \in [1..m]$.

#### Relevant Works:

- Minimum Membership Set Covering and the Consecutive Ones Property by Michael Dom, Jiong Guo, Rolf Niedermeier, Sebastian Wernicke
- Interference in Cellular Networks: The Minimum Membership Set Cover Problem by Fabian Kuhn, Pascal von Rickenbach, Roger Wattenhofer, Emo Welzl, and Aaron Zollinger
- CS261: Optimization Handout, A Linear Programming Relaxation of Set Cover by Luca Trevisan

Applicability of MMSC:

- Decision version of the problem (whether $S'$ exists) can be used to check for cover.
- When used as $MMSC(U, S, 1)$ i.e. limit number of occurrences of $u \in U$ to a single occurrence, we effectively transform problem to Exact Cover variant (which matches our ideal case exactly).

Another variant worth mentioning is Partial Set Cover, where we again are looking for $S'$ of minimal cardinality (just as we do in MSC) which covers at least $k$ elements from universe $U$.

**Def (Partial Set Cover): $PSC(U, S, k) \to S'$**

Consider the same set system as in MSC, with additional parameter $k \in [1..m]$.
Then, **Partial Set Cover** (PSC) is the problem of finding $S’ \subseteq S$ of minimal cardinality,
that covers at least $k$ elements of $U$.

#### Partial Set Cover (PSC) vs Maximum Coverage (MC)

PSC differs from Maximum Coverage problem in a subtle way: in the MC we limit number of subsets $|S'| \le k$ for maximum covered elements in $U$; in PSC we limit upper bound on how many items are covered $|\bigcup\limits_{T \in S’}T| \le k$ with $S'$ of minimal cardinality.

#### Relevant Works:

- On Approximating Partial Set Cover and Generalizations by Chandra Chekuri, Kent Quanrud, Zhao Zhang
- A Unified Approach to Approximating Partial Covering Problems by Jochen Konemann, Ojas Parekh, Danny Segev

Applicability of PSC:

- Again, decision version can be useful, to check the boundaries (gradually increasing $k$) of $S'$ existence. With $k = |U|$ we effectively have MSC problem. In order for PSC be really useful, we also need to constrain number of occurrences of $u \in U$ within $S'$ elements i.e. so that all subsets in $S'$ are pairwise disjoint.

### 4.2. Exact Cover

The Exact Cover problem is one of Karp’s 21 NP-Complete problems.

When exact cover exists within a given set system, the Exact Cover abstracts attestation aggregation perfectly. The problem is that perfectly non-overlapping partitions of $U$ are not naturally happening in our system (so making them happen can be an attack vector when solving the problem :question:).

**Def (Exact Cover): $EC(U, S) \to S'$**

Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let $S = {S_1, …, S_m | S_i \subseteq 2^U}$ be a collection of its subsets, where $\bigcup_{i = 1}^m S_i = U$.

Then, **Exact Cover** (EC) is the problem of covering
$U$ with a subset $S^\prime \subseteq S$ s.t $S^\prime_i \cap S^\prime_j = \emptyset, \forall i, j \in [1..m], i \ne j$.

This NP-Hard problem has a nondeterministic backtrack solver algorithm
(Algorithm X by D.Knuth).
The Algorithm X is capable of finding *all* the optimal solutions to the problem.

However, having such an $S$ that there exists a subcollection of pariwise disjoint subsets that cover $U$ completely is a rare luck in our system (is it :question:). More than often $S$ will not contain the solution to EC. In such cases, we still want some partial solution, even if only part of attesters can be collected within a single aggregation.

So, adding constraint similar to MMSC (where we limited number of times $u \in U$ can occur in $S'$), we need to transform the problem into accepting another parameter $k \in [1..n]$, with the purpose of finding the $S'$, where $|\bigcup\limits_{T \in S’}T| \ge k$ i.e. union of elements of found subsets covers at least $k$ elements of $U$. Then by gradually increasing $k$ we want it to be as close to $|U|$ as possible (max k-cover? :question:).

#### Relevant Works:

- Dancing Links by D.Knuth
- Analysing Algorithm X Good overview of algorithm (with additional heuristic to improve performance if single solution is enough)
- github.com/Kappeh/dlx (not particularly optimized implementation)

Applicability of EC:

- If solution exists, then Algorithm X (effectively implemented using DLX) can find it. If full solution is impossible, we need to explore possibility of finding partial cover.

### 4.3. Maximum Coverage

**Def (Maximum Coverage): $MC(U, S, k) \to S'$**

Let $U$ be a finite set of objects, where $|U| = n$. Furthermore, let $S = {S_1, …, S_m | S_i \subseteq 2^U}$ be a collection of its subsets, where $\bigcup_{i = 1}^m S_i = U$.

Then, **Maximum Coverage** (MC) is the problem of
finding $S’ \subseteq S, |S'| \le k$ covering $U$ with maximum cardinality, that’s

$$ |S'| \le k \land \argmax_{S^\prime}|\bigcup_{T \in S^\prime}T| $$

#### Relevant Works:

- Analysis of the Greedy Approach in Problems of Maximum k-Coverage by Dorit S. Hochbaum, Anu Pathria
- Online maximum k-coverage by Giorgio Ausiello, Nicolas Boria, Aristotelis Giannakos, Giorgio Lucarelli, Vangelis Th. Paschos
- A Multi-objective Disjoint Set Covers for Reliable Lifetime Maximization of Wireless Sensor Networks by Bara’a A. Attea, Enan A. Khalil, Suat Özdemir & Oktay Yıldız
- Approximating Maximum Disjoint Coverage in Wireless Sensor Networks by Shagufta Henna and Thomas Erlebach
- The Online Disjoint Set Cover Problem and its Applications by Ashwin Pananjady, Vivek Kumar Bagaria, Rahul Vaze

Applicability of MC:

- With additional requirement of $S_i \cap S_j, \forall i, j \in [1..m], i \ne j$ (pairwise disjoint sets in $S'$) we can have a very useful mechanism to build approximate solutions using greedy approach.

## Summary and Further Work

So, possible solutions can be enumerated as following:

**Exact Cover (EC)**- Can be used to check for solutions if situations when perfect solution exist are not rare.
- If combined with Partial Set Cover (PSC) for partial cover solutions, can match Attestation Aggregation perfectly.

**Maximum Coverage (MC)**- Greedy algorithm + additional constraint of disjoint sets in $S'$
- Gradual increase of $k$ ($1 \to |S|$) to obtain maximal cover for a maximum number of available attestations.

Some items to consider:

- We, need a section with formal runtime analysis – just to make sure upper bounds of what can be done is known.
- Like for many NP-Hard problems, one can get polynomial time algorithm, if the problem size is fixed (parameterized complexity) . Say, for $S'$ subsets, if we only look into solutions of size 4, we are dealing with $O(|S|^4)$ items, even if we enumerate every possible 4-tuples. We might consider some optimization based on this.
- Since, we’re dealing with NP-Hard problems, there’s little chance we will not be bothered by runtime complexity. Another possible attack vector is to rely on parallel computing.
- Consider combining algorithms or running individual algorithms in rounds.
- There are number of highly effective aggregation heuristics that rely on how data is transmitted i.e. instead of concentrating on covering arbitrary set systems, we try to come up with heuristic that will result in a preferable attestations propagating the network ( see Heuristically Partitioned Attestation Aggregation for a very interesting approach).