# GreeDi

GreeDi is an optimizer that was designed to work on data sets that are too large to fit into memory. The approach involves first partitioning the data into $$m$$ equally sized chunks without any overlap. Then, $$l$$ examples are selected from each chunk using a standard optimizer such as naive or lazy greedy. Finally, these $$ml$$ examples are merged and a standard optimizer selects $$k$$ examples from this set. In this manner, the algorithm sacrifices some exactness to ensure that memory limitations are not an issue.
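To make the two-stage structure concrete, here is a minimal NumPy sketch of the procedure using a naive greedy step on a simple feature-based (square root of summed feature values) objective. It is purely illustrative: the `greedy_select` and `greedi_select` helpers are hypothetical and not part of the library's API.

```python
import numpy as np

def greedy_select(X, k):
    # Naive greedy maximization of a feature-based objective
    # f(S) = sum_d sqrt(sum_{i in S} X[i, d]); purely illustrative.
    selected, totals = [], np.zeros(X.shape[1])
    for _ in range(k):
        gains = np.sqrt(totals + X).sum(axis=1) - np.sqrt(totals).sum()
        gains[selected] = -np.inf          # never re-pick a chosen example
        best = int(np.argmax(gains))
        selected.append(best)
        totals = totals + X[best]
    return selected

def greedi_select(X, k, m, l, random_state=0):
    # Stage 1: split the data into m random partitions and select l examples
    # from each partition with the greedy routine above.
    idx = np.random.RandomState(random_state).permutation(len(X))
    partitions = np.array_split(idx, m)
    stage1 = np.concatenate([p[greedy_select(X[p], l)] for p in partitions])

    # Stage 2: merge the l*m candidates and greedily select the final k.
    return stage1[greedy_select(X[stage1], k)]

X = np.abs(np.random.RandomState(0).randn(1000, 20))
print(greedi_select(X, k=10, m=4, l=5))
```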

There are a few considerations to keep in mind when using GreeDi. Naturally, $$ml$$ must be larger than $$k$$ while also remaining small enough to fit into memory. The larger $$l$$ is, the closer the solution is to the exact one, but the more compute time is required. Conversely, the larger $$m$$ is, the less exact the solution becomes. When using a graph-based function, increasing $$m$$ can dramatically reduce both the amount of computation that needs to be performed and the memory requirements, because the similarity matrix for each partition becomes smaller. However, feature-based functions are likely to see less of a speed improvement because the cost of evaluating an example is independent of the size of the ground set.
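As a rough usage sketch, the snippet below selects $$k = 100$$ examples from 10,000 using $$m = 10$$ partitions and $$l = 50$$ exemplars per partition, so $$ml = 500$$ candidates are merged before the final stage. It assumes apricot's `FeatureBasedSelection` accepts an `optimizer` string and an `optimizer_kwds` dictionary for the optimizer's parameters; check the signature of the installed version.

```python
import numpy as np
from apricot import FeatureBasedSelection

X = np.abs(np.random.RandomState(0).randn(10000, 50))

# k=100 selected examples; m=10 partitions of ~1,000 examples each; l=50
# exemplars per partition, so l*m = 500 > k candidates reach the second stage.
selector = FeatureBasedSelection(100, 'sqrt', optimizer='greedi',
                                 optimizer_kwds={'m': 10, 'l': 50})
X_subset = selector.fit_transform(X)
```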

## API Reference

An approach for optimizing submodular functions in parallel.

This optimizer implements the GreeDi method for selecting sets in parallel by Mirzasoleiman et al. (https://papers.nips.cc/paper/5039-distributed-submodular-maximization-identifying-representative-elements-in-massive-data.pdf).

Briefly, this approach splits the data into m partitions uniformly at random, selects l exemplars from each partition, and then runs a second round of greedy selection on the union of these l*m exemplars to select the final k.

### Parameters

function : base.SubmodularSelection
A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized.
m : int
The number of partitions to split the data into.
l : int
The number of exemplars to select from each partition. l*m must be larger than the number of exemplars k that will be selected later on.
optimizer1 : str or base.Optimizer, optional
The optimizer to use in the first stage of the process where l exemplars are selected from each partition. Default is ‘lazy’.
optimizer2 : str or base.Optimizer, optional
The optimizer to use in the second stage where k exemplars are selected from the l*m exemplars returned from the first stage. Default is ‘lazy’.
random_state : int or RandomState or None, optional
The random seed to use for the random selection process.
verbose : bool
Whether to display a progress bar during the optimization process.
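A GreeDi optimizer can also be constructed directly with the parameters above and handed to a selector, which then sets the function itself. This is a minimal sketch that assumes the optimizer can be imported from `apricot.optimizers` and that selectors accept an optimizer object through their `optimizer` parameter.

```python
from apricot import FacilityLocationSelection
from apricot.optimizers import GreeDi

# The function is left unset here; the selector assigns it during fitting.
# l*m = 200 candidates, which is larger than the k=100 examples selected.
optimizer = GreeDi(m=8, l=25, optimizer1='lazy', optimizer2='naive',
                   random_state=0)
selector = FacilityLocationSelection(100, metric='euclidean',
                                     optimizer=optimizer)
```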

### Attributes

self.function : base.BaseSelection or None
A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized. If None, will be set by the selector when passed in.
self.m : int
The number of partitions that the data will be split into.
self.l : int
The number of exemplars that will be selected from each partition.
self.verbose : bool
Whether to display a progress bar during the optimization process.
self.gains_ : numpy.ndarray or None
The gain that each example would have given the last time it was evaluated.