# GreeDi

GreeDi is an optimizer designed for data sets that are too large to fit into memory. The approach first partitions the data into \(m\) equally sized, non-overlapping chunks. Then, \(l\) elements are selected from each chunk using a standard optimizer, such as naive or lazy greedy. Finally, these \(ml\) examples are merged, and a standard optimizer selects \(k\) examples from this smaller set. In this manner, the algorithm sacrifices some exactness to ensure that memory limitations are not an issue.
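To make the control flow concrete, here is a minimal single-machine sketch of the GreeDi recipe, using a naive greedy optimizer and a facility-location objective standing in for the submodular function. All names here are illustrative, not part of the library's API:

```python
import numpy as np

def greedy_select(sim, candidates, k):
    """Naive greedy maximization of a facility-location objective,
    f(S) = sum_j max_{i in S} sim[i, j], over the given candidates."""
    selected, covered = [], np.zeros(sim.shape[1])
    for _ in range(min(k, len(candidates))):
        # marginal gain of adding i: how much the column-wise maxima improve
        gain, best_i = max(
            (np.maximum(covered, sim[i]).sum() - covered.sum(), i)
            for i in candidates if i not in selected
        )
        selected.append(best_i)
        covered = np.maximum(covered, sim[best_i])
    return selected

def greedi(sim, k, m, l, seed=0):
    """GreeDi: partition, select l per partition, then select k from the union."""
    n = sim.shape[0]
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(n), m)  # m disjoint chunks
    merged = [i for part in partitions
                for i in greedy_select(sim, list(part), l)]
    return greedy_select(sim, merged, k)  # final pass over the l*m exemplars
```

In a real deployment, each partition's selection would run on a separate machine over only its own slice of the data; the sketch keeps everything in one process for clarity.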

There are a few considerations to keep in mind when using GreeDi. Naturally, \(ml\) must be larger than \(k\) but also small enough for the merged set to fit into memory. The larger \(l\) is, the closer the solution is to the exact one, but the more compute time is required. Conversely, the larger \(m\) is, the less exact the solution becomes. When using a graph-based function, increasing \(m\) can dramatically reduce both the computation and the memory required, because each partition's similarity matrix is smaller. However, feature-based functions are likely to see less of a speed improvement, because the cost of evaluating an example is independent of the size of the ground set.
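A quick back-of-the-envelope calculation shows why partitioning helps graph-based functions so much: a dense similarity matrix grows quadratically with the number of examples, so splitting into \(m\) partitions shrinks each partition's matrix by a factor of \(m^2\). The numbers below are purely illustrative:

```python
n = 1_000_000          # total number of examples (illustrative)
m = 100                # number of partitions
bytes_per_entry = 8    # float64

full_matrix = n * n * bytes_per_entry             # one n x n matrix
per_partition = (n // m) ** 2 * bytes_per_entry   # one (n/m) x (n/m) matrix

print(full_matrix / 1e12)               # 8.0  -> 8 TB for the full matrix
print(per_partition / 1e9)              # 0.8  -> 0.8 GB per partition
print(m * per_partition / full_matrix)  # 0.01 -> 100x less memory overall
```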

## API Reference

An approach for optimizing submodular functions in parallel.

This optimizer implements the GreeDi method of Mirzasoleiman et al. for selecting sets in parallel (https://papers.nips.cc/paper/5039-distributed-submodular-maximization-identifying-representative-elements-in-massive-data.pdf).

Briefly, this approach splits the data into m partitions uniformly at random, selects l exemplars from each partition, and then runs a second round of greedy selection on the union of these l*m exemplars to select the final k.

### Parameters

- `function` : base.SubmodularSelection
  - A submodular function that implements the `_calculate_gains` and `_select_next` methods. This is the function that will be optimized.
- `m` : int
  - The number of partitions to split the data into.
- `l` : int
  - The number of exemplars to select from each partition. `l*m` must be larger than the number of exemplars, `k`, that will be selected later on.
- `optimizer1` : str or base.Optimizer, optional
  - The optimizer to use in the first stage, where `l` exemplars are selected from each partition. Default is 'lazy'.
- `optimizer2` : str or base.Optimizer, optional
  - The optimizer to use in the second stage, where `k` exemplars are selected from the `l*m` exemplars returned by the first stage. Default is 'lazy'.
- `random_state` : int or RandomState or None, optional
  - The random seed to use for the random selection process.
- `verbose` : bool
  - Whether to display a progress bar during the optimization process.
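Both stages default to lazy greedy, which exploits submodularity (marginal gains can only shrink as the selected set grows) to avoid recomputing most gains. As a hedged illustration of why that default is sensible, here is a minimal sketch of lazy greedy over a hypothetical facility-location objective, not the library's internal implementation:

```python
import heapq
import numpy as np

def lazy_greedy_select(sim, candidates, k):
    """Lazy greedy: keep stale gains in a max-heap and re-evaluate only
    the top element, since stale gains upper-bound the true gains."""
    covered = np.zeros(sim.shape[1])
    # the initial gain of each candidate is its standalone value
    heap = [(-sim[i].sum(), i) for i in candidates]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, i = heapq.heappop(heap)
        fresh = np.maximum(covered, sim[i]).sum() - covered.sum()
        # if the recomputed gain still beats the best stale gain left in
        # the heap, submodularity guarantees i is the true argmax
        if not heap or fresh >= -heap[0][0]:
            selected.append(i)
            covered = np.maximum(covered, sim[i])
        else:
            heapq.heappush(heap, (-fresh, i))
    return selected
```

The result matches naive greedy (up to ties), but usually with far fewer gain evaluations, which is why lazy greedy is a reasonable default for both stages.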

### Attributes

- `self.function` : base.BaseSelection or None
  - A submodular function that implements the `_calculate_gains` and `_select_next` methods. This is the function that will be optimized. If None, it will be set by the selector when passed in.
- `self.m` : int
  - The number of partitions that the data will be split into.
- `self.l` : int
  - The number of exemplars that will be selected from each partition.
- `self.verbose` : bool
  - Whether to display a progress bar during the optimization process.
- `self.gains_` : numpy.ndarray or None
  - The gain that each example would give, as of the last time it was evaluated.