GreeDi¶
An approach for optimizing submodular functions on massive data sets.
GreeDi is an optimizer designed for data sets that are too large to fit into memory. The approach first partitions the data into \(m\) equally sized, non-overlapping chunks. Then, \(l\) elements are selected from each chunk using a standard optimizer such as naive or lazy greedy. Finally, these \(ml\) elements are merged and a standard optimizer selects \(k\) elements from this set. In this manner, the algorithm sacrifices exactness to ensure that memory limitations are not an issue.
There are a few considerations to keep in mind when using GreeDi. Naturally, the product \(ml\) must be larger than \(k\) and also small enough to fit into memory. The larger \(l\) is, the closer the solution is to the exact solution, but the more compute time is required. Conversely, the larger \(m\) is, the less exact the solution is. When using a graph-based function, increasing \(m\) can dramatically reduce both the amount of computation and the memory requirements, because the similarity matrix for each partition becomes smaller. However, feature-based functions are likely to see less of a speed improvement because the cost of evaluating an example is independent of the size of the ground set.
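The two-stage procedure can be illustrated with a minimal sketch. This is not the library's implementation: the function names (`greedy_select`, `greedi`) and the naive greedy selector are illustrative, and the feature-based saturating function used here is just one example of a submodular objective.

```python
import numpy as np

def greedy_select(X, k, f):
    """Naive greedy: pick k rows of X that maximize the submodular function f."""
    selected, current = [], 0.0
    for _ in range(k):
        best_gain, best_i = -np.inf, None
        for i in range(len(X)):
            if i in selected:
                continue
            gain = f(X[selected + [i]]) - current
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        current += best_gain
    return selected

def greedi(X, k, m, l, f, seed=0):
    """Two-stage GreeDi: select l elements per partition, then k from the merged l*m."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    merged = []
    for chunk in np.array_split(idx, m):        # m disjoint, equally sized partitions
        local = greedy_select(X[chunk], l, f)   # stage 1: l elements per chunk
        merged.extend(chunk[local])             # map back to global indices
    merged = np.array(merged)
    final = greedy_select(X[merged], k, f)      # stage 2: k elements from the l*m pool
    return merged[final]

# A simple feature-based saturating function: f(S) = sum_d sqrt(sum_{i in S} X[i, d])
f = lambda S: np.sqrt(S.sum(axis=0)).sum()

X = np.abs(np.random.default_rng(0).normal(size=(100, 5)))
chosen = greedi(X, k=5, m=4, l=10, f=f)
```

Note that only one partition's rows (plus the \(ml\)-element merged pool) ever need to be in memory at once, which is the source of the memory savings.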
param function:  A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized. 

type function:  base.BaseSelection 
param m:  The number of partitions to split the data into. 
type m:  int 
param l:  The number of exemplars to select from each partition. l*m must be larger than the number of exemplars k that will be selected later on. 
type l:  int 
param optimizer1:  The optimizer to use in the first stage of the process, where l exemplars are selected from each partition. Default is 'lazy'. 
type optimizer1:  str or base.Optimizer, optional 
param optimizer2:  The optimizer to use in the second stage, where k exemplars are selected from the l*m exemplars returned from the first stage. Default is 'lazy'. 
type optimizer2:  str or base.Optimizer, optional 
param random_state:  The random seed to use for the random selection process. 
type random_state:  int or RandomState or None, optional 
param verbose:  Whether to display a progress bar during the optimization process. 
type verbose:  bool 

self.function¶
A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized. If None, it will be set by the selector when passed in.
Type: base.BaseSelection or None

self.m¶
The number of partitions that the data will be split into.
Type: int

self.l¶
The number of exemplars that will be selected from each partition.
Type: int

self.verbose¶
Whether to display a progress bar during the optimization process.
Type: bool

self.gains_¶
The gain that each example would have yielded the last time it was evaluated.
Type: numpy.ndarray or None