GreeDi¶
An approach for optimizing submodular functions on massive data sets.
GreeDi is an optimizer designed for data sets that are too large to fit into memory. The approach first partitions the data into \(m\) equally sized, non-overlapping chunks. Then, \(l\) elements are selected from each chunk using a standard optimizer such as naive or lazy greedy. Finally, these \(ml\) elements are merged and a standard optimizer selects \(k\) elements from this set. In this manner, the algorithm sacrifices exactness to ensure that memory limitations are not an issue.
There are a few considerations to keep in mind when using GreeDi. Naturally, the product \(ml\) must be larger than \(k\) and also small enough to fit into memory. The larger \(l\) is, the closer the solution is to the exact solution, but the more compute time is required. Conversely, the larger \(m\) is, the less exact the solution is. When using a graph-based function, increasing \(m\) can dramatically reduce both the amount of computation and the memory requirements, because the similarity matrix for each partition becomes smaller. However, feature-based functions are likely to see less of a speed improvement because the cost of evaluating an example is independent of the size of the ground set.
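The two-stage procedure can be illustrated with a minimal sketch. This is not the library's implementation: the function names (`greedy_select`, `greedi`) and the naive greedy selector are illustrative, and the feature-based saturating function used here is just one example of a submodular objective.

```python
import numpy as np

def greedy_select(X, k, f):
    """Naive greedy: pick k rows of X that maximize the submodular function f."""
    selected, current = [], 0.0
    for _ in range(k):
        best_gain, best_i = -np.inf, None
        for i in range(len(X)):
            if i in selected:
                continue
            gain = f(X[selected + [i]]) - current
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        current += best_gain
    return selected

def greedi(X, k, m, l, f, seed=0):
    """Two-stage GreeDi: select l elements per partition, then k from the merged l*m."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    merged = []
    for chunk in np.array_split(idx, m):        # m disjoint, equally sized partitions
        local = greedy_select(X[chunk], l, f)   # stage 1: l elements per chunk
        merged.extend(chunk[local])             # map back to global indices
    merged = np.array(merged)
    final = greedy_select(X[merged], k, f)      # stage 2: k elements from the l*m pool
    return merged[final]

# A simple feature-based saturating function: f(S) = sum_d sqrt(sum_{i in S} X[i, d])
f = lambda S: np.sqrt(S.sum(axis=0)).sum()

X = np.abs(np.random.default_rng(0).normal(size=(100, 5)))
chosen = greedi(X, k=5, m=4, l=10, f=f)
```

Note that only one partition's rows (plus the \(ml\)-element merged pool) ever need to be in memory at once, which is the source of the memory savings.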
param function:  A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized. 

type function:  base.BaseSelection 
param m:  The number of partitions to split the data into. 
type m:  int 
param l:  The number of exemplars to select from each partition. l*m must be larger than the number of exemplars k that will be selected later on. 
type l:  int 
param optimizer1:  The optimizer to use in the first stage of the process, where l exemplars are selected from each partition. Default is 'lazy'. 
type optimizer1:  str or base.Optimizer, optional 
param optimizer2:  The optimizer to use in the second stage, where k exemplars are selected from the l*m exemplars returned from the first stage. Default is 'lazy'. 
type optimizer2:  str or base.Optimizer, optional 
param random_state:  The random seed to use for the random selection process. 
type random_state:  int or RandomState or None, optional 
param verbose:  Whether to display a progress bar during the optimization process. 
type verbose:  bool 

self.function¶
A submodular function that implements the _calculate_gains and _select_next methods. This is the function that will be optimized. If None, it will be set by the selector when passed in.
Type: base.BaseSelection or None

self.m¶
The number of partitions that the data will be split into.
Type: int

self.l¶
The number of exemplars that will be selected from each partition.
Type: int

self.verbose¶
Whether to display a progress bar during the optimization process.
Type: bool

self.gains_¶
The gain that each example would have yielded the last time it was evaluated.
Type: numpy.ndarray or None