Mixtures

class apricot.functions.mixture.MixtureSelection(n_samples, functions, weights=None, metric='ignore', initial_subset=None, optimizer='two-stage', optimizer_kwds={}, n_neighbors=None, reservoir=None, max_reservoir_size=1000, n_jobs=1, random_state=None, verbose=False)[source]

A selection approach based on a mixture of submodular functions.

A convenient property of submodular functions is that any linear combination of submodular functions is still submodular. More generally, the linear combination of any number of submodular functions (assuming non-negative weights) is still a submodular function. Because of this, a mixture of submodular functions can be optimized using the same algorithms as an individual submodular function. Mixtures can be useful in situations where there are different important aspects of data that are each submodular.

The general form of a mixture function is

\[f(X) = \sum\limits_{i=1}^{M} \alpha_{i} g_{i}(X)\]

where \(f\) indicates the mixture function, \(M\) is the number of functions in the mixture, \(X\) is a subset, \(\alpha_{i}\) is the weight of the \(i\)-th function and \(g_{i}\) is the \(i\)-th function.

Note

There must be at least two components to the mixture, and all \(\alpha\) must be non-negative.

This class can also be used to add regularizers to the selection procedure. If a submodular function is mixed with another submodular function that acts as a regularizer, such as feature based selection mixed with a custom function measuring some property of the selected subset.

Parameters:
  • n_samples (int) – The number of samples to return.
  • functions (list) – The list of submodular functions to mix together. The submodular functions should be instantiated.
  • weights (list, numpy.ndarray or None, optional) – The relative weight of each submodular function. This is the value that the gain from each submodular function is multiplied by before being added together. The default is equal weight for each function.
  • initial_subset (list, numpy.ndarray or None, optional) – If provided, this should be a list of indices into the data matrix to use as the initial subset, or a group of examples that may not be in the provided data should beused as the initial subset. If indices, the provided array should be one-dimensional. If a group of examples, the data should be 2 dimensional.
  • optimizer (string or optimizers.BaseOptimizer, optional) –

    The optimization approach to use for the selection. Default is ‘two-stage’, which makes selections using the naive greedy algorithm initially and then switches to the lazy greedy algorithm. Must be one of

    ’random’ : randomly select elements (dummy optimizer) ‘modular’ : approximate the function using its modular upper bound ‘naive’ : the naive greedy algorithm ‘lazy’ : the lazy (or accelerated) greedy algorithm ‘approximate-lazy’ : the approximate lazy greedy algorithm ‘two-stage’ : starts with naive and switches to lazy ‘stochastic’ : the stochastic greedy algorithm ‘sample’ : randomly take a subset and perform selection on that ‘greedi’ : the GreeDi distributed algorithm ‘bidirectional’ : the bidirectional greedy algorithm

    Default is ‘two-stage’.

  • optimizer_kwds (dict or None) – A dictionary of arguments to pass into the optimizer object. The keys of this dictionary should be the names of the parameters in the optimizer and the values in the dictionary should be the values that these parameters take. Default is None.
  • reservoir (numpy.ndarray or None) – The reservoir to use when calculating gains in the sieve greedy streaming optimization algorithm in the partial_fit method. Currently only used for graph-based functions. If a numpy array is passed in, it will be used as the reservoir. If None is passed in, will use reservoir sampling to collect a reservoir. Default is None.
  • max_reservoir_size (int) – The maximum size that the reservoir can take. If a reservoir is passed in, this value is set to the size of that array. Default is 1000.
  • n_jobs (int) – The number of threads to use when performing computation in parallel. Currently, this parameter is exposed but does not actually do anything. This will be fixed soon.
  • random_state (int or RandomState or None, optional) – The random seed to use for the random selection process. Only used for stochastic greedy.
  • verbose (bool, optional) – Whether to print output during the selection process.
pq

The priority queue used to implement the lazy greedy algorithm.

Type:PriorityQueue
n_samples

The number of samples to select.

Type:int
submodular_functions

A concave function for transforming feature values, often referred to as phi in the literature.

Type:list
weights

The weights of each submodular function.

Type:numpy.ndarray
ranking

The selected samples in the order of their gain with the first number in the ranking corresponding to the index of the first sample that was selected by the greedy procedure.

Type:numpy.array int
gains

The gain of each sample in the returned set when it was added to the growing subset. The first number corresponds to the gain of the first added sample, the second corresponds to the gain of the second added sample, and so forth.

Type:numpy.array float
fit(X, y=None, sample_weight=None, sample_cost=None)[source]

Run submodular optimization to select the examples.

This method is a wrapper for the full submodular optimization process. It takes in some data set (and optionally labels that are ignored during this process) and selects n_samples from it in the greedy manner specified by the optimizer.

This method will return the selector object itself, not the transformed data set. The transform method will then transform a data set to the selected points, or alternatively one can use the ranking stored in the self.ranking attribute. The fit_transform method will perform both optimization and selection and return the selected items.

Parameters:
  • X (list or numpy.ndarray, shape=(n, d)) – The data set to transform. Must be numeric.
  • y (list or numpy.ndarray or None, shape=(n,), optional) – The labels to transform. If passed in this function will return both the data and th corresponding labels for the rows that have been selected.
  • sample_weight (list or numpy.ndarray or None, shape=(n,), optional) – The weight of each example. Currently ignored in apricot but included to maintain compatibility with sklearn pipelines.
  • sample_cost (list or numpy.ndarray or None, shape=(n,), optional) – The cost of each item. If set, indicates that optimization should be performed with respect to a knapsack constraint.
Returns:

self – The fit step returns this selector object.

Return type:

MixtureSelection

fit_transform(X, y=None, sample_weight=None, sample_cost=None)

Run optimization and select a subset of examples.

This method will first perform the fit step and then perform the transform step, returning a transformed data set.

Parameters:
  • X (list or numpy.ndarray, shape=(n, d)) – The data set to transform. Must be numeric.
  • y (list or numpy.ndarray or None, shape=(n,), optional) – The labels to transform. If passed in this function will return both the data and the corresponding labels for the rows that have been selected. Default is None.
  • sample_weight (list or numpy.ndarray or None, shape=(n,), optional) – The sample weights to transform. If passed in this function will return the selected labels (y) and the selected samples, even if no labels were passed in. Default is None.
  • sample_cost (list or numpy.ndarray or None, shape=(n,), optional) – The cost of each item. If set, indicates that optimization should be performed with respect to a knapsack constraint.
Returns:

  • X_subset (numpy.ndarray, shape=(n_samples, d)) – A subset of the data such that n_samples < n and n_samples is the integer provided at initialization.
  • y_subset (numpy.ndarray, shape=(n_samples,), optional) – The labels that match with the indices of the samples if y is passed in. Only returned if passed in.
  • sample_weight_subset (numpy.ndarray, shape=(n_samples,), optional) – The weight of each example.

transform(X, y=None, sample_weight=None)

Transform a data set to include only the selected examples.

This method will return a selection of X and optionally selections of y and sample_weight. The default setting is to select items based on the ranking determined in the fit step with examples in the same order as that ranking. Optionally, the whole data set can be returned, with the weights corresponding to samples that were not selected set to 0. This setting can be controlled by setting pipeline=True.

Parameters:
  • X (list or numpy.ndarray, shape=(n, d)) – The data set to transform. Must be numeric.
  • y (list or numpy.ndarray or None, shape=(n,), optional) – The labels to transform. If passed in this function will return both the data and the corresponding labels for the rows that have been selected. Default is None.
  • sample_weight (list or numpy.ndarray or None, shape=(n,), optional) – The sample weights to transform. If passed in this function will return the selected labels (y) and the selected samples, even if no labels were passed in. Default is None.
Returns:

  • X_subset (numpy.ndarray, shape=(n_samples, d)) – A subset of the data such that n_samples < n and n_samples is the integer provided at initialization.
  • y_subset (numpy.ndarray, shape=(n_samples,), optional) – The labels that match with the indices of the samples if y is passed in. Only returned if passed in.
  • sample_weight_subset (numpy.ndarray, shape=(n_samples,), optional) – The weight of each example.