Maximum Coverage

Maximum coverage functions aim to maximize the number of features that have a non-zero element in at least one selected example—there is no marginal benefit to observing a variable in two examples. If each variable is thought to be an item in a set, and the data is a binary matrix where a 1 indicates the item is present in the example and 0 indicates it is not, optimizing a maximum coverage function is a solution to the set coverage problem. These functions are useful when the space of variables is massive and each example only sees a small subset of them, which is a common situation when analyzing text data when the variables are words. The maximum coverage function is an instance of a feature-based function when the concave function is minimum.

The general form of a feature-based function is:

\[f(X) = \sum\limits_{d=1}^{D} \min \left( \sum\limits_{n=1}^{N} X_{i, d}, 1 \right)\]

where \(f\) indicates the function that operates on a subset \(X\) that has \(N\) examples and \(D\) dimensions. Importantly, \(X\) is the subset and not the ground set, meaning that the time it takes to evaluate this function is proportional only to the size of the selected subset and not the size of the full data set, like it is for graph-based functions.

API Reference

This file contains code that implements a selector based on the maximum coverage submodular function.

class apricot.functions.maxCoverage.MaxCoverageSelection(n_samples, initial_subset=None, optimizer='two-stage', optimizer_kwds={}, n_jobs=1, random_state=None, verbose=False)[source]

A selector based off a coverage function.

NOTE: All values in your data must be binary for this selection to work.

This function measures the coverage of the features in a data set. The approach simply counts the number of features that take a value of 1 in at least one example in the selected set. Due to this property, it is likely that the function will saturate fairly quickly when selecting many examples unless there are also many features.

This object can be used to solve the set coverage problem, which is to identify as small a set of examples as possible that cover the entire set of features. One would simply run this approach until the gain is 0, at which point all features have been covered.

See https://www2.cs.duke.edu/courses/fall17/compsci632/scribing/scribe2.pdf where the problem is described as maximum coverage.

Parameters:
n_samples : int

The number of examples to return.

initial_subset : list, numpy.ndarray or None

If provided, this should be a list of indices into the data matrix to use as the initial subset, or a group of examples that may not be in the provided data should beused as the initial subset. If indices, the provided array should be one-dimensional. If a group of examples, the data should be 2 dimensional.

optimizer : string or optimizers.BaseOptimizer, optional

The optimization approach to use for the selection. Default is ‘two-stage’, which makes selections using the naive greedy algorithm initially and then switches to the lazy greedy algorithm. Must be one of

‘random’ : randomly select elements (dummy optimizer) ‘modular’ : approximate the function using its modular upper bound ‘naive’ : the naive greedy algorithm ‘lazy’ : the lazy (or accelerated) greedy algorithm ‘approximate-lazy’ : the approximate lazy greedy algorithm ‘two-stage’ : starts with naive and switches to lazy ‘stochastic’ : the stochastic greedy algorithm ‘sample’ : randomly take a subset and perform selection on that ‘greedi’ : the GreeDi distributed algorithm ‘bidirectional’ : the bidirectional greedy algorithm

Default is ‘two-stage’.

optimizer_kwds : dict, optional

Arguments to pass into the optimizer object upon initialization. Default is {}.

n_jobs : int, optional

The number of cores to use for processing. This value is multiplied by 2 when used to set the number of threads. If set to -1, use all cores and threads. Default is -1.

random_state : int or RandomState or None, optional

The random seed to use for the random selection process. Only used for stochastic greedy.

verbose : bool

Whether to print output during the selection process.

Attributes:
n_samples : int

The number of samples to select.

ranking : numpy.array int

The selected samples in the order of their gain with the first number in the ranking corresponding to the index of the first sample that was selected by the greedy procedure.

gains : numpy.array float

The gain of each sample in the returned set when it was added to the growing subset. The first number corresponds to the gain of the first added sample, the second corresponds to the gain of the second added sample, and so forth.

fit(self, X, y=None, sample_weight=None, sample_cost=None)

Run submodular optimization to select the examples.

This method is a wrapper for the full submodular optimization process. It takes in some data set (and optionally labels that are ignored during this process) and selects n_samples from it in the greedy manner specified by the optimizer.

This method will return the selector object itself, not the transformed data set. The transform method will then transform a data set to the selected points, or alternatively one can use the ranking stored in the self.ranking attribute. The fit_transform method will perform both optimization and selection and return the selected items.

Parameters:
X : list or numpy.ndarray, shape=(n, d)

The data set to transform. Must be numeric.

y : list or numpy.ndarray or None, shape=(n,), optional

The labels to transform. If passed in this function will return both the data and th corresponding labels for the rows that have been selected.

sample_weight : list or numpy.ndarray or None, shape=(n,), optional

The weight of each example. Currently ignored in apricot but included to maintain compatibility with sklearn pipelines.

sample_cost : list or numpy.ndarray or None, shape=(n,), optional

The cost of each item. If set, indicates that optimization should be performed with respect to a knapsack constraint.

Returns:
self : FeatureBasedSelection

The fit step returns this selector object.

fit_transform(self, X, y=None, sample_weight=None, sample_cost=None)

Run optimization and select a subset of examples.

This method will first perform the fit step and then perform the transform step, returning a transformed data set.

Parameters:
X : list or numpy.ndarray, shape=(n, d)

The data set to transform. Must be numeric.

y : list or numpy.ndarray or None, shape=(n,), optional

The labels to transform. If passed in this function will return both the data and the corresponding labels for the rows that have been selected. Default is None.

sample_weight : list or numpy.ndarray or None, shape=(n,), optional

The sample weights to transform. If passed in this function will return the selected labels (y) and the selected samples, even if no labels were passed in. Default is None.

sample_cost : list or numpy.ndarray or None, shape=(n,), optional

The cost of each item. If set, indicates that optimization should be performed with respect to a knapsack constraint.

Returns:
X_subset : numpy.ndarray, shape=(n_samples, d)

A subset of the data such that n_samples < n and n_samples is the integer provided at initialization.

y_subset : numpy.ndarray, shape=(n_samples,), optional

The labels that match with the indices of the samples if y is passed in. Only returned if passed in.

sample_weight_subset : numpy.ndarray, shape=(n_samples,), optional

The weight of each example.

transform(self, X, y=None, sample_weight=None)

Transform a data set to include only the selected examples.

This method will return a selection of X and optionally selections of y and sample_weight. The default setting is to select items based on the ranking determined in the fit step with examples in the same order as that ranking. Optionally, the whole data set can be returned, with the weights corresponding to samples that were not selected set to 0. This setting can be controlled by setting pipeline=True.

Parameters:
X : list or numpy.ndarray, shape=(n, d)

The data set to transform. Must be numeric.

y : list or numpy.ndarray or None, shape=(n,), optional

The labels to transform. If passed in this function will return both the data and the corresponding labels for the rows that have been selected. Default is None.

sample_weight : list or numpy.ndarray or None, shape=(n,), optional

The sample weights to transform. If passed in this function will return the selected labels (y) and the selected samples, even if no labels were passed in. Default is None.

Returns:
X_subset : numpy.ndarray, shape=(n_samples, d)

A subset of the data such that n_samples < n and n_samples is the integer provided at initialization.

y_subset : numpy.ndarray, shape=(n_samples,), optional

The labels that match with the indices of the samples if y is passed in. Only returned if passed in.

sample_weight_subset : numpy.ndarray, shape=(n_samples,), optional

The weight of each example.