Documentation

model

Copyright 2016-2017 ETH Zurich, Eirini Arvaniti and Manfred Claassen.

This module contains functions for performing a CellCnn analysis.

class cellCnn.model.CellCnn(ncell=200, nsubset=1000, per_sample=False, subset_selection='random', maxpool_percentages=[0.01, 1.0, 5.0, 20.0, 100.0], scale=True, quant_normed=False, nfilter_choice=[3, 4, 5, 6, 7, 8, 9], dropout='auto', dropout_p=0.5, coeff_l1=0, coeff_l2=0.0001, learning_rate=None, regression=False, max_epochs=20, patience=5, nrun=15, dendrogram_cutoff=0.4, accur_thres=0.95, verbose=1)

Creates a CellCnn model.

Args:
  • ncell :
    Number of cells per multi-cell input.
  • nsubset :
    Total number of multi-cell inputs that will be generated per class, if per_sample = False. Total number of multi-cell inputs that will be generated from each input sample, if per_sample = True.
  • per_sample :
    Whether the nsubset argument refers to each class or each input sample. For regression problems, it is automatically set to True.
  • subset_selection :
    Either ‘random’ or ‘outlier’: multi-cell inputs are generated uniformly at random or with a bias towards outliers, respectively. The latter option is only relevant for detecting extremely rare (frequency < 0.1%) cell populations.
  • maxpool_percentages :
    A list specifying candidate percentages of cells that will be max-pooled per filter. For instance, mean pooling corresponds to maxpool_percentages = [100].
  • nfilter_choice :
    A list specifying candidate numbers of filters for the neural network.
  • scale :
    Whether to z-transform each feature (mean = 0, standard deviation = 1) prior to training.
  • quant_normed :
    Whether the input samples have already been pre-processed with quantile normalization. In this case, each feature is zero-centered by subtracting 0.5.
  • nrun :
    Number of neural network configurations to try (should be at least 3).
  • regression :
    Set to True for a regression problem. Default is False, which corresponds to a classification setting.
  • learning_rate :
    Learning rate for the Adam optimization algorithm. If set to None, learning rates in the range [0.001, 0.01] will be tried out.
  • dropout :
    Whether to use dropout (at each epoch, set a neuron to zero with probability dropout_p). The default behavior ‘auto’ uses dropout when nfilter > 5.
  • dropout_p :
    The dropout probability.
  • coeff_l1 :
    Coefficient for L1 weight regularization.
  • coeff_l2 :
    Coefficient for L2 weight regularization.
  • max_epochs :
    Maximum number of epochs, i.e. full passes through the training data.
  • patience :
    Number of epochs to wait before early stopping: training stops once the validation loss has not decreased for this many epochs.
  • dendrogram_cutoff :
    Cutoff for hierarchical clustering of filter weights. Clustering is performed using cosine similarity, so the cutoff should be in [0, 1]. A lower cutoff will generate more clusters.
  • accur_thres :
    Keep filters from models achieving at least this accuracy. If fewer than 3 models pass the accuracy threshold, keep filters from the best 3 models.
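
The maxpool_percentages argument can be read as a "top-k" pooling rule: for each filter, only the most strongly responding fraction of cells contributes to the pooled value. The following is a minimal sketch of that idea in plain Python, not the CellCnn implementation. The function name top_percent_pool and the rounding rule (round down, but pool at least one cell) are assumptions made for illustration.

```python
def top_percent_pool(responses, percentage):
    """Mean of the highest `percentage` percent of per-cell filter responses."""
    if not responses:
        raise ValueError("responses must not be empty")
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    # Assumption: round down, but always pool at least one cell.
    k = max(1, int(len(responses) * percentage / 100.0))
    top = sorted(responses, reverse=True)[:k]
    return sum(top) / k


# Hypothetical responses of one filter to five cells.
cell_responses = [0.1, 0.9, 0.4, 0.0, 0.6]

pooled_max = top_percent_pool(cell_responses, 1.0)     # k = 1: max pooling
pooled_top2 = top_percent_pool(cell_responses, 40.0)   # k = 2: mean of 0.9 and 0.6
pooled_mean = top_percent_pool(cell_responses, 100.0)  # k = 5: mean pooling
```

Under these assumptions, small percentages behave like max pooling (suited to rare populations), while 100 reduces to the mean over all cells, matching the note on maxpool_percentages = [100] above.
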
fit(train_samples, train_phenotypes, outdir, valid_samples=None, valid_phenotypes=None, generate_valid_set=True)

Trains a CellCnn model.

Args:
  • train_samples :
    List with input samples (e.g. cytometry samples) as numpy arrays.
  • train_phenotypes :
    List of phenotypes associated with the samples in train_samples.
  • outdir :
    Directory where output will be generated.
  • valid_samples :
    List with samples to be used as validation set while training the network.
  • valid_phenotypes :
    List of phenotypes associated with the samples in valid_samples.
  • generate_valid_set :
    If valid_samples is not provided, generate a validation set from the train_samples.
Returns:

A trained CellCnn model with the additional attribute results. The attribute results is a dictionary with the following entries:

  • clustering_result : clustered filter weights from all runs achieving validation accuracy above the specified threshold accur_thres
  • selected_filters : a consensus filter matrix from the above clustering result
  • best_3_nets : the 3 best models (achieving highest validation accuracy)
  • best_net : the best model
  • w_best_net : filter and output weights of the best model
  • accuracies : list of validation accuracies achieved by different models
  • best_model_index : list index of the best model
  • config : list of neural network configurations used
  • scaler : a z-transform scaler object fitted to the training data
  • n_classes : number of output classes
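
The relationship between accuracies, best_model_index and the accur_thres fallback can be sketched as follows. This is an illustrative reading of the descriptions above, not code taken from CellCnn. The function select_models and the example accuracy values are hypothetical.

```python
def select_models(accuracies, accur_thres=0.95, min_models=3):
    """Return (best_model_index, indices of models whose filters are kept)."""
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    # Rank model indices from highest to lowest validation accuracy.
    ranked = sorted(range(len(accuracies)),
                    key=lambda i: accuracies[i], reverse=True)
    best_model_index = ranked[0]
    passing = [i for i in ranked if accuracies[i] >= accur_thres]
    # Fall back to the best `min_models` models when too few pass the threshold.
    kept = passing if len(passing) >= min_models else ranked[:min_models]
    return best_model_index, kept


# Hypothetical validation accuracies from five runs.
accuracies = [0.91, 0.97, 0.88, 0.96, 0.93]
best, kept = select_models(accuracies)
# Only two runs reach 0.95 here, so the three best runs are kept instead.
```
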
predict(new_samples, ncell_per_sample=None)

Makes predictions for new samples.

Args:
  • new_samples :
    List with input samples (numpy arrays) for which predictions will be made.
  • ncell_per_sample :
    Size of the multi-cell inputs (only one multi-cell input is created per input sample). If set to None, the size of the multi-cell inputs equals the minimum size in new_samples.
Returns:
y_pred : Phenotype predictions for new_samples.
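
To make the role of ncell_per_sample concrete, here is a small sketch of how one multi-cell input per sample could be drawn. It uses nested Python lists instead of numpy arrays to stay dependency-free. The helper make_prediction_inputs, the seed argument and the choice of sampling without replacement (with replacement only if a sample is too small) are assumptions, not documented CellCnn behavior.

```python
import random


def make_prediction_inputs(new_samples, ncell_per_sample=None, seed=0):
    """Draw one multi-cell input (a list of cells) from each sample."""
    if ncell_per_sample is None:
        # Default described above: use the size of the smallest sample.
        ncell_per_sample = min(len(sample) for sample in new_samples)
    rng = random.Random(seed)
    inputs = []
    for sample in new_samples:
        if len(sample) >= ncell_per_sample:
            # Assumption: sample cells without replacement when possible.
            inputs.append(rng.sample(sample, ncell_per_sample))
        else:
            # Assumption: fall back to sampling with replacement.
            inputs.append([rng.choice(sample) for _ in range(ncell_per_sample)])
    return inputs


# Hypothetical samples: rows are cells, columns are two markers.
sample_a = [[0.1, 1.2], [0.3, 0.8], [0.5, 1.0], [0.2, 0.9]]
sample_b = [[1.1, 0.2], [0.9, 0.4], [1.3, 0.1]]
inputs = make_prediction_inputs([sample_a, sample_b])
```

With ncell_per_sample left as None, both multi-cell inputs contain three cells, the size of the smaller sample.
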

plotting

Copyright 2016-2017 ETH Zurich, Eirini Arvaniti and Manfred Claassen.

This module contains functions for plotting the results of a CellCnn analysis.

cellCnn.plotting.plot_results(results, samples, phenotypes, labels, outdir, filter_diff_thres=0.2, filter_response_thres=0, response_grad_cutoff=None, stat_test=None, log_yscale=False, group_a='group A', group_b='group B', group_names=None, tsne_ncell=10000, regression=False, clustering=None, add_filter_response=False, percentage_drop_cluster=0.1, min_cluster_freq=0.2, show_filters=True)

Plots the results of a CellCnn analysis.

Args:
  • results :
    Dictionary containing the results of a CellCnn analysis.
  • samples :
    Samples from which to visualize the selected cell populations.
  • phenotypes :
    List of phenotypes corresponding to the provided samples.
  • labels :
    Names of measured markers.
  • outdir :
    Output directory where the generated plots will be stored.
  • filter_diff_thres :
    Threshold that defines which filters are most discriminative. Given an array filter_diff of average cell filter response differences between classes, sorted in decreasing order, keep a filter i, i > 0 if it holds that filter_diff[i-1] - filter_diff[i] < filter_diff_thres * filter_diff[i-1]. For regression problems, the array filter_diff contains Kendall’s tau values for each filter.
  • filter_response_thres :
    Threshold for choosing a responding cell population. Default is 0.
  • response_grad_cutoff :
    Threshold on the gradient of the cell filter response CDF. It may be useful for defining the selected cell population.
  • stat_test: None | ‘ttest’ | ‘mannwhitneyu’
    Optionally, perform a statistical test on selected cell population frequencies between two groups and report the corresponding p-value on the boxplot figure (see plots description below). Default is None. Currently only used for binary classification problems.
  • group_a :
    Name of the first class.
  • group_b :
    Name of the second class.
  • group_names :
    List of names for the different phenotype classes.
  • log_yscale :
    If True, display the y-axis of the boxplot figure (see plots description below) in logarithmic scale.
  • clustering: None | ‘dbscan’ | ‘louvain’
    Post-processing option for selected cell populations. Default is None.
  • tsne_ncell :
    Number of cells to include in t-SNE calculations and plots.
  • regression :
    Whether it is a regression problem.
  • show_filters :
    Whether to plot learned filter weights.
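
The filter_diff_thres rule can be sketched as a walk down the sorted filter_diff array. The version below is one possible reading of the rule, not the library's code. It assumes that the top-ranked filter is always kept and that selection stops at the first drop that violates the inequality. The function name select_discriminative_filters and the example values are hypothetical.

```python
def select_discriminative_filters(filter_diff, filter_diff_thres=0.2):
    """Return indices of filters kept by the relative-drop rule."""
    if not filter_diff:
        return []
    # Sort filter indices by decreasing response difference.
    order = sorted(range(len(filter_diff)),
                   key=lambda i: filter_diff[i], reverse=True)
    ranked = [filter_diff[i] for i in order]
    kept = [order[0]]  # assumption: the top filter is always kept
    for i in range(1, len(ranked)):
        if ranked[i - 1] - ranked[i] < filter_diff_thres * ranked[i - 1]:
            kept.append(order[i])
        else:
            break  # assumption: stop at the first large drop
    return kept


# Hypothetical response differences for four consensus filters.
filter_diff = [0.10, 0.50, 0.45, 0.05]
selected = select_discriminative_filters(filter_diff)
```

Here filters 1 and 2 are kept: the drop from 0.50 to 0.45 is below 20% of 0.50, while the drop from 0.45 to 0.10 is not.
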
Returns:

A list with the indices and corresponding cell filter response thresholds of selected discriminative filters. This function also produces a collection of plots for model interpretation. These plots are stored in outdir. They comprise the following:

  • clustered_filter_weights.pdf :
    Filter weight vectors from all trained networks that pass a validation accuracy threshold, grouped in clusters via hierarchical clustering. Each row corresponds to a filter. The last column(s) indicate the weight(s) connecting each filter to the output class(es). Indices on the y-axis indicate the filter cluster memberships, as a result of the hierarchical clustering procedure.
  • consensus_filter_weights.pdf :
    One representative filter per cluster is chosen (the filter with minimum distance to all other members of the cluster). We call these selected filters “consensus filters”.
  • best_net_weights.pdf :
    Filter weight vectors of the network that achieved the highest validation accuracy.
  • filter_response_differences.pdf :
    Difference in cell filter response between classes for each consensus filter. To compute this difference for a filter, we first choose a filter-specific class: the class with the highest output weight connected to the filter. Then we compute the average cell filter response (value after the pooling layer) for validation samples belonging to the filter-specific class (v1) and for validation samples not belonging to it (v0). The difference is computed as v1 - v0. For regression problems, a difference between classes cannot be computed. Instead, we compute Kendall’s rank correlation coefficient between the predictions of each individual filter (value after the pooling layer) and the true response values. This plot helps decide on a cutoff (filter_diff_thres parameter) for selecting discriminative filters.
  • tsne_all_cells.png :
    Marker distribution overlaid on t-SNE map.
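
The v1 - v0 computation described for filter_response_differences.pdf can be sketched for a single filter as follows. This is an illustration under stated assumptions, not the CellCnn implementation: class labels are taken to be integers 0..n_classes-1 that index the filter's output weights, and the function name filter_response_difference and all example values are hypothetical.

```python
def filter_response_difference(output_weights, pooled_responses, phenotypes):
    """Return (filter_specific_class, v1 - v0) for one filter."""
    # Filter-specific class: the class with the largest output weight.
    filter_class = max(range(len(output_weights)),
                       key=lambda c: output_weights[c])
    in_class = [r for r, y in zip(pooled_responses, phenotypes)
                if y == filter_class]
    out_class = [r for r, y in zip(pooled_responses, phenotypes)
                 if y != filter_class]
    if not in_class or not out_class:
        raise ValueError("need validation samples inside and outside the class")
    v1 = sum(in_class) / len(in_class)    # mean response, filter-specific class
    v0 = sum(out_class) / len(out_class)  # mean response, all other samples
    return filter_class, v1 - v0


# Hypothetical filter: output weights for classes 0 and 1, and the pooled
# response of the filter on four validation samples.
output_weights = [-0.3, 0.8]
pooled = [0.1, 0.2, 0.9, 0.7]
phenotypes = [0, 0, 1, 1]
filter_class, diff = filter_response_difference(output_weights, pooled, phenotypes)
```
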

In addition, the following plots are produced for each selected filter (e.g. filter i):

  • cdf_filter_i.pdf :
    Cumulative distribution function of cell filter response for filter i. This plot helps decide on a cutoff (filter_response_thres parameter) for selecting the responding cell population.
  • selected_population_distribution_filter_i.pdf :
    Histograms of univariate marker expression profiles for the cell population selected by filter i vs all cells.
  • selected_population_frequencies_filter_i.pdf :
    Boxplot of selected cell population frequencies in samples of the different classes, if running a classification problem. For regression settings, a scatter plot of selected cell population frequencies vs response variable is generated.
  • tsne_cell_response_filter_i.png :
    Cell filter response overlaid on t-SNE map.
  • tsne_selected_cells_filter_i.png :
    Marker distribution of selected cell population overlaid on t-SNE map.
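
The per-filter CDF and frequency plots both start from per-cell filter responses. The sketch below shows how an empirical CDF and the frequency of the selected population in each sample could be derived from such responses. It is not the plotting code of CellCnn: the helper names, the example values and the strict "greater than filter_response_thres" selection rule are assumptions.

```python
def empirical_cdf(responses):
    """Return sorted responses and the fraction of cells at or below each."""
    xs = sorted(responses)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def selected_frequency(responses, filter_response_thres=0.0):
    """Fraction of cells whose filter response exceeds the threshold."""
    if not responses:
        raise ValueError("responses must not be empty")
    # Assumption: a cell is selected when its response is strictly above
    # the threshold.
    n_selected = sum(1 for r in responses if r > filter_response_thres)
    return n_selected / len(responses)


# Hypothetical per-cell responses of one filter in two samples.
responses_by_sample = {
    "sample_1": [0.0, 0.0, 0.4, 1.2],
    "sample_2": [0.0, 0.0, 0.0, 0.1],
}
xs, cdf = empirical_cdf(responses_by_sample["sample_1"])
frequencies = {name: selected_frequency(r)
               for name, r in responses_by_sample.items()}
```

Raising filter_response_thres moves the cut further right along the CDF and shrinks the selected population, which is the trade-off the cdf_filter_i.pdf plot is meant to help with.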