scanpy.tl.rank_genes_groups

scanpy.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, copy=False, method=None, corr_method='benjamini-hochberg', tie_correct=False, layer=None, **kwds)

Rank genes for characterizing groups.

Expects logarithmized data.

Parameters:
adata AnnData

Annotated data matrix.

groupby str

The key of the observations grouping to consider.

mask_var ndarray[Any, dtype[bool_]] | str | None (default: None)

Select subset of genes to use in statistical tests.

use_raw bool | None (default: None)

Use raw attribute of adata if present.

layer str | None (default: None)

Key from adata.layers whose value will be used to perform tests on.

groups Union[Literal['all'], Iterable[str]] (default: 'all')

Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups. Note that if reference='rest' all groups will still be used as the reference, not just those specified in groups.

reference str (default: 'rest')

If 'rest', compare each group to the union of the remaining groups. If a group identifier, compare each group against that group.

n_genes int | None (default: None)

The number of genes that appear in the returned tables. Defaults to all genes.

method Optional[Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']] (default: None)

The default method is 't-test'. 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression. See [Ntranos18], here and here, for why this is meaningful.

corr_method Literal['benjamini-hochberg', 'bonferroni'] (default: 'benjamini-hochberg')

p-value correction method. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'.

tie_correct bool (default: False)

Use tie correction for 'wilcoxon' scores. Used only for 'wilcoxon'.

rankby_abs bool (default: False)

Rank genes by the absolute value of the score, not by the score. The returned scores are never the absolute values.

pts bool (default: False)

Compute the fraction of cells expressing the genes.

key_added str | None (default: None)

The key in adata.uns under which the results are saved.

copy bool (default: False)

Whether to copy adata or modify it in place.

kwds

Passed to the test methods. Currently this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty='l1' to try to come up with a minimal set of genes that are good predictors (a sparse solution, meaning few non-zero fitted coefficients).
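
The two corr_method options listed above name standard multiple-testing corrections. As a rough illustration of how they differ, here is a pure-Python sketch of the textbook procedures. The function names are hypothetical and this is not scanpy's internal code.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capping at 1."""
    n = len(pvals)
    return [min(p * n, 1.0) for p in pvals]


def benjamini_hochberg(pvals):
    """Textbook Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up p-values for five genes
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

Benjamini-Hochberg controls the false discovery rate and is never more conservative than Bonferroni, which controls the family-wise error rate. That makes it the more common choice when thousands of genes are tested at once.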

Return type:

AnnData | None

Returns:

Returns None if copy=False, else returns an AnnData object. Sets the following fields:

adata.uns['rank_genes_groups' | key_added]['names'] structured numpy.ndarray (dtype object)

Structured array to be indexed by group id storing the gene names. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['scores'] structured numpy.ndarray (dtype object)

Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['logfoldchanges'] structured numpy.ndarray (dtype object)

Structured array to be indexed by group id storing the log2 fold change for each gene for each group. Ordered according to scores. Only provided if method is 't-test'-like. Note: this is an approximation calculated from mean-log values.

adata.uns['rank_genes_groups' | key_added]['pvals'] structured numpy.ndarray (dtype float)

p-values.

adata.uns['rank_genes_groups' | key_added]['pvals_adj'] structured numpy.ndarray (dtype float)

Corrected p-values.

adata.uns['rank_genes_groups' | key_added]['pts'] pandas.DataFrame (dtype float)

Fraction of cells expressing the genes for each group.

adata.uns['rank_genes_groups' | key_added]['pts_rest'] pandas.DataFrame (dtype float)

Only if reference is set to 'rest'. Fraction of cells from the union of the rest of each group expressing the genes.
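
To make the logfoldchanges, pts and pts_rest fields more concrete, the following sketch computes comparable quantities for a single gene. It assumes the expression values are natural-log log1p-transformed counts and that the fold change is taken from the back-transformed group means. That is one reading of "calculated from mean-log values" and should be treated as an assumption. The helper names and the eps pseudocount are hypothetical.

```python
import math


def approx_log2_fold_change(group_vals, rest_vals, eps=1e-9):
    """Approximate log2 fold change from the means of log1p-transformed values."""
    mean_group = sum(group_vals) / len(group_vals)
    mean_rest = sum(rest_vals) / len(rest_vals)
    # Undo log1p on the *means*; eps (an assumed pseudocount) avoids division by zero.
    return math.log2((math.expm1(mean_group) + eps) / (math.expm1(mean_rest) + eps))


def fraction_expressing(vals):
    """Fraction of cells with a non-zero value for the gene."""
    return sum(v > 0 for v in vals) / len(vals)


# Made-up counts for one gene: four cells in the group, five in the rest.
group = [math.log1p(c) for c in (4, 0, 9, 3)]
rest = [math.log1p(c) for c in (0, 1, 0, 0, 2)]

lfc = approx_log2_fold_change(group, rest)
pts = fraction_expressing(group)      # 0.75
pts_rest = fraction_expressing(rest)  # 0.4
```

Because the mean is taken in log space before back-transforming, the result differs from the log2 ratio of mean raw counts, which is why the value is described as an approximation.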

Notes

There are slight inconsistencies depending on whether sparse or dense data are passed. See here.

Examples

>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')
>>> # to visualize the results
>>> sc.pl.rank_genes_groups(adata)
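
The stored fields are structured arrays in which each field name is a group id and each row is a rank position. The sketch below uses a small hand-built stand-in for adata.uns['rank_genes_groups'] rather than real scanpy output, so the group names, gene names and scores are invented. It shows how such arrays can be indexed by group and flattened into rows.

```python
import numpy as np

# Stand-in for the 'names' and 'scores' fields: one field per group,
# one row per rank (row 0 holds the top-ranked gene of every group).
names = np.array(
    [("GeneA", "GeneC"), ("GeneB", "GeneA")],
    dtype=[("g1", object), ("g2", object)],
)
scores = np.array(
    [(5.2, 4.0), (1.1, -0.3)],
    dtype=[("g1", "f4"), ("g2", "f4")],
)
result = {"names": names, "scores": scores}

# Group ids are the field names of the structured array.
groups = result["names"].dtype.names

# Indexing by group id yields that group's ranked genes.
top_g1 = result["names"]["g1"][0]

# Flatten into (group, gene, score) rows, e.g. to build a table.
table = [
    (g, gene, float(score))
    for g in groups
    for gene, score in zip(result["names"][g], result["scores"][g])
]
```

With real output, the same indexing pattern would apply to the other structured fields such as pvals and pvals_adj.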