API

API documentation for the various genopandas modules/classes.

core

frame

class genopandas.core.frame.GenomicDataFrame(*args, chrom_lengths=None, **kwargs)[source]

DataFrame with fast indexing by genomic position.

Requires columns ‘chromosome’, ‘start’ and ‘end’ to be present in the DataFrame, as these columns are used for indexing.

Examples

Constructing from scratch:

>>> df = pd.DataFrame.from_records(
...    [('1', 10, 20, 'a'), ('2', 10, 20, 'b'), ('2', 30, 40, 'c')],
...    columns=['chromosome', 'start', 'end', 'value'])
>>> df = df.set_index(['chromosome', 'start', 'end'])
>>> GenomicDataFrame(df)

Reading from a GTF file:

>>> GenomicDataFrame.from_gtf('/path/to/reference.gtf')

Querying by a genomic range:

>>> genomic_df.gloc['2'][30:50]

chromosome_lengths

Chromosome lengths.

chromosome_offsets

Chromosome offsets (used when plotting chromosomes linearly).
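
As a rough illustration of what such offsets can look like, the sketch below computes them as the summed length of all preceding chromosomes. This is an assumption about the layout, not a description of the genopandas implementation; the function name `cumulative_offsets` and the use of dict insertion order as plotting order are hypothetical.

```python
from itertools import accumulate


def cumulative_offsets(chrom_lengths):
    """Map each chromosome to the summed length of the chromosomes before it.

    Assumes `chrom_lengths` is a dict whose insertion order is the
    plotting order (hypothetical convention for this sketch).
    """
    names = list(chrom_lengths)
    running_totals = list(accumulate(chrom_lengths[name] for name in names))
    # Prepend 0 so the first chromosome starts at the origin, and drop the
    # final total, which is the end of the last chromosome.
    starts = [0] + running_totals[:-1]
    return dict(zip(names, starts))


offsets = cumulative_offsets({'1': 1000, '2': 800, 'X': 500})
# A position on chromosome '2' is shifted past all of chromosome '1'.
linear_position = offsets['2'] + 30
```

With offsets like these, positions from different chromosomes can be placed on a single linear axis by adding the offset of their chromosome.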

classmethod from_csv(file_path, index_col, drop_index_col=True, chrom_lengths=None, **kwargs)[source]

Creates a GenomicDataFrame from a csv file using pandas.read_csv.

Parameters:
  • file_path (str) – Path to file.
  • index_col (List[str]) – Columns to use for index. Columns should be indicated by their name. Should contain two entries for positioned data, three entries for ranged data. If not given, the first three columns of the DataFrame are used by default.
  • drop_index_col (bool) – Whether to drop the index columns from the DataFrame (True, default) or to keep them in the DataFrame (False).
  • chrom_lengths (Dict[str, int]) – Chromosome lengths.
  • **kwargs – Any extra kwargs are passed to pandas.read_csv.
Returns:

DataFrame containing the file contents.

Return type:

GenomicDataFrame

classmethod from_df(df, **kwargs)[source]

Constructs instance from dataframe containing ranged/positioned data.

classmethod from_gtf(gtf_path, filter_=None)[source]

Creates a GenomicDataFrame from a GTF file.

classmethod from_position_df(df, width=1, **kwargs)[source]

Constructs instance from positioned dataframe.

classmethod from_records(records, index_col, columns=None, drop_index_col=True, chrom_lengths=None, **kwargs)[source]

Creates a GenomicDataFrame from a structured or record ndarray.

gloc

Genomic indexer for querying the dataframe.

rename_chromosomes(mapping)[source]

Returns copy with renamed chromosomes.

reset_index(level=None, drop=False, col_level=0, col_fill='')[source]

Mirrors pd.DataFrame.reset_index, but returns a standard DataFrame instead of a GenomicDataFrame, as the (genomic) index is being dropped.

class genopandas.core.frame.GenomicIndexer(gdf)[source]

Base GenomicIndexer class used to index GenomicDataFrames.

chromosome

Chromosome values.

chromosome_lengths

Chromosome lengths.

chromosome_offsets

Chromosome offsets.

chromosomes

Available chromosomes.

end

End positions.

end_offset

End positions, offset by chromosome lengths.

gdf

The indexed DataFrame.

position

Mid positions (between start/end).

Should correspond with original positions for expanded positioned data.

position_offset

Mid positions (see position), offset by chromosome lengths.

rebuild()[source]

Rebuilds the genomic interval trees.

search(chromosome, start, end, strict_left=False, strict_right=False)[source]

Searches the DataFrame for rows within given range.

start

Start positions.

start_offset

Start positions, offset by chromosome lengths.

trees

Trees used for indexing the DataFrame.

matrix

class genopandas.core.matrix.AnnotatedMatrix(values, sample_data=None, feature_data=None)[source]

AnnotatedMatrix class.

Annotated matrix classes represent 2D numeric feature-by-sample matrices (with ‘features’ along the rows and samples along the columns), which can be annotated with optional sample_data and feature_data frames that describe the samples and features, respectively. The type of feature varies between different sub-classes, examples being genes (for gene expression matrices) and region-based bins (for copy-number data).

This (base) class mainly contains a variety of methods for querying, subsetting and combining different annotated matrices. General plotting methods are also provided (plot_heatmap).

Note that the class follows the feature-by-sample convention that is typically followed in biological packages, rather than the sample-by-feature orientation. This has the additional advantage of allowing more complex indices (such as a region-based MultiIndex) for the features, which are more difficult to use for DataFrame columns than for rows.

values

pd.DataFrame or AnnotatedMatrix – Matrix values.

sample_data

pd.DataFrame – DataFrame containing sample annotations, whose index corresponds with the columns of the matrix.

feature_data

pd.DataFrame – DataFrame containing feature annotations, whose index corresponds with the rows of the matrix.

classmethod concat(matrices, axis)[source]

Concatenates matrices along given axis.

drop_duplicate_indices(axis='index', keep='first')[source]

Drops duplicate indices along given axis.

dropna_samples(subset=None, how='any', thresh=None)[source]

Drops samples with NAs in sample_data.

melt(with_sample_data=False, with_feature_data=False, value_name='value')[source]

Melts values into ‘tidy’ format, optionally including annotation.
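
To make the ‘tidy’ layout concrete, the sketch below melts a small feature-by-sample table into one record per (feature, sample) pair using plain Python lists. It is only an illustration of the output shape; the function name `melt_matrix` and the `'feature'` and `'sample'` keys are hypothetical and may differ from the column names genopandas produces.

```python
def melt_matrix(values, features, samples, value_name='value'):
    """Turn a feature-by-sample table into one record per (feature, sample).

    `values` is a list of rows (one per feature), each holding one entry
    per sample.
    """
    return [
        {'feature': feature, 'sample': sample, value_name: values[i][j]}
        for i, feature in enumerate(features)
        for j, sample in enumerate(samples)
    ]


tidy = melt_matrix([[1.0, 2.0], [3.0, 4.0]],
                   features=['geneA', 'geneB'],
                   samples=['s1', 's2'])
```

Including annotation would then amount to joining extra sample or feature fields onto each record.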

pca(n_components=None, axis='columns', transform=False, with_annotation=False)[source]

Performs PCA on matrix.

plot_feature(feature, group=None, kind='box', ax=None, **kwargs)[source]

Plots distribution of expression for given feature.

plot_heatmap(cmap='RdBu_r', sample_cols=None, sample_colors=None, feature_cols=None, feature_colors=None, metric='euclidean', method='complete', transpose=False, **kwargs)[source]

Plots clustered heatmap of matrix values.

plot_pca(components=(1, 2), axis='columns', ax=None, **kwargs)[source]

Plots PCA of samples.

plot_pca_variance(n_components=None, axis='columns', ax=None)[source]

Plots variance explained by PCA components.

query_samples(expr)[source]

Subsets samples in matrix by querying sample_data with expression.

Similar to the pandas query method, this method queries the sample data of the matrix with the given boolean expression. Any samples for which the expression evaluates to True are returned in the resulting AnnotatedMatrix.

Parameters: expr (str) – The query string to evaluate. You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.
Returns: Subsetted matrix, containing only the samples for which expr evaluates to True.
Return type: AnnotatedMatrix

rename(index=None, columns=None)[source]

Rename samples/features in the matrix.

to_csv(file_path, sample_data_path=None, feature_data_path=None, **kwargs)[source]

Writes matrix values to a csv file, using pandas’ to_csv method.

class genopandas.core.matrix.GenomicMatrix(values, sample_data=None, feature_data=None)[source]

Class representing matrices indexed by genomic positions.

annotate(features, feature_id='gene_id')[source]

Annotates values for given features.

expand()[source]

Expands matrix to include values from missing bins.

Assumes rows are regularly spaced with a fixed bin size.
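
The idea of expanding to missing bins can be sketched for a single chromosome as follows. This is a minimal plain-Python illustration under the stated fixed-bin-size assumption, not the library's implementation; the function name `expand_bins` and the explicit `bin_size` argument are hypothetical.

```python
import math


def expand_bins(values, bin_size):
    """Insert NaN entries for bins missing from a regularly spaced series.

    `values` maps bin start positions (on one chromosome) to a number.
    """
    starts = sorted(values)
    # Every bin start between the first and last observed bin, inclusive.
    full_range = range(starts[0], starts[-1] + bin_size, bin_size)
    return {start: values.get(start, math.nan) for start in full_range}


expanded = expand_bins({0: 0.1, 100: 0.3, 400: -0.5}, bin_size=100)
```

Bins 200 and 300 are absent from the input, so they appear in the result with NaN values.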

classmethod from_csv(file_path, index_col, sample_data=None, feature_data=None, sample_mapping=None, feature_mapping=None, drop_cols=None, chrom_lengths=None, read_data_kws=None, **kwargs)[source]

Reads values from a csv file.

classmethod from_csv_condensed(file_path, index_col=0, sample_data=None, feature_data=None, sample_mapping=None, feature_mapping=None, drop_cols=None, chrom_lengths=None, index_regex='(?P<chromosome>\\w+):(?P<start>\\d+)-(?P<end>\\d+)', is_one_based=False, is_inclusive=False, read_data_kws=None, **kwargs)[source]

Reads values from a csv file with a condensed index.
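
The default index_regex shown in the signature can be tried on its own with the standard `re` module. The sketch below only demonstrates how a condensed label such as `1:100-200` splits into its named groups; the helper name `parse_condensed` is hypothetical, the use of a full-string match is an assumption, and the is_one_based and is_inclusive adjustments are deliberately left out because their exact behavior is not documented here.

```python
import re

# Default pattern from the from_csv_condensed signature, written as a raw
# string (the doubled backslashes in the signature are repr escaping).
INDEX_REGEX = re.compile(r'(?P<chromosome>\w+):(?P<start>\d+)-(?P<end>\d+)')


def parse_condensed(label):
    """Split a 'chromosome:start-end' label into (chromosome, start, end)."""
    match = INDEX_REGEX.fullmatch(label)
    if match is None:
        raise ValueError(f'Cannot parse index label: {label!r}')
    return (match.group('chromosome'),
            int(match.group('start')),
            int(match.group('end')))


parsed = parse_condensed('1:100-200')
```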

classmethod from_df(values, chrom_lengths=None, **kwargs)[source]

Constructs a genomic matrix from the given DataFrame.

gloc

Genomic-position indexer.

Used to select rows from the matrix by their genomic position. Interface is the same as for the GenomicDataFrame gloc property (which this method delegates to).

impute(window=11, min_probes=5, expand=True)[source]

Imputes nan values from neighboring bins.
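
One plausible reading of the window and min_probes parameters is sketched below: a NaN is replaced by the median of the non-NaN values in a centered window, but only if enough of them are available. The choice of the median, the centered window, and the function name `impute_nan` are all assumptions of this sketch; the actual genopandas behavior may differ.

```python
import math
from statistics import median


def impute_nan(values, window=11, min_probes=5):
    """Fill NaNs with the median of non-NaN neighbors in a centered window."""
    half = window // 2
    imputed = list(values)
    for i, value in enumerate(values):
        if not math.isnan(value):
            continue
        neighbors = [v for v in values[max(0, i - half):i + half + 1]
                     if not math.isnan(v)]
        # Leave the gap in place when too few neighbors are available.
        if len(neighbors) >= min_probes:
            imputed[i] = median(neighbors)
    return imputed


nan = math.nan
series = [1.0, 1.0, nan, 1.0, 3.0, 3.0, nan, nan]
filled = impute_nan(series, window=5, min_probes=3)
```

In this example the first gap has four usable neighbors and is filled, while the trailing gaps have too few and stay NaN.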

plot_heatmap(cmap='RdBu_r', sample_cols=None, sample_colors=None, metric='euclidean', method='complete', transpose=True, cluster=True, **kwargs)[source]

Plots heatmap of matrix values over samples.

plot_sample(sample, ax=None, **kwargs)[source]

Plots values for given sample along genomic axis.

rename_chromosomes(mapping)[source]

Returns copy of matrix with renamed chromosomes.

resample(bin_size, start=None, agg='mean')[source]

Resamples values at given interval by binning.
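
Resampling by binning can be pictured with the plain-Python sketch below, which groups positioned values from one chromosome into fixed-width bins and aggregates each bin. It is a simplified stand-in, not the library's code: the function name `resample_positions` is hypothetical, `agg` is a callable here rather than a string such as 'mean', and empty bins are simply omitted.

```python
from collections import defaultdict
from statistics import mean


def resample_positions(points, bin_size, start=0, agg=mean):
    """Aggregate (position, value) pairs into fixed-width bins.

    Returns (bin_start, bin_end, aggregated_value) tuples sorted by
    bin_start. Positions before `start` are ignored.
    """
    binned = defaultdict(list)
    for position, value in points:
        if position < start:
            continue
        binned[(position - start) // bin_size].append(value)
    return [(start + index * bin_size,
             start + (index + 1) * bin_size,
             agg(bin_values))
            for index, bin_values in sorted(binned.items())]


points = [(5, 1.0), (15, 3.0), (25, 2.0), (120, 4.0)]
bins = resample_positions(points, bin_size=20)
```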

tree

class genopandas.core.tree.GenomicIntervalTree(*args, **kwargs)[source]

Data structure for efficiently accessing genomic objects by position.

difference(other)[source]

Returns a new tree, comprising all intervals in self but not in other.

classmethod from_tuples(tuples)[source]

Builds an instance from tuples.

Assumes tuples are sorted by chromosome.

intersection(other)[source]

Returns a new tree of all intervals common to both self and other.

is_empty()[source]

Returns True if tree is empty.

search(chromosome, start, end=None, strict_left=False, strict_right=False)[source]

Searches the tree for objects within given range.
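
The effect of the strict_left and strict_right flags can be illustrated with a linear scan over a list of intervals. The sketch below is not how the tree is implemented; it assumes half-open coordinates and assumes that a strict flag excludes hits extending past the corresponding edge of the query range. The function name `search_intervals` is hypothetical.

```python
def search_intervals(intervals, chromosome, start, end,
                     strict_left=False, strict_right=False):
    """Linear-scan stand-in for an interval-tree query.

    `intervals` is a list of (chromosome, start, end, payload) tuples with
    half-open [start, end) coordinates (an assumption of this sketch).
    """
    hits = []
    for chrom, i_start, i_end, payload in intervals:
        if chrom != chromosome:
            continue
        # Half-open overlap test: skip intervals entirely outside the range.
        if i_start >= end or i_end <= start:
            continue
        # Assumed meaning of the strict flags: the hit may not extend past
        # the corresponding edge of the query range.
        if strict_left and i_start < start:
            continue
        if strict_right and i_end > end:
            continue
        hits.append(payload)
    return hits


rows = [('1', 10, 20, 'a'), ('2', 10, 20, 'b'), ('2', 30, 40, 'c')]
overlapping = search_intervals(rows, '2', 15, 50)
contained = search_intervals(rows, '2', 15, 50, strict_left=True)
```

An interval tree answers the same question without scanning every interval, which is what makes position-based lookups fast on large tables.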

union(other)[source]

Returns a new tree, comprising all intervals from self and other.

ngs

cnv

class genopandas.ngs.cnv.CnvValueMatrix(values, sample_data=None, feature_data=None)[source]

CnvMatrix containing (segmented) logratio values (positions-by-samples).

as_segments(drop_columns=True)[source]

Returns matrix as segments (consecutive stretches with same value).

Assumes that values have already been segmented, i.e. that bins in the same segment have been assigned the same numeric value.

Parameters: drop_columns (bool) – Whether to drop chromosome, start, end and sample columns after setting the index.
Returns: GenomicDataFrame describing genomic segments. Indexed by chromosome, start, end and sample.

Note that the sample index is included to avoid duplicate index errors when reindexing in cases where samples have identical segments.

Return type: GenomicDataFrame
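
The collapsing of equal-valued bins into segments can be sketched for a single sample with `itertools.groupby`. This is an illustration of the idea only, not the library's code; the function name `bins_to_segments` and the tuple layout are hypothetical.

```python
from itertools import groupby


def bins_to_segments(bins):
    """Collapse consecutive bins with equal values into segments.

    `bins` is a list of (chromosome, start, end, value) tuples for one
    sample, sorted by chromosome and start.
    """
    segments = []
    # Group runs of adjacent bins sharing both chromosome and value.
    for (chrom, value), run in groupby(bins, key=lambda b: (b[0], b[3])):
        run = list(run)
        # A segment spans from the first bin's start to the last bin's end.
        segments.append((chrom, run[0][1], run[-1][2], value))
    return segments


bins = [('1', 0, 10, 0.0), ('1', 10, 20, 0.0), ('1', 20, 30, -1.2),
        ('2', 0, 10, -1.2), ('2', 10, 20, -1.2)]
segments = bins_to_segments(bins)
```

Because the chromosome is part of the grouping key, a run never crosses a chromosome boundary even when the values on either side are equal.
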

to_igv(file_path)[source]

Saves data for viewing in IGV.

class genopandas.ngs.cnv.CnvCallMatrix(values, sample_data=None, feature_data=None)[source]

Cnv matrix containing CNV calls (genes-by-samples).

mask_with_controls(column, mask_value=0.0)[source]

Masks calls present in control samples.

Calls are retained if (a) no call is present in the matched control sample, (b) the sample call is more extreme than the control call, or (c) the sample and control have calls with different signs (loss/gain).

Matched control samples should be indicated by the given column in the sample_data annotation.
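
The three retention rules can be written out for a single sample/control pair as below. This sketch assumes that a value of 0 means "no call" and that "more extreme" means a larger absolute value; both are interpretations, and the function name `mask_call` is hypothetical.

```python
def mask_call(sample_call, control_call, mask_value=0.0):
    """Return the sample call, or `mask_value` if the control explains it."""
    if control_call == 0:
        # (a) Nothing was called in the matched control.
        return sample_call
    if abs(sample_call) > abs(control_call):
        # (b) The sample call is more extreme than the control call.
        return sample_call
    if sample_call * control_call < 0:
        # (c) Opposite signs: a loss in one and a gain in the other.
        return sample_call
    return mask_value


kept_no_control = mask_call(-1, 0)
kept_more_extreme = mask_call(-2, -1)
kept_opposite_sign = mask_call(1, -1)
masked = mask_call(1, 1)
```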

rna

class genopandas.ngs.rna.ExpressionMatrix(values, sample_data=None, feature_data=None)[source]

Matrix containing (gene) expression values (features-by-samples).

classmethod from_subread(file_path, sample_data=None, sample_mapping=None, **kwargs)[source]

Reads expression from a subread output file.

normalize(size_factors=None, log2=False)[source]

Normalizes expression counts for sequencing depth.

Normalizes by dividing sample counts using the given (sample) size factors. If no size factors are given, they are calculated using the median-of-ratios approach used by DESeq2.

Parameters:
  • size_factors (np.array) – Array of size factors, length should be equal to the number of samples.
  • log2 (bool) – Whether to also log2-transform the normalized counts.
Returns:

ExpressionMatrix containing normalized counts.

Return type:

ExpressionMatrix
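
For readers unfamiliar with the median-of-ratios approach, the sketch below shows the general idea in plain Python: each sample's size factor is the median, over features, of its count divided by the feature's geometric mean across samples. It is a simplified illustration rather than the genopandas or DESeq2 code; the function names are hypothetical, and features with a zero count in any sample are skipped when estimating the factors.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Estimate per-sample size factors from a features-by-samples table.

    `counts` is a list of rows (one per feature), each a list of
    per-sample counts.
    """
    n_samples = len(counts[0])
    # Only features that are non-zero in every sample have a finite log
    # geometric mean, so the rest are left out of the estimate.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples
                     for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - log_mean
                      for row, log_mean in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize_counts(counts, size_factors=None):
    """Divide each sample's counts by its size factor."""
    if size_factors is None:
        size_factors = median_of_ratios(counts)
    return [[c / f for c, f in zip(row, size_factors)] for row in counts]


counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = median_of_ratios(counts)
normalized = normalize_counts(counts, factors)
```

In this toy table the second sample has exactly twice the depth of the first, so after normalization the usable features have equal values in both samples.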