Usage

GenoPandas provides two main data structures for storing genomics data: the GenomicDataFrame class and the AnnotatedMatrix class. GenomicDataFrames can be used to store any type of location-based genomic data and provide efficient querying using a genomic interval tree structure. AnnotatedMatrices are used to store various types of numeric data in feature-by-sample matrices, together with optional sample annotations. Subclasses of AnnotatedMatrix are specialized for specific data types (such as gene-expression or copy number data) and include support for various manipulations and visualizations of these data.

GenomicDataFrames

The GenomicDataFrame class is a subclass of the pandas DataFrame class and therefore supports the same basic interface as a normal pandas DataFrame. However, in contrast to pandas DataFrames, GenomicDataFrames are required to have a MultiIndex containing three levels describing the genomic range of each row as (chromosome, start_position, end_position). These positions are zero-based: the start position is inclusive, whilst the end position is exclusive. This means that positioned data (which does not span a range, but is located at an exact genomic position) can be described as (chromosome, start_position, start_position + 1). GenomicDataFrames use this MultiIndex to provide efficient querying of genomic ranges using a gloc indexer, which is backed by an intervaltree data structure.
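The half-open coordinate convention can be illustrated in plain Python, independent of GenoPandas. The helper names below are hypothetical and are not part of the library; they only demonstrate the coordinate arithmetic described in this section.

```python
def position_to_range(chromosome, position):
    """Describe a single zero-based position as a half-open range."""
    return (chromosome, position, position + 1)


def range_length(start, end):
    """Number of bases covered by the half-open range [start, end)."""
    return end - start


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open ranges on the same chromosome share a base."""
    return start_a < end_b and start_b < end_a


print(position_to_range('1', 10))   # ('1', 10, 11)
print(range_length(10, 20))         # 10
print(overlaps(10, 20, 20, 30))     # False: adjacent ranges share no base
```

One consequence of this convention is that adjacent ranges such as [10, 20) and [20, 30) tile the genome without overlapping, and the length of a range is simply end minus start.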

Construction

From existing DataFrames

A GenomicDataFrame can easily be constructed from an existing DataFrame as follows:

import pandas as pd

from genopandas import GenomicDataFrame

df = pd.DataFrame.from_records(
    [('1', 10, 20, 'a'),
     ('2', 10, 20, 'b'),
     ('2', 30, 40, 'c')],
    columns=['chromosome', 'start', 'end', 'name'])

df = df.set_index(['chromosome', 'start', 'end'])

GenomicDataFrame(df)

Note the setting of the index, which is required for querying the data later. The constructor does not currently check for the presence of the index, but any querying will result in errors if a proper index is missing.

If you want to check for the presence of a suitable index, you can use the from_df classmethod, which explicitly checks the index of the given dataframe:

GenomicDataFrame.from_df(df)

If a positioned dataframe (with two index levels, chromosome and position) is given to from_df, this index is automatically expanded to three levels containing start/end positions. Alternatively, from_position_df can be used to explicitly expand positioned data, which allows the width of the expanded items to be specified using the width parameter:

df = pd.DataFrame.from_records(
    [('1', 10, 'a'),
     ('2', 10, 'b'),
     ('2', 30, 'c')],
    columns=['chromosome', 'position', 'name'])

df = df.set_index(['chromosome', 'position'])

GenomicDataFrame.from_position_df(df, width=10)

From (delimited) files

GenomicDataFrames can also be read from delimited files using the from_csv method. This method mimics the pd.read_csv method, but requires the index_col argument to indicate which three columns to use for constructing the index. Using this approach, a GTF file (for example) can be read as follows:

gdf = GenomicDataFrame.from_csv(
    '../data/example.gtf',
    sep='\t',
    names=('contig', 'source', 'feature', 'start', 'end', 'score',
           'strand', 'frame', 'attributes'),
    index_col=['contig', 'start', 'end'])

gdf.head()

For several common file types (e.g. BED, GTF), we provide specialized functions for reading data into a GenomicDataFrame. For example, a GTF file can be read more easily than above using the from_gtf method:

import genopandas as gpd

gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.head()

Similarly, BED files can be read using the from_bed method. See the API documentation for the full list of supported file formats. Any unsupported file formats can of course be read using the from_csv method or using the pandas API, as illustrated above.

Querying

As a subclass of the pandas DataFrame class, GenomicDataFrames can be queried in the same manner as normal pandas DataFrames using loc and iloc. However, GenomicDataFrames also provide an additional indexer under the gloc property, which can be used to perform ranged queries over the GenomicDataFrame.

A simple range query can be performed as follows, in which we select any rows overlapping with bases 79935353-79935455 on chromosome 12:

gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.gloc['12'][79935353:79935455]

In this query, the gloc indexer returns a slice object for chromosome 12, which we then slice using the given range coordinates to only select rows within the specified range.
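Conceptually, a range query of this kind selects every row whose range overlaps the query range. The sketch below reproduces that logic with a linear scan over plain tuples. It is an illustration only: the query_range helper is hypothetical, it assumes the query uses the same half-open convention as the index, and gloc itself relies on an interval tree rather than scanning all rows.

```python
rows = [
    ('1', 10, 20, 'a'),
    ('2', 10, 20, 'b'),
    ('2', 30, 40, 'c'),
]


def query_range(rows, chromosome, start, end):
    """Return rows on `chromosome` overlapping the half-open range [start, end)."""
    return [
        row for row in rows
        if row[0] == chromosome and row[1] < end and start < row[2]
    ]


print(query_range(rows, '2', 15, 35))
# [('2', 10, 20, 'b'), ('2', 30, 40, 'c')]
print(query_range(rows, '2', 20, 30))
# []
```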

Entire chromosomes can be selected by passing a list of chromosome names to gloc, which also reorders chromosomes to match the given order:

gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.gloc[['10', '12']]

More complex queries (e.g. with left/right inclusiveness) can be performed using the gloc.search method, which provides the above functionality with some extra options.

Positioning

The gloc indexer can also be used to extract genomic positions directly, using the chromosome, start and end attributes of the indexer:

gdf.gloc.chromosome  # Chromosome values.
gdf.gloc.start       # Start positions.
gdf.gloc.end         # End positions.

The chromosomes present in the data and their lengths can be accessed through the chromosomes and chromosome_lengths attributes. Note that chromosome lengths are inferred from the data if they were not given to the GenomicDataFrame constructor.

The chromosome lengths can be used to calculate offset start/end positions, in which the lengths of preceding chromosomes are included in the start/end positions. This can be useful for plotting data linearly across multiple chromosomes. These offset positions can be accessed using the start_offset and the end_offset attributes.
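The offset calculation can be sketched in plain Python as a cumulative sum over chromosome lengths. The lengths, the chromosome order and the with_offset helper below are assumptions made for illustration; they do not reflect how the library computes these values internally.

```python
from itertools import accumulate

# Hypothetical chromosome lengths, in the order used for plotting.
chromosome_lengths = {'1': 1000, '2': 800, '3': 600}

# Offset of each chromosome = summed length of all preceding chromosomes.
names = list(chromosome_lengths)
cumulative = list(accumulate(chromosome_lengths.values()))
offsets = dict(zip(names, [0] + cumulative[:-1]))


def with_offset(chromosome, start, end):
    """Shift a (start, end) pair onto a single linear genome axis."""
    offset = offsets[chromosome]
    return start + offset, end + offset


print(offsets)                     # {'1': 0, '2': 1000, '3': 1800}
print(with_offset('2', 10, 20))    # (1010, 1020)
```

With offsets applied, positions on later chromosomes are always larger than positions on earlier ones, so all rows can be drawn along a single x-axis.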

AnnotatedMatrices

AnnotatedMatrix classes provide functionality for storing a numeric matrix with features along the rows and samples along the columns, together with additional metadata describing the samples/features. This format is ideal for storing data from different types of high-throughput measurements (such as gene-expression counts or copy number calls) together with the corresponding sample/feature data.

The base AnnotatedMatrix class can be used to store values that are indexed by a set of (named) features, such as gene expression matrices (which contain counts summarized per gene).

Construction

The easiest way to construct an AnnotatedMatrix is using a pre-existing DataFrame. Sample and feature information can be included by passing a DataFrame using the sample_data and feature_data arguments, respectively. Note that the indices of these annotations should correspond with the matrix row/column indices:

import pandas as pd

from genopandas import AnnotatedMatrix

df = pd.DataFrame({
        'sample_1': [1, 2, 3],
        'sample_2': [4, 5, 6]
    },
    index=['gene_a', 'gene_b', 'gene_c'])

sample_data = pd.DataFrame(
    {'condition': ['control', 'treated']},
    index=['sample_1', 'sample_2'])

matrix = AnnotatedMatrix(df, sample_data=sample_data)
matrix

Once constructed, the matrix values can be accessed using the values property, which returns the matrix in DataFrame format. The sample and feature annotations can be retrieved using the sample_data and feature_data attributes.

Subsetting samples/features

An AnnotatedMatrix can be subset using the same column/index accessors as pandas DataFrames (e.g., .loc, .iloc and [] for selecting columns). In this case, the AnnotatedMatrix class ensures that feature/sample annotations are kept in line with the subsetted matrix.

Besides this, a number of specialized methods allow subsetting of the matrix based on the sample/feature annotations. Currently this includes the query_samples and dropna_samples methods, which can be used to query for specific samples or drop samples with NA values in their annotations, respectively. In general, these methods follow the API of their pandas equivalents.
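The idea of keeping annotations in line with the matrix can be sketched with plain dictionaries. This is a conceptual illustration, not the library's implementation: the subset_samples helper and the dictionary layout are hypothetical.

```python
# Matrix values: feature -> {sample -> value}; annotations keyed by sample name.
values = {
    'gene_a': {'sample_1': 1, 'sample_2': 4},
    'gene_b': {'sample_1': 2, 'sample_2': 5},
    'gene_c': {'sample_1': 3, 'sample_2': 6},
}
sample_data = {
    'sample_1': {'condition': 'control'},
    'sample_2': {'condition': 'treated'},
}


def subset_samples(values, sample_data, samples):
    """Keep only `samples` in both the matrix values and the annotation."""
    new_values = {
        feature: {s: row[s] for s in samples}
        for feature, row in values.items()
    }
    new_sample_data = {s: sample_data[s] for s in samples}
    return new_values, new_sample_data


# Select samples by annotation, similar in spirit to query_samples.
treated = [s for s, ann in sample_data.items() if ann['condition'] == 'treated']
sub_values, sub_sample_data = subset_samples(values, sample_data, treated)

print(sub_values['gene_a'])   # {'sample_2': 4}
print(sub_sample_data)        # {'sample_2': {'condition': 'treated'}}
```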

Renaming samples/features

Features and/or samples can be renamed using the rename method, using the index parameter for features and the columns parameter for samples. The corresponding sample/feature annotations are renamed accordingly.

Melting to ‘tidy’ format

Matrices can be ‘melted’ into a tidy format (a.k.a. long format), which may be more suitable for certain types of processing/visualization than the matrix format. This type of transformation is performed using the melt method, which returns a ‘tidy’ pandas DataFrame. Optionally, the parameters with_sample_data and with_feature_data can be used to indicate whether sample/feature annotations should be included in the produced DataFrame.

import seaborn as sns

df_long = matrix.melt(with_sample_data=True)
sns.boxplot(data=df_long, x='condition', y='value')
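To make the shape of the tidy output concrete, the sketch below builds an equivalent long-format table from plain dictionaries. The 'feature' and 'sample' column names are assumptions chosen for illustration; only 'value' and 'condition' are taken from the example above, and the column names produced by melt may differ.

```python
values = {
    'gene_a': {'sample_1': 1, 'sample_2': 4},
    'gene_b': {'sample_1': 2, 'sample_2': 5},
}
sample_data = {
    'sample_1': {'condition': 'control'},
    'sample_2': {'condition': 'treated'},
}

# One record per (feature, sample) pair, with the sample annotation attached.
tidy = [
    {'feature': feature, 'sample': sample, 'value': value,
     **sample_data[sample]}
    for feature, row in values.items()
    for sample, value in row.items()
]

print(len(tidy))   # 4
print(tidy[0])
# {'feature': 'gene_a', 'sample': 'sample_1', 'value': 1, 'condition': 'control'}
```

Each matrix cell becomes one row, which is why a two-feature, two-sample matrix yields four records.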

Plotting

Several high-level plotting functions are provided for plotting matrix values in different representations. For example, the plot_heatmap method is most useful for plotting a (clustered) overview of the matrix values, with optional feature/sample annotations:

matrix.plot_heatmap(sample_cols=['condition'])

Similarly, the plot_pca method can be used to plot a PCA transform of the matrix values. This transformation can be performed along either the sample or feature axes and can be colored according to specific sample/feature annotations:

matrix.plot_pca(hue='condition', axis='samples')

Additionally, the plot_feature method can be used to create categorical plots (boxplot, swarmplot or violin plot) of the values of a single feature. These plots can be grouped by sample annotations to compare the distribution of feature values across different sample groups:

matrix.plot_feature('gene_a', group='condition')

GenomicMatrices

Similar to the GenomicDataFrame, the GenomicMatrix class is a specialized version of the AnnotatedMatrix class that supports storing and querying of genomically-positioned data. Besides this, the GenomicMatrix class provides additional functionality specific to manipulating and plotting genomically-oriented data.

Construction

GenomicMatrix instances can be constructed in the same manner as AnnotatedMatrices, although the matrix values should be supplied as a GenomicDataFrame (with a MultiIndex containing three levels):

import numpy as np
import pandas as pd

from genopandas import GenomicDataFrame, GenomicMatrix


data = pd.DataFrame({
    'chromosome': ['1'] * 50 + ['2'] * 50,
    'start': np.hstack([range(0, 500, 10),
                        range(0, 500, 10)]),
    'end': np.hstack([range(10, 510, 10),
                      range(10, 510, 10)]),
    'sample_1': np.hstack([np.random.randn(50),
                           np.random.randn(50) + 10]),
    'sample_2': np.hstack([np.random.randn(20) + 15,
                           np.random.randn(30) + 0,
                           np.random.randn(50) - 10])
})

# Index by genomic range, as described above, so that only the
# sample columns remain as matrix values.
data = data.set_index(['chromosome', 'start', 'end'])

matrix = GenomicMatrix(GenomicDataFrame(data))
matrix.head()

Matrix values can also be read from delimited files using the from_csv method. Besides this, the class also provides a from_csv_condensed method, which can expand a ‘condensed’ index (such as 1:10-20) to a multi-level index suitable for GenomicDataFrames. The regex used for this expansion can be defined using the index_regex parameter of this method.
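As an illustration of what expanding a condensed index involves, the following sketch parses labels of the form chromosome:start-end with a regular expression. The pattern and the expand_condensed helper are hypothetical; they are not the library's default regex, and the format expected by index_regex may differ.

```python
import re

# Hypothetical pattern for 'chromosome:start-end'; not the library default.
INDEX_REGEX = re.compile(r'^(?P<chromosome>\w+):(?P<start>\d+)-(?P<end>\d+)$')


def expand_condensed(label):
    """Split a condensed label such as '1:10-20' into (chromosome, start, end)."""
    match = INDEX_REGEX.match(label)
    if match is None:
        raise ValueError('Cannot parse condensed index: {!r}'.format(label))
    return (match.group('chromosome'),
            int(match.group('start')),
            int(match.group('end')))


print(expand_condensed('1:10-20'))     # ('1', 10, 20)
print(expand_condensed('X:100-250'))   # ('X', 100, 250)
```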

Querying ranges

Similar to the GenomicDataFrame class, GenomicMatrices can be subset to specific genomic ranges using the gloc indexer. For more details, see the GenomicDataFrame documentation.

Resampling/imputation

For certain analyses or visualizations, it can be useful to resample a GenomicMatrix at a lower resolution or using specific bin sizes. The resample method can be used to resample a matrix to a given bin_size, optionally starting from a given start position:

matrix.resample(bin_size=20, start=0)
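The general idea behind resampling can be sketched for a single sample on a single chromosome. This sketch assumes that each new bin takes the mean of the original bins whose start falls inside it; the aggregation actually used by resample is not described here and may differ, and the function below is a standalone illustration rather than the library method.

```python
from collections import defaultdict
from statistics import mean

# (start, end, value) bins of width 10 for one sample on one chromosome.
bins = [(0, 10, 1.0), (10, 20, 3.0), (20, 30, 5.0), (30, 40, 7.0)]


def resample(bins, bin_size, start=0):
    """Average values into new bins of width `bin_size`, beginning at `start`."""
    grouped = defaultdict(list)
    for bin_start, _, value in bins:
        if bin_start < start:
            continue  # Bins before the requested start are dropped.
        new_start = start + ((bin_start - start) // bin_size) * bin_size
        grouped[new_start].append(value)
    return [(s, s + bin_size, mean(vals)) for s, vals in sorted(grouped.items())]


print(resample(bins, bin_size=20, start=0))
# [(0, 20, 2.0), (20, 40, 6.0)]
```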

Missing values can be imputed from surrounding bins using the impute method, which relies on the rolling median functionality from pandas:

matrix.impute(window=11, min_probes=5)
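A rolling-median imputation can be sketched in plain Python as follows. The sketch assumes a centered window and interprets min_probes as the minimum number of non-missing values required in the window before a value is filled in; both are assumptions about the semantics, and the function is a standalone illustration rather than the library method.

```python
from statistics import median


def impute(values, window=3, min_probes=2):
    """Fill None entries with the median of a centered window of neighbours."""
    half = window // 2
    imputed = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        # Only original (non-imputed) values contribute to the median.
        neighbours = [
            v for v in values[max(0, i - half):i + half + 1] if v is not None
        ]
        if len(neighbours) >= min_probes:
            imputed[i] = median(neighbours)
    return imputed


print(impute([1.0, None, 3.0, None, None, None, 7.0]))
# [1.0, 2.0, 3.0, None, None, None, 7.0]
```

In this sketch, isolated gaps are filled, whereas longer runs of missing values stay missing because too few observed neighbours fall inside the window.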

Plotting

Similar to the AnnotatedMatrix class, the GenomicMatrix class provides a plot_heatmap method for plotting a heatmap of matrix values along a genomic axis:

resampled = matrix.resample(bin_size=20, start=0)
resampled.plot_heatmap()

The plot_sample method can be used to plot values for a single sample:

matrix.plot_sample('sample_1', markersize=5)

Specialized matrices

Besides the basic AnnotatedMatrix and GenomicMatrix classes, a number of more specialized matrix subclasses are provided in the genopandas.ngs module. This currently includes the CnvValueMatrix and CnvCallMatrix classes for CNV data and the ExpressionMatrix class for RNA-seq expression data. See the respective class documentation for more information on class-specific functionality.