Usage¶
GenoPandas provides two main data structures for storing genomics data: the
GenomicDataFrame class and the AnnotatedMatrix class. GenomicDataFrames can be
used to store any type of location-based genomic data and provide efficient
querying through a genomic interval tree structure. AnnotatedMatrices are used
to store various types of numeric data in feature-by-sample matrices, together
with optional sample annotations. Subclasses of AnnotatedMatrix specialize this
structure for specific data types (such as gene-expression or copy number data)
and include support for various manipulations and visualizations of these data.
GenomicDataFrames¶
The GenomicDataFrame class is a subclass of the pandas DataFrame class and
therefore supports the same basic interface as a normal pandas DataFrame.
However, in contrast to pandas DataFrames, GenomicDataFrames are required to
have a MultiIndex containing three levels describing the genomic range of
each row as (chromosome, start_position, end_position). These positions
are zero-based; the start position is inclusive, whilst the end position is
exclusive. This means that positioned data (which does not span a range, but
is located at an exact genomic position) can be described as (chromosome,
start_position, start_position + 1). GenomicDataFrames use this MultiIndex to
provide efficient querying of genomic ranges through the gloc indexer, which is
backed by an interval tree data structure.
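The coordinate convention can be illustrated with plain pandas. The sketch below does not use GenoPandas and does not reproduce its interval tree; the helper name overlaps is hypothetical, and the overlap test is simply the standard one for zero-based, half-open ranges:

```python
import pandas as pd

# Three ranges plus one single-base position, stored as (start, start + 1).
index = pd.MultiIndex.from_tuples(
    [('1', 10, 20), ('2', 10, 20), ('2', 30, 40), ('2', 25, 26)],
    names=['chromosome', 'start', 'end'])
df = pd.DataFrame({'name': ['a', 'b', 'c', 'snp']}, index=index)


def overlaps(frame, chromosome, start, end):
    """Select rows overlapping the half-open range [start, end)."""
    chrom = frame.index.get_level_values('chromosome')
    starts = frame.index.get_level_values('start')
    ends = frame.index.get_level_values('end')
    # Two half-open ranges overlap when each starts before the other ends.
    mask = (chrom == chromosome) & (starts < end) & (ends > start)
    return frame[mask]


hits = overlaps(df, '2', 20, 30)
print(hits['name'].tolist())  # ['snp']
```

Row 'b' ends at 20 (exclusive) and row 'c' starts at 30, so neither overlaps the query range [20, 30); only the single-base position at 25 does.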
Construction¶
From existing DataFrames¶
A GenomicDataFrame can easily be constructed from an existing DataFrame as follows:
import pandas as pd
from genopandas import GenomicDataFrame

df = pd.DataFrame.from_records(
    [('1', 10, 20, 'a'),
     ('2', 10, 20, 'b'),
     ('2', 30, 40, 'c')],
    columns=['chromosome', 'start', 'end', 'name'])
# The three-level index is required for range queries later on.
df = df.set_index(['chromosome', 'start', 'end'])
GenomicDataFrame(df)
Note the setting of the index, which is required for querying the data later. The constructor does not currently check for the presence of this index, but any query will result in errors if a proper index is missing.
If you want to check for the presence of a suitable index, you can use the
from_df classmethod, which explicitly checks the index of the given DataFrame:
GenomicDataFrame.from_df(df)
If a positioned DataFrame (with two index levels, chromosome and position) is
given to from_df, this index is automatically expanded to three levels
containing start/end positions. Alternatively, from_position_df can be used to
explicitly expand positioned data, which allows the width of the expanded items
to be specified using the width parameter:
df = pd.DataFrame.from_records(
    [('1', 10, 'a'),
     ('2', 10, 'b'),
     ('2', 30, 'c')],
    columns=['chromosome', 'position', 'name'])
df = df.set_index(['chromosome', 'position'])
GenomicDataFrame.from_position_df(df, width=10)
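To make the idea of expanding positions concrete, here is a minimal pandas-only sketch. The helper expand_positions is hypothetical, and it assumes that each range starts at the position and extends width bases to the right; how from_position_df actually anchors the expanded range is not specified here and may differ:

```python
import pandas as pd

df = pd.DataFrame.from_records(
    [('1', 10, 'a'), ('2', 10, 'b'), ('2', 30, 'c')],
    columns=['chromosome', 'position', 'name'])


def expand_positions(frame, width=1):
    """Turn (chromosome, position) rows into (chromosome, start, end) rows.

    Assumption: the range is [position, position + width).
    """
    expanded = frame.assign(
        start=frame['position'],
        end=frame['position'] + width)
    expanded = expanded.drop(columns='position')
    return expanded.set_index(['chromosome', 'start', 'end'])


expanded = expand_positions(df, width=10)
print(expanded.index.tolist())
```

With width=1 this reduces to the (chromosome, start_position, start_position + 1) form described for positioned data.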
From (delimited) files¶
GenomicDataFrames can also be read from delimited files using the from_csv
method. This method mimics the pd.read_csv function, but requires the index_col
argument to indicate which three columns to use for constructing the index.
Using this approach, a GTF file (for example) can be read as follows:
gdf = GenomicDataFrame.from_csv(
    '../data/example.gtf',
    sep='\t',
    names=('contig', 'source', 'feature', 'start', 'end', 'score',
           'strand', 'frame', 'attributes'),
    index_col=['contig', 'start', 'end'])
gdf.head()
For several common file types (e.g. BED, GTF), we provide specialized functions
for reading data into a GenomicDataFrame. For example, a GTF file can be read
more easily than above using the from_gtf method:
import genopandas as gpd

gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.head()
Similarly, BED files can be read using the from_bed method. See the API
for the full list of supported file formats. Any unsupported file format can
be read using the from_csv method or using the pandas API, as illustrated
above.
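For the pandas route, the following sketch reads tab-delimited, GTF-like text into a DataFrame with the three-level index that a GenomicDataFrame expects. The two records are made up and inlined so that the example does not need a file on disk; only standard pandas calls are used:

```python
import csv
import io

import pandas as pd

# Two made-up GTF-like records (nine tab-separated fields each).
gtf_text = (
    '12\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "g1";\n'
    '12\thavana\texon\t100\t150\t.\t+\t.\tgene_id "g1";\n'
)
columns = ['contig', 'source', 'feature', 'start', 'end', 'score',
           'strand', 'frame', 'attributes']

df = pd.read_csv(
    io.StringIO(gtf_text),
    sep='\t',
    names=columns,
    dtype={'contig': str},       # Keep chromosome names as strings.
    quoting=csv.QUOTE_NONE)      # Attribute values contain literal quotes.
df = df.set_index(['contig', 'start', 'end'])
print(df.index.tolist())
```

The resulting frame can then be passed to the GenomicDataFrame constructor as shown earlier. Note that GTF coordinates are 1-based with inclusive ends, so they may need shifting to match the zero-based, end-exclusive convention; this sketch leaves them untouched.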
Querying¶
As a subclass of the pandas DataFrame class, GenomicDataFrames can be
queried in the same manner as normal pandas DataFrames using loc and iloc.
However, GenomicDataFrames also provide an additional indexer under the gloc
property, which can be used to perform ranged queries over the
GenomicDataFrame.
A simple range query can be performed as follows, in which we select any rows overlapping with bases 79935353-79935455 on chromosome 12:
gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.gloc['12'][79935353:79935455]
In this query, the gloc indexer returns a slice object for chromosome 12,
which we then slice using the given range coordinates to select only the rows
that overlap the specified range.
Entire chromosomes can be selected by passing a list of chromosome names to
gloc, which also reorders chromosomes to match the given order:
gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
gdf.gloc[['10', '12']]
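The effect of selecting chromosomes in a given order can be mimicked with plain pandas. This is only an illustration of the described behaviour, not the gloc implementation; select_chromosomes is a hypothetical helper:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('10', 5, 15), ('11', 0, 10), ('12', 20, 30), ('10', 40, 50)],
    names=['chromosome', 'start', 'end'])
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd']}, index=index)


def select_chromosomes(frame, chromosomes):
    """Return rows for the given chromosomes, in the given chromosome order."""
    chrom = frame.index.get_level_values('chromosome')
    # Concatenating per-chromosome subsets enforces the requested order.
    return pd.concat([frame[chrom == name] for name in chromosomes])


subset = select_chromosomes(df, ['12', '10'])
print(subset['name'].tolist())  # ['c', 'a', 'd']
```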
More complex queries (e.g. with left/right inclusiveness) can be performed using
the gloc.search method, which provides the above functionality with some
extra options.
Positioning¶
The gloc indexer can also be used to extract genomic positions directly,
using the chromosome, start and end attributes of the indexer:
gdf.gloc.chromosome # Chromosome values.
gdf.gloc.start # Start positions.
gdf.gloc.end # End positions.
The chromosomes and their lengths can be accessed through the chromosomes and
chromosome_lengths attributes. Note that chromosome lengths are inferred from
the data if they were not given to the GenomicDataFrame constructor.
The chromosome lengths can be used to calculate offset start/end positions, in
which the lengths of preceding chromosomes are included in the start/end
positions. This can be useful for plotting data linearly across multiple
chromosomes. These offset positions can be accessed using the start_offset
and end_offset attributes.
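The offset calculation can be sketched with plain pandas as follows. Two assumptions are made here: chromosome lengths are inferred as the largest end position per chromosome, and chromosomes are ordered by first appearance. The library's own inference and ordering may differ:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('1', 10, 20), ('1', 80, 100), ('2', 0, 50), ('3', 5, 15)],
    names=['chromosome', 'start', 'end'])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]}, index=index)

chrom = df.index.get_level_values('chromosome')
starts = df.index.get_level_values('start').to_numpy()
ends = df.index.get_level_values('end').to_numpy()

# Assumption: a chromosome's length is its largest observed end position.
lengths = pd.Series(ends, index=chrom).groupby(level=0, sort=False).max()

# A chromosome's offset is the summed length of all preceding chromosomes.
offsets = lengths.cumsum().shift(1, fill_value=0)

per_row_offset = offsets.reindex(chrom).to_numpy()
start_offset = starts + per_row_offset
end_offset = ends + per_row_offset
print(start_offset.tolist())  # [10, 80, 100, 155]
```

Plotting against start_offset places chromosome 2 directly after chromosome 1 on a single linear axis.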
AnnotatedMatrices¶
The AnnotatedMatrix class provides functionality for storing a numeric
matrix with features along the rows and samples along the columns, together
with additional metadata describing the samples/features. This format
is well suited to storing data from different types of high-throughput
measurements (such as gene-expression counts or copy number calls) together
with the corresponding sample/feature data.
The base AnnotatedMatrix class can be used to store values that are indexed
by a set of (named) features, such as gene expression matrices (which contain
counts summarized per gene).
Construction¶
The easiest way to construct an AnnotatedMatrix is from a pre-existing
DataFrame. Sample and feature information can be included by passing DataFrames
via the sample_data and feature_data arguments, respectively. Note that the
indices of these annotations should correspond with the matrix column/row
indices:
import pandas as pd
from genopandas import AnnotatedMatrix

df = pd.DataFrame(
    {'sample_1': [1, 2, 3],
     'sample_2': [4, 5, 6]},
    index=['gene_a', 'gene_b', 'gene_c'])
# Sample annotations are indexed by the matrix column names.
sample_data = pd.DataFrame(
    {'condition': ['control', 'treated']},
    index=['sample_1', 'sample_2'])
matrix = AnnotatedMatrix(df, sample_data=sample_data)
Once constructed, the matrix values can be accessed using the values
property, which returns the matrix in DataFrame format. The sample and feature
annotations can be retrieved using the sample_data and feature_data
attributes.
Subsetting samples/features¶
An AnnotatedMatrix can be subset using the same column/index accessors as pandas DataFrames (e.g., .loc, .iloc and [] for selecting columns). In this case, the AnnotatedMatrix class ensures that feature/sample annotations are kept in line with the subsetted matrix.
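What "kept in line" means can be shown with two plain DataFrames. This sketch only illustrates the bookkeeping that the class is described as doing for you; it does not use the AnnotatedMatrix API:

```python
import pandas as pd

values = pd.DataFrame(
    {'sample_1': [1, 2, 3], 'sample_2': [4, 5, 6], 'sample_3': [7, 8, 9]},
    index=['gene_a', 'gene_b', 'gene_c'])
sample_data = pd.DataFrame(
    {'condition': ['control', 'treated', 'treated']},
    index=['sample_1', 'sample_2', 'sample_3'])

# Subset (and reorder) the matrix first...
sub_values = values.loc[['gene_a', 'gene_c'], ['sample_3', 'sample_1']]

# ...then realign the sample annotation to the remaining columns.
sub_sample_data = sample_data.reindex(sub_values.columns)

print(sub_sample_data['condition'].tolist())  # ['treated', 'control']
```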
Besides this, a number of specialized methods allow subsetting of the matrix
based on the sample/feature annotations. Currently this includes the
query_samples and dropna_samples methods, which can be used to query for
specific samples or to drop samples with NA values in their annotations,
respectively. In general, these methods follow the API of their pandas
equivalents.
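The pandas equivalents referred to here are DataFrame.query and DataFrame.dropna. The sketch below applies them to a sample annotation and then subsets the matrix columns accordingly; it illustrates the idea only, and the exact signatures of query_samples and dropna_samples are not shown in this guide:

```python
import pandas as pd

values = pd.DataFrame(
    {'s1': [1, 2], 's2': [3, 4], 's3': [5, 6]},
    index=['gene_a', 'gene_b'])
sample_data = pd.DataFrame(
    {'condition': ['control', 'treated', 'treated'],
     'batch': ['b1', None, 'b2']},
    index=['s1', 's2', 's3'])

# Query the annotation, then keep only the matching matrix columns.
treated = sample_data.query('condition == "treated"')
treated_values = values[treated.index]

# Drop samples with a missing 'batch' annotation.
annotated = sample_data.dropna(subset=['batch'])
annotated_values = values[annotated.index]

print(list(treated_values.columns))    # ['s2', 's3']
print(list(annotated_values.columns))  # ['s1', 's3']
```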
Renaming samples/features¶
Features and/or samples can be renamed using the rename method, using
the index parameter for features and the columns parameter for samples.
The corresponding sample/feature annotations are renamed accordingly.
Melting to ‘tidy’ format¶
Matrices can be ‘melted’ into a tidy format (a.k.a. long format), which may
be more suitable for certain types of processing/visualization than the matrix
format. This type of transformation is performed using the melt method,
which returns a ‘tidy’ pandas DataFrame. Optionally, the parameters
with_sample_data and with_feature_data can be used to indicate whether
sample/feature annotations should be included in the produced DataFrame.
import seaborn as sns
df_long = matrix.melt(with_sample_data=True)
sns.boxplot(data=df_long, x='condition', y='value')
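For reference, the same kind of long-format table can be built with plain pandas. The column names feature, sample and value are choices made for this sketch; the columns produced by the melt method itself may be named differently:

```python
import pandas as pd

values = pd.DataFrame(
    {'sample_1': [1, 2], 'sample_2': [4, 5]},
    index=pd.Index(['gene_a', 'gene_b'], name='feature'))
sample_data = pd.DataFrame(
    {'condition': ['control', 'treated']},
    index=pd.Index(['sample_1', 'sample_2'], name='sample'))

# Wide to long: one row per (feature, sample) pair.
df_long = values.reset_index().melt(
    id_vars='feature', var_name='sample', value_name='value')

# Attach the sample annotation by joining on the sample name.
df_long = df_long.merge(sample_data, left_on='sample', right_index=True)
print(df_long)
```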
Plotting¶
Several high-level plotting functions are provided for plotting matrix values
in different representations. For example, the plot_heatmap method is most
useful for plotting a (clustered) overview of the matrix values, with optional
feature/sample annotations:
matrix.plot_heatmap(sample_cols=['condition'])
Similarly, the plot_pca method can be used to plot a PCA transform of the
matrix values. This transformation can be performed along either the sample or
feature axes and can be colored according to specific sample/feature
annotations:
matrix.plot_pca(hue='condition', axis='samples')
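As background, a PCA over the sample axis treats each sample as an observation and each feature as a variable. The NumPy sketch below computes such a transform via a singular value decomposition on random data; it is a generic illustration, and no claim is made that plot_pca computes its transform this way:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.DataFrame(
    rng.normal(size=(20, 6)),
    index=[f'gene_{i}' for i in range(20)],
    columns=[f'sample_{i}' for i in range(6)])

# PCA over samples: samples are observations, features are variables.
observations = values.T
centered = observations - observations.mean(axis=0)

# SVD of the centered data; the rows of vt are the principal axes.
u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)

# Coordinates of each sample on the first two principal components.
components = pd.DataFrame(
    u[:, :2] * s[:2], index=observations.index, columns=['pc_1', 'pc_2'])

# Fraction of the total variance explained by each component.
explained = (s ** 2) / np.sum(s ** 2)
print(components.shape)  # (6, 2)
```

Transposing the roles (using values instead of values.T) gives the transform along the feature axis.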
Additionally, the plot_feature method can be used to create categorical
plots (boxplot, swarmplot or violin plot) of feature values. These plots
can be grouped by different sample annotations to compare distributions
of feature values across different sample groups:
matrix.plot_feature('gene_a', group='condition')
GenomicMatrices¶
Similar to the GenomicDataFrame, the GenomicMatrix class is a specialized
version of the AnnotatedMatrix class that supports storing and querying of
genomically-positioned data. Besides this, the GenomicMatrix class provides
additional functionality specific to manipulating and plotting
genomically-oriented data.
Construction¶
GenomicMatrix instances can be constructed in the same manner as AnnotatedMatrices, although the matrix values should be supplied as a GenomicDataFrame (with a MultiIndex containing three levels):
import numpy as np
import pandas as pd
from genopandas import GenomicMatrix

data = pd.DataFrame({
    'chromosome': ['1'] * 50 + ['2'] * 50,
    'start': np.hstack([range(0, 500, 10),
                        range(0, 500, 10)]),
    'end': np.hstack([range(10, 510, 10),
                      range(10, 510, 10)]),
    'sample_1': np.hstack([np.random.randn(50),
                           np.random.randn(50) + 10]),
    'sample_2': np.hstack([np.random.randn(20) + 15,
                           np.random.randn(30) + 0,
                           np.random.randn(50) + -10])
})
matrix = GenomicMatrix(data)
matrix.head()
Matrix values can also be read from delimited files using the from_csv
method. Besides this, the class also provides a from_csv_condensed method,
which can expand a ‘condensed’ index (such as 1:10-20) to a multi-level
index suitable for GenomicDataFrames. The regex used for this expansion
can be defined using the index_regex parameter of this method.
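Expanding a condensed index amounts to a regular-expression split of each label. The sketch below does this with pandas string methods; the pattern shown is a hypothetical one written for labels of the form 1:10-20, and it is not necessarily the default used by from_csv_condensed:

```python
import pandas as pd

condensed = pd.DataFrame(
    {'sample_1': [0.1, 0.2], 'sample_2': [0.3, 0.4]},
    index=['1:10-20', 'X:30-40'])

# Hypothetical pattern for '<chromosome>:<start>-<end>' labels.
pattern = r'^(?P<chromosome>\w+):(?P<start>\d+)-(?P<end>\d+)$'

# Named groups become the columns chromosome, start and end.
parts = condensed.index.to_series().str.extract(pattern)
parts = parts.astype({'start': int, 'end': int})

expanded = condensed.copy()
expanded.index = pd.MultiIndex.from_frame(parts)
print(expanded.index.tolist())
```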
Querying ranges¶
Similar to the GenomicDataFrame class, GenomicMatrices can be subset to
specific genomic ranges using the gloc indexer. For more details, see the
GenomicDataFrame documentation.
Resampling/imputation¶
For certain analyses or visualizations, it can be useful to resample a
GenomicMatrix at a lower resolution or using specific bin sizes. The
resample method can be used to resample a matrix to a given bin_size,
optionally starting from a given start position:
matrix.resample(bin_size=20, start=0)
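One simple way to think about resampling is to assign each row to a fixed-size bin and aggregate per bin. The pandas sketch below does this under two assumptions that are not taken from the library: rows are binned by their start position, and values within a bin are averaged. The names bin_size and bin_origin are local to this example:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['1'] * 6, list(range(0, 60, 10)), list(range(10, 70, 10))],
    names=['chromosome', 'start', 'end'])
values = pd.DataFrame(
    {'sample_1': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]}, index=index)

bin_size, bin_origin = 20, 0

flat = values.reset_index()
# Assign each row to the bin that contains its start position.
flat['start'] = (flat['start'] - bin_origin) // bin_size * bin_size + bin_origin
flat['end'] = flat['start'] + bin_size

# Assumption: values falling in the same bin are averaged.
resampled = flat.groupby(['chromosome', 'start', 'end']).mean()
print(resampled['sample_1'].tolist())  # [2.0, 6.0, 10.0]
```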
Missing values can be imputed from surrounding bins using the impute method,
which uses the rolling median functionality from pandas:
matrix.impute(window=11, min_probes=5)
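The underlying pandas mechanism can be sketched on a single column of values. Here the window and min_periods arguments of pandas' rolling are used directly; how the impute parameters window and min_probes map onto them is an assumption, as is the use of a centered window:

```python
import numpy as np
import pandas as pd

values = pd.Series(
    [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0], name='sample_1')

# Centered rolling median; min_periods is the minimum number of
# non-missing values a window needs in order to produce a result.
rolling_median = values.rolling(window=3, min_periods=2, center=True).median()

# Fill only the missing entries, leaving observed values untouched.
imputed = values.fillna(rolling_median)
print(imputed.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```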
Plotting¶
Similar to the AnnotatedMatrix class, the GenomicMatrix class provides a
plot_heatmap method for plotting a heatmap of matrix values along a genomic
axis:
resampled = matrix.resample(bin_size=20, start=0)
resampled.plot_heatmap()
The plot_sample method can be used to plot values for a single sample:
matrix.plot_sample('sample_1', markersize=5)
Specialized matrices¶
Besides the basic AnnotatedMatrix and GenomicMatrix classes, a number
of more specialized matrix subclasses are provided in the genopandas.ngs
module. This currently includes the CnvValueMatrix and CnvCallMatrix
classes for CNV data and the ExpressionMatrix class for RNA-seq expression
data. See the respective class documentation for more information on
class-specific functionality.