=====
Usage
=====

GenoPandas provides two main data structures for storing genomics data: the
``GenomicDataFrame`` class and the ``AnnotatedMatrix`` class. GenomicDataFrames
can store any type of location-based genomic data and provide efficient
querying using a genomic intervaltree structure. AnnotatedMatrices store
various types of numeric data in feature-by-sample matrices, together with
optional sample annotations. Subclasses of AnnotatedMatrix specialize it for
specific data types (such as gene-expression or copy number data) and include
support for various manipulations and visualizations of these data.

GenomicDataFrames
-----------------

The ``GenomicDataFrame`` class is a subclass of the pandas DataFrame class and
therefore supports the same basic interface as a normal pandas DataFrame.
However, in contrast to pandas DataFrames, GenomicDataFrames are required to
have a MultiIndex containing three levels describing the genomic range of
each row as (chromosome, start_position, end_position). These positions
are zero-based and the start position is inclusive, whilst the end position is
exclusive. This means that positioned data (which does not span a range, but
is located at an exact genomic position) can be described as (chromosome,
start_position, start_position + 1). GenomicDataFrames use this MultiIndex
to provide efficient querying of genomic ranges using a ``gloc`` indexer,
which is backed by an intervaltree data structure.
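
The half-open convention can be illustrated with a minimal sketch in plain
Python. The ``overlaps`` helper below is hypothetical and not part of
GenoPandas; it only shows that two ranges on the same chromosome overlap when
each one starts before the other ends:

.. code-block:: python

    def overlaps(a, b):
        """Return True if two (chromosome, start, end) ranges overlap.

        Ranges are zero-based and half-open: the start is inclusive and
        the end is exclusive.
        """
        a_chrom, a_start, a_end = a
        b_chrom, b_start, b_end = b
        return a_chrom == b_chrom and a_start < b_end and b_start < a_end


    # A single base at position 10 is described as the range [10, 11).
    position = ('1', 10, 11)

    overlaps(position, ('1', 10, 20))  # True: 10 is the inclusive start.
    overlaps(position, ('1', 0, 10))   # False: 10 is the exclusive end.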

Construction
~~~~~~~~~~~~

From existing DataFrames
========================

A GenomicDataFrame can easily be constructed from an existing DataFrame as
follows:

.. code-block:: python

    import pandas as pd

    from genopandas import GenomicDataFrame

    df = pd.DataFrame.from_records(
        [('1', 10, 20, 'a'),
         ('2', 10, 20, 'b'),
         ('2', 30, 40, 'c')],
        columns=['chromosome', 'start', 'end', 'name'])

    df = df.set_index(['chromosome', 'start', 'end'])

    GenomicDataFrame(df)

Note that the index must be set, as it is required for querying the data later.
The constructor does not currently check for the presence of the index, but
any query will result in an error if a proper index is missing.

If you want to check for the presence of a suitable index, you can use the
``from_df`` classmethod, which explicitly checks the index of the given
dataframe:

.. code-block:: python

    GenomicDataFrame.from_df(df)

If a positioned dataframe (with two index levels, chromosome and
position) is given to ``from_df``, this index is automatically expanded to
three levels containing start/end positions. Alternatively,
``from_position_df`` can be used to explicitly expand positioned data,
which allows the width of the expanded items to be specified using the
``width`` parameter:

.. code-block:: python

    df = pd.DataFrame.from_records(
        [('1', 10, 'a'),
         ('2', 10, 'b'),
         ('2', 30, 'c')],
        columns=['chromosome', 'position', 'name'])

    df = df.set_index(['chromosome', 'position'])

    GenomicDataFrame.from_position_df(df, width=10)
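
To make the expansion concrete, the following plain-pandas sketch builds a
three-level index from positioned data. It assumes, purely for illustration,
that each interval starts at the given position and spans ``width`` bases;
the library may anchor the interval differently:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame.from_records(
        [('1', 10, 'a'),
         ('2', 10, 'b'),
         ('2', 30, 'c')],
        columns=['chromosome', 'position', 'name'])

    width = 10

    # Assumption: the interval starts at the position and spans `width` bases.
    expanded = (
        df.assign(start=df['position'], end=df['position'] + width)
          .drop(columns='position')
          .set_index(['chromosome', 'start', 'end']))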

From (delimited) files
======================

GenomicDataFrames can also be read from delimited files using the ``from_csv``
method. This method mimics the ``pd.read_csv`` method, but requires the
``index_col`` argument to indicate which three columns to use for constructing
the index. Using this approach, a GTF file (for example) can be read as follows:

.. code-block:: python

    gdf = GenomicDataFrame.from_csv(
        '../data/example.gtf',
        sep='\t',
        names=('contig', 'source', 'feature', 'start', 'end', 'score',
               'strand', 'frame', 'attributes'),
        index_col=['contig', 'start', 'end'])

    gdf.head()

For several common file types (e.g. BED, GTF), we provide specialized functions
for reading data into a GenomicDataFrame. For example, a GTF file can be read
more easily than above using the ``from_gtf`` method:

.. code-block:: python

    import genopandas as gpd

    gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
    gdf.head()

Similarly, BED files can be read using the ``from_bed`` method. See the API
documentation for the full list of supported file formats. Any unsupported
file formats can of course be read using the ``from_csv`` method or using the
pandas API, as illustrated above.

Querying
~~~~~~~~

As a subclass of the pandas ``DataFrame`` class, GenomicDataFrames can be
queried in the same manner as normal pandas DataFrames using ``loc`` and
``iloc``. However, GenomicDataFrames also provide an additional indexer under
the ``gloc`` property, which can be used to perform ranged queries over
the GenomicDataFrame.

A simple range query can be performed as follows, in which we select any rows
overlapping with bases 79935353-79935455 on chromosome 12:

.. code-block:: python

    gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
    gdf.gloc['12'][79935353:79935455]

In this query, the ``gloc`` indexer returns a slice object for chromosome 12,
which we then slice using the given range coordinates to select only the rows
overlapping the specified range.
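
For intuition, an overlap selection of this kind can be written as a linear
scan over the index levels in plain pandas. This is only a sketch of the query
semantics under the half-open convention, using made-up rows; ``gloc`` itself
relies on an intervaltree rather than a scan:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame.from_records(
        [('12', 100, 200, 'a'),
         ('12', 150, 300, 'b'),
         ('12', 300, 400, 'c'),
         ('10', 150, 250, 'd')],
        columns=['chromosome', 'start', 'end', 'name'])
    df = df.set_index(['chromosome', 'start', 'end'])

    chrom, query_start, query_end = '12', 180, 300

    chroms = df.index.get_level_values('chromosome')
    starts = df.index.get_level_values('start')
    ends = df.index.get_level_values('end')

    # A row overlaps the query if it starts before the query ends and ends
    # after the query starts (both ranges are half-open).
    mask = (chroms == chrom) & (starts < query_end) & (ends > query_start)
    hits = df[mask]

Here row ``c`` is excluded because it starts exactly at the exclusive end of
the query, and row ``d`` because it lies on another chromosome.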

Entire chromosomes can be selected by passing a list of chromosome names to
``gloc``, which also reorders chromosomes to match the given order:

.. code-block:: python

    gdf = gpd.GenomicDataFrame.from_gtf('../data/example.gtf')
    gdf.gloc[['10', '12']]

More complex queries (e.g. with left/right inclusiveness) can be performed using
the ``gloc.search`` method, which provides the above functionality with some
extra options.

Positioning
~~~~~~~~~~~

The ``gloc`` indexer can also be used to extract genomic positions directly,
using the ``chromosome``, ``start`` and ``end`` attributes of the indexer:

.. code-block:: python

    gdf.gloc.chromosome  # Chromosome values.
    gdf.gloc.start       # Start positions.
    gdf.gloc.end         # End positions.

The chromosomes present in the data and their lengths can be accessed through
the ``chromosomes`` and ``chromosome_lengths`` attributes. Note that
chromosome lengths are inferred from the data if these were not given to
the GenomicDataFrame constructor.

The chromosome lengths can be used to calculate offset start/end positions, in
which the lengths of preceding chromosomes are included in the start/end
positions. This can be useful for plotting data linearly across multiple
chromosomes. These offset positions can be accessed using the ``start_offset``
and the ``end_offset`` attributes.
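
As a rough sketch of this calculation in plain Python: the offset of each
chromosome is the summed length of the chromosomes preceding it. The
chromosome lengths and the ``with_offset`` helper below are made up for
illustration and do not reflect how the attributes are implemented:

.. code-block:: python

    from itertools import accumulate

    # Hypothetical chromosome lengths, in plotting order.
    chromosome_lengths = {'1': 1000, '2': 800, '3': 500}

    # Offset of each chromosome = total length of all preceding chromosomes.
    names = list(chromosome_lengths)
    cumulative = list(accumulate(chromosome_lengths.values()))
    offsets = dict(zip(names, [0] + cumulative[:-1]))


    def with_offset(chromosome, position):
        """Translate a per-chromosome position to a genome-wide coordinate."""
        return offsets[chromosome] + position


    with_offset('2', 10)  # 1010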

AnnotatedMatrices
-----------------

The ``AnnotatedMatrix`` class provides functionality for storing a numeric
matrix with features along the rows and samples along the columns, together
with additional metadata describing the samples/features. This format
is ideal for storing data from different types of high-throughput measurements
(such as gene-expression counts or copy number calls) together with the
corresponding sample/feature data.

The base ``AnnotatedMatrix`` class can be used to store values that are indexed
by a set of (named) features, such as gene expression matrices (which contain
counts summarized per gene).

Construction
~~~~~~~~~~~~

The easiest way to construct an AnnotatedMatrix is using a pre-existing
DataFrame. Sample and feature information can be included by passing a DataFrame
using the ``sample_data`` and ``feature_data`` arguments, respectively. Note
that the indices of these annotations should correspond with the matrix
row/column indices:

.. code-block:: python

    from genopandas import AnnotatedMatrix

    df = pd.DataFrame({
            'sample_1': [1, 2, 3],
            'sample_2': [4, 5, 6]
        },
        index=['gene_a', 'gene_b', 'gene_c'])

    sample_data = pd.DataFrame(
        {'condition': ['control', 'treated']},
        index=['sample_1', 'sample_2'])

    matrix = AnnotatedMatrix(df, sample_data=sample_data)

Once constructed, the matrix values can be accessed using the ``values``
property, which returns the matrix in DataFrame format. The sample and feature
annotations can be retrieved using the ``sample_data`` and ``feature_data``
attributes.

Subsetting samples/features
~~~~~~~~~~~~~~~~~~~~~~~~~~~

An AnnotatedMatrix can be subset using the same column/index accessors as
pandas DataFrames (e.g., ``.loc``, ``.iloc`` and ``[]`` for selecting columns).
In this case, the AnnotatedMatrix class ensures that feature/sample annotations
are kept in line with the subset matrix.

Besides this, a number of specialized methods allow subsetting of the matrix
based on the sample/feature annotations. Currently this includes the
``query_samples`` and ``dropna_samples`` methods, which can be used to query
for specific samples or drop samples with NA values in their annotations,
respectively. In general, these methods follow the API of their pandas
equivalents.
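
The bookkeeping involved can be sketched in plain pandas: select samples by
querying the annotation, subset the values, and realign the annotation to the
remaining columns. This is an illustration with made-up data, not the
implementation of ``query_samples``:

.. code-block:: python

    import pandas as pd

    values = pd.DataFrame(
        {'sample_1': [1, 2, 3],
         'sample_2': [4, 5, 6],
         'sample_3': [7, 8, 9]},
        index=['gene_a', 'gene_b', 'gene_c'])

    sample_data = pd.DataFrame(
        {'condition': ['control', 'treated', 'treated']},
        index=['sample_1', 'sample_2', 'sample_3'])

    # Select sample names by querying the annotation.
    treated = sample_data.query("condition == 'treated'").index

    # Subset the values, then realign the annotation to the kept samples.
    subset = values.loc[:, treated]
    subset_samples = sample_data.loc[subset.columns]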

Renaming samples/features
~~~~~~~~~~~~~~~~~~~~~~~~~

Features and/or samples can be renamed with the ``rename`` method, using
the ``index`` parameter for features and the ``columns`` parameter for samples.
The corresponding sample/feature annotations are renamed accordingly.

Melting to 'tidy' format
~~~~~~~~~~~~~~~~~~~~~~~~

Matrices can be 'melted' into a tidy format (a.k.a. long format), which may
be more suitable for certain types of processing/visualization than the matrix
format. This type of transformation is performed using the ``melt`` method,
which returns a 'tidy' pandas DataFrame. Optionally, the parameters
``with_sample_data`` and ``with_feature_data`` can be used to indicate whether
sample/feature annotations should be included in the produced DataFrame.

.. code-block:: python

    import seaborn as sns

    df_long = matrix.melt(with_sample_data=True)
    sns.boxplot(data=df_long, x='condition', y='value')
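
For reference, a comparable tidy table can be built with plain pandas by
melting the values and joining the sample annotation. The column names
``feature``, ``sample`` and ``value`` are assumptions made for this sketch and
may not match the names produced by the ``melt`` method:

.. code-block:: python

    import pandas as pd

    values = pd.DataFrame(
        {'sample_1': [1, 2, 3], 'sample_2': [4, 5, 6]},
        index=pd.Index(['gene_a', 'gene_b', 'gene_c'], name='feature'))

    sample_data = pd.DataFrame(
        {'condition': ['control', 'treated']},
        index=['sample_1', 'sample_2'])

    # One row per (feature, sample) pair.
    tidy = values.reset_index().melt(
        id_vars='feature', var_name='sample', value_name='value')

    # Attach the sample annotation, similar in spirit to with_sample_data=True.
    tidy = tidy.merge(
        sample_data, left_on='sample', right_index=True, how='left')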

Plotting
~~~~~~~~

Several high-level plotting functions are provided for plotting matrix values
in different representations. For example, the ``plot_heatmap`` method is most
useful for plotting a (clustered) overview of the matrix values, with optional
feature/sample annotations:

.. code-block:: python

    matrix.plot_heatmap(sample_cols=['condition'])

Similarly, the ``plot_pca`` method can be used to plot a PCA transform of the
matrix values. This transformation can be performed along either the sample or
feature axes and can be colored according to specific sample/feature
annotations:

.. code-block:: python

    matrix.plot_pca(hue='condition', axis='samples')
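
As background, a PCA along the sample axis can be sketched with NumPy alone.
The example below uses random data and a singular value decomposition; it is
meant to show which axis is treated as observations, and does not describe how
``plot_pca`` is implemented:

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    values = rng.randn(100, 6)  # 100 features x 6 samples

    # Treat samples as observations: transpose and center each feature.
    samples = values.T
    centered = samples - samples.mean(axis=0)

    # PCA via singular value decomposition.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    coords = u[:, :2] * s[:2]            # Sample coordinates on PC1 and PC2.
    explained = s ** 2 / np.sum(s ** 2)  # Fraction of variance per component.

Transforming along the feature axis instead would amount to skipping the
transpose, so that each feature becomes an observation.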

Additionally, the ``plot_feature`` method can be used to create categorical
plots (boxplot, swarmplot or violin plot) of feature values. These plots
can be grouped by sample annotations to compare distributions of feature
values across different sample groups:

.. code-block:: python

    matrix.plot_feature('gene_a', group='condition')

GenomicMatrices
---------------

Similar to the GenomicDataFrame, the ``GenomicMatrix`` class is a specialized
version of the ``AnnotatedMatrix`` class that supports storing and querying of
genomically-positioned data. Besides this, the GenomicMatrix class provides
additional functionality specific to manipulating and plotting
genomically-oriented data.

Construction
~~~~~~~~~~~~

GenomicMatrix instances can be constructed in the same manner as
AnnotatedMatrices, although the matrix values should be supplied as a
GenomicDataFrame (with a MultiIndex containing three levels):

.. code-block:: python

    import numpy as np
    import pandas as pd

    from genopandas import GenomicDataFrame, GenomicMatrix

    # Two chromosomes with 50 bins of 10 bp each.
    data = pd.DataFrame({
        'chromosome': ['1'] * 50 + ['2'] * 50,
        'start': np.hstack([range(0, 500, 10),
                            range(0, 500, 10)]),
        'end': np.hstack([range(10, 510, 10),
                          range(10, 510, 10)]),
        'sample_1': np.hstack([np.random.randn(50),
                               np.random.randn(50) + 10]),
        'sample_2': np.hstack([np.random.randn(20) + 15,
                               np.random.randn(30) + 0,
                               np.random.randn(50) + -10])
    })

    # Move the genomic coordinates into the required three-level index.
    data = data.set_index(['chromosome', 'start', 'end'])

    matrix = GenomicMatrix(GenomicDataFrame(data))
    matrix.head()

Matrix values can also be read from delimited files using the ``from_csv``
method. Besides this, the class also provides a ``from_csv_condensed``
method, which can expand a 'condensed' index (such as ``1:10-20``) to a
multi-level index suitable for GenomicDataFrames. The regex used for this
expansion can be defined using the ``index_regex`` parameter of this method.
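
The kind of expansion involved can be sketched with the standard ``re``
module. The pattern and the ``expand_label`` helper below are hypothetical;
the default regex used by the library may differ:

.. code-block:: python

    import re

    # Hypothetical pattern for labels such as '1:10-20'.
    CONDENSED = re.compile(
        r'^(?P<chromosome>\w+):(?P<start>\d+)-(?P<end>\d+)$')


    def expand_label(label):
        """Split a condensed label into a (chromosome, start, end) tuple."""
        match = CONDENSED.match(label)
        if match is None:
            raise ValueError('Unrecognized label: {!r}'.format(label))
        return (match.group('chromosome'),
                int(match.group('start')),
                int(match.group('end')))


    expand_label('1:10-20')    # ('1', 10, 20)
    expand_label('X:100-250')  # ('X', 100, 250)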

Querying ranges
~~~~~~~~~~~~~~~

Similar to the GenomicDataFrame class, GenomicMatrices can be subset to
specific genomic ranges using the ``gloc`` indexer. For more details, see the
GenomicDataFrame documentation.

Resampling/imputation
~~~~~~~~~~~~~~~~~~~~~

For certain analyses or visualizations, it can be useful to resample a
GenomicMatrix at a lower resolution or using specific bin sizes. The
``resample`` method can be used to resample a matrix to a given ``bin_size``,
optionally starting from a given start position:

.. code-block:: python

    matrix.resample(bin_size=20, start=0)
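
The idea behind resampling can be sketched in plain pandas by assigning each
row to a fixed-size bin and aggregating per bin. The sketch assumes that
values falling into the same bin are averaged, which may not match the
aggregation used by ``resample``:

.. code-block:: python

    import pandas as pd

    index = pd.MultiIndex.from_tuples(
        [('1', 0, 10), ('1', 10, 20), ('1', 20, 30), ('1', 30, 40)],
        names=['chromosome', 'start', 'end'])
    values = pd.DataFrame({'sample_1': [1.0, 3.0, 5.0, 7.0]}, index=index)

    bin_size, start = 20, 0

    # Assign each row to the bin containing its start position.
    flat = values.reset_index()
    flat['start'] = (flat['start'] - start) // bin_size * bin_size + start
    flat['end'] = flat['start'] + bin_size

    # Assumption: values falling into the same bin are averaged.
    resampled = flat.groupby(['chromosome', 'start', 'end']).mean()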

Missing values can be imputed from surrounding bins using the ``impute``
method, which uses the rolling median functionality from pandas:

.. code-block:: python

    matrix.impute(window=11, min_probes=5)
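
A rolling-median imputation of this kind can be sketched with pandas directly.
The sketch assumes a centered window and maps the minimum number of observed
values onto the ``min_periods`` argument of ``rolling``; whether ``impute``
uses the same settings is an assumption:

.. code-block:: python

    import numpy as np
    import pandas as pd

    values = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

    # Centered rolling median; windows with fewer than `min_periods`
    # observed values yield NaN.
    medians = values.rolling(window=5, min_periods=3, center=True).median()

    # Only fill the gaps; observed values are left untouched.
    imputed = values.fillna(medians)

Gaps surrounded by too few observed values stay missing, which avoids
imputing from very sparse neighbourhoods.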

Plotting
~~~~~~~~

Similar to the AnnotatedMatrix class, the GenomicMatrix class provides a
``plot_heatmap`` method for plotting a heatmap of matrix values along a genomic
axis:

.. code-block:: python

    resampled = matrix.resample(bin_size=20, start=0)
    resampled.plot_heatmap()

The ``plot_sample`` method can be used to plot values for a single sample:

.. code-block:: python

    matrix.plot_sample('sample_1', markersize=5)


Specialized matrices
--------------------

Besides the basic ``AnnotatedMatrix`` and ``GenomicMatrix`` classes, a number
of more specialized matrix subclasses are provided in the ``genopandas.ngs``
module. This currently includes the ``CnvValueMatrix`` and ``CnvCallMatrix``
classes for CNV data and the ``ExpressionMatrix`` class for RNA-seq expression
data. See the respective class documentation for more information on
class-specific functionality.