API¶
API documentation for the various genopandas modules/classes.
core¶
frame¶
-
class
genopandas.core.frame.
GenomicDataFrame
(*args, chrom_lengths=None, **kwargs)[source]¶ DataFrame with fast indexing by genomic position.
Requires columns ‘chromosome’, ‘start’ and ‘end’ to be present in the DataFrame, as these columns are used for indexing.
Examples
Constructing from scratch:
>>> import pandas as pd
>>> df = pd.DataFrame.from_records(
...     [('1', 10, 20, 'a'), ('2', 10, 20, 'b'), ('2', 30, 40, 'c')],
...     columns=['chromosome', 'start', 'end', 'name'])
>>> df = df.set_index(['chromosome', 'start', 'end'])
>>> GenomicDataFrame(df)
Reading from a GTF file:
>>> GenomicDataFrame.from_gtf('/path/to/reference.gtf')
Querying by a genomic range:
>>> genomic_df.gloc['2'][30:50]
-
chromosome_lengths
¶ Chromosome lengths.
-
chromosome_offsets
¶ Chromosome offsets (used when plotting chromosomes linearly).
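A minimal sketch of what linear plotting offsets could look like, assuming each offset is the summed length of the chromosomes that precede it. The helper name and the example lengths are hypothetical and are not taken from genopandas.

```python
from itertools import accumulate


def chromosome_offsets(chrom_lengths):
    """Map each chromosome to the summed length of the chromosomes before it."""
    names = list(chrom_lengths)
    # Running totals, starting at 0 for the first chromosome.
    starts = [0] + list(accumulate(chrom_lengths[name] for name in names))
    return dict(zip(names, starts))


# Hypothetical lengths, for illustration only.
lengths = {'1': 1000, '2': 800, '3': 500}
offsets = chromosome_offsets(lengths)

# A position on chromosome '2' shifted onto the shared linear axis.
linear_position = offsets['2'] + 30
```

Adding the offset to a start, end or mid position places every chromosome on one continuous axis, which is presumably what the *_offset properties of the indexer expose.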
-
classmethod
from_csv
(file_path, index_col, drop_index_col=True, chrom_lengths=None, **kwargs)[source]¶ Creates a GenomicDataFrame from a csv file using pandas.read_csv.
Parameters: - file_path (str) – Path to file.
- index_col (List[str]) – Columns to use for index. Columns should be indicated by their name. Should contain two entries for positioned data, three entries for ranged data. If not given, the first three columns of the DataFrame are used by default.
- drop_index_col (bool) – Whether to drop the index columns from the DataFrame (True, default) or to keep them in the DataFrame (False).
- chrom_lengths (Dict[str, int]) – Chromosome lengths.
- **kwargs – Any extra kwargs are passed to pandas.read_csv.
Returns: DataFrame containing the file contents.
Return type:
-
classmethod
from_df
(df, **kwargs)[source]¶ Constructs instance from dataframe containing ranged/positioned data.
-
classmethod
from_position_df
(df, width=1, **kwargs)[source]¶ Constructs instance from positioned dataframe.
-
classmethod
from_records
(records, index_col, columns=None, drop_index_col=True, chrom_lengths=None, **kwargs)[source]¶ Creates a GenomicDataFrame from a structured or record ndarray.
-
gloc
¶ Genomic indexer for querying the dataframe.
-
-
class
genopandas.core.frame.
GenomicIndexer
(gdf)[source]¶ Base GenomicIndexer class used to index GenomicDataFrames.
-
chromosome
¶ Chromosome values.
-
chromosome_lengths
¶ Chromosome lengths.
-
chromosome_offsets
¶ Chromosome offsets.
-
chromosomes
¶ Available chromosomes.
-
end
¶ End positions.
-
end_offset
¶ End positions, offset by chromosome lengths.
-
gdf
¶ The indexed DataFrame.
-
position
¶ Mid positions (between start/end).
Should correspond with the original positions for expanded positioned data.
-
position_offset
¶ Mid positions (see position), offset by chromosome lengths.
-
search
(chromosome, start, end, strict_left=False, strict_right=False)[source]¶ Searches the DataFrame for rows within given range.
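The following sketch illustrates one plausible reading of the search parameters. The meaning of strict_left and strict_right is not spelled out above, so the sketch assumes they reject rows that extend past the query boundary on that side. It also uses a plain linear scan over half-open intervals rather than the interval trees the indexer keeps.

```python
def search(rows, chromosome, start, end, strict_left=False, strict_right=False):
    """Return rows on `chromosome` that overlap the half-open range [start, end)."""
    hits = []
    for row_chrom, row_start, row_end, payload in rows:
        if row_chrom != chromosome:
            continue
        # Plain overlap test for half-open intervals.
        if row_start >= end or row_end <= start:
            continue
        # Assumed meaning of the strict flags: reject rows that stick out
        # past the query boundary on that side.
        if strict_left and row_start < start:
            continue
        if strict_right and row_end > end:
            continue
        hits.append((row_chrom, row_start, row_end, payload))
    return hits


rows = [('1', 10, 20, 'a'), ('2', 10, 20, 'b'), ('2', 30, 40, 'c')]
overlapping = search(rows, '2', 15, 35)
contained = search(rows, '2', 15, 35, strict_left=True, strict_right=True)
```

Under these assumptions, overlapping holds both rows on chromosome '2', while contained is empty because each row crosses one of the query boundaries.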
-
start
¶ Start positions.
-
start_offset
¶ Start positions, offset by chromosome lengths.
-
trees
¶ Trees used for indexing the DataFrame.
-
matrix¶
-
class
genopandas.core.matrix.
AnnotatedMatrix
(values, sample_data=None, feature_data=None)[source]¶ AnnotatedMatrix class.
Annotated matrix classes represent 2D numeric feature-by-sample matrices (with ‘features’ along the rows and samples along the columns), which can be annotated with optional sample_data and feature_data frames that describe the samples and features. The type of feature varies between different sub-classes, examples being genes (for gene expression matrices) and region-based bins (for copy-number data).
This (base) class mainly contains a variety of methods for querying, subsetting and combining different annotated matrices. General plotting methods are also provided (plot_heatmap).
Note that the class follows the feature-by-sample convention that is typically followed in biological packages, rather than the sample-by-feature orientation. This has the additional advantage of allowing more complex indices (such as a region-based MultiIndex) for the features, which are more difficult to use for DataFrame columns than for rows.
-
values
¶ pd.DataFrame or AnnotatedMatrix – Matrix values.
-
sample_data
¶ pd.DataFrame – DataFrame containing sample annotations, whose index corresponds with the columns of the matrix.
-
feature_data
¶ pd.DataFrame – DataFrame containing feature annotations, whose index corresponds with the rows of the matrix.
-
drop_duplicate_indices
(axis='index', keep='first')[source]¶ Drops duplicate indices along given axis.
-
melt
(with_sample_data=False, with_feature_data=False, value_name='value')[source]¶ Melts values into ‘tidy’ format, optionally including annotation.
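As an illustration of the ‘tidy’ layout, the sketch below turns a small feature-by-sample mapping into one record per feature/sample pair. It uses plain dictionaries instead of an AnnotatedMatrix, and the 'feature' and 'sample' keys are assumed names rather than the columns the method necessarily produces.

```python
def melt(matrix, value_name='value'):
    """Turn {feature: {sample: value}} into one record per feature/sample pair."""
    return [
        {'feature': feature, 'sample': sample, value_name: value}
        for feature, row in matrix.items()
        for sample, value in row.items()
    ]


# Hypothetical 2x2 matrix, for illustration only.
matrix = {'geneA': {'s1': 1.0, 's2': 2.0}, 'geneB': {'s1': 3.0, 's2': 4.0}}
tidy = melt(matrix)
```

Annotation columns would then be joined onto these records by sample or feature, which is presumably what with_sample_data and with_feature_data control.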
-
pca
(n_components=None, axis='columns', transform=False, with_annotation=False)[source]¶ Performs PCA on matrix.
-
plot_feature
(feature, group=None, kind='box', ax=None, **kwargs)[source]¶ Plots distribution of expression for given feature.
-
plot_heatmap
(cmap='RdBu_r', sample_cols=None, sample_colors=None, feature_cols=None, feature_colors=None, metric='euclidean', method='complete', transpose=False, **kwargs)[source]¶ Plots clustered heatmap of matrix values.
-
plot_pca_variance
(n_components=None, axis='columns', ax=None)[source]¶ Plots variance explained by PCA components.
-
query_samples
(expr)[source]¶ Subsets samples in matrix by querying sample_data with expression.
Similar to the pandas query method, this method queries the sample data of the matrix with the given boolean expression. Any samples for which the expression evaluates to True are returned in the resulting AnnotatedMatrix.
Parameters: expr (str) – The query string to evaluate. You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.
Returns: Subsetted matrix, containing only the samples for which expr evaluates to True.
Return type: AnnotatedMatrix
-
-
class
genopandas.core.matrix.
GenomicMatrix
(values, sample_data=None, feature_data=None)[source]¶ Class representing matrices indexed by genomic positions.
-
expand
()[source]¶ Expands matrix to include values from missing bins.
Assumes rows are regularly spaced with a fixed bin size.
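A rough sketch of the idea behind expanding, assuming that missing bins are filled with NaN and that only gaps between the first and last observed bin of each chromosome are filled. Both assumptions, and the helper itself, are illustrative and may differ from what the method does.

```python
import math


def expand_bins(values, bin_size):
    """Fill gaps between the first and last bin of each chromosome with NaN."""
    by_chrom = {}
    for (chrom, start, end), value in values.items():
        by_chrom.setdefault(chrom, {})[start] = value

    expanded = {}
    for chrom, bins in by_chrom.items():
        # Walk from the first to the last observed bin in fixed steps.
        for start in range(min(bins), max(bins) + bin_size, bin_size):
            expanded[(chrom, start, start + bin_size)] = bins.get(start, math.nan)
    return expanded


# Two observed bins with a 200 bp gap between them (hypothetical data).
values = {('1', 0, 100): 0.5, ('1', 300, 400): -1.0}
expanded = expand_bins(values, bin_size=100)
```

The fixed bin_size is what makes the gaps recoverable, which is why the method assumes regularly spaced rows.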
-
classmethod
from_csv
(file_path, index_col, sample_data=None, feature_data=None, sample_mapping=None, feature_mapping=None, drop_cols=None, chrom_lengths=None, read_data_kws=None, **kwargs)[source]¶ Reads values from a csv file.
-
classmethod
from_csv_condensed
(file_path, index_col=0, sample_data=None, feature_data=None, sample_mapping=None, feature_mapping=None, drop_cols=None, chrom_lengths=None, index_regex='(?P<chromosome>\\w+):(?P<start>\\d+)-(?P<end>\\d+)', is_one_based=False, is_inclusive=False, read_data_kws=None, **kwargs)[source]¶ Reads values from a csv file with a condensed index.
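The default index_regex can be tried on its own with the standard re module. The sketch below parses a condensed label and then converts it to zero-based, half-open coordinates. That conversion target is an assumption about what is_one_based and is_inclusive are for, and parse_condensed is a hypothetical helper rather than part of genopandas.

```python
import re

# Default pattern from the from_csv_condensed signature.
INDEX_REGEX = r'(?P<chromosome>\w+):(?P<start>\d+)-(?P<end>\d+)'


def parse_condensed(label, is_one_based=False, is_inclusive=False):
    """Split a label such as 'chr1:100-200' into (chromosome, start, end)."""
    match = re.fullmatch(INDEX_REGEX, label)
    if match is None:
        raise ValueError(f'Cannot parse index label: {label!r}')
    start, end = int(match['start']), int(match['end'])
    # Assumed conversion target: zero-based, half-open coordinates.
    if is_one_based:
        start -= 1
        end -= 1
    if is_inclusive:
        end += 1
    return match['chromosome'], start, end


as_given = parse_condensed('chr1:100-200')
converted = parse_condensed('chr1:1-100', is_one_based=True, is_inclusive=True)
```

A custom index_regex would need the same three named groups (chromosome, start, end) for this kind of parsing to work.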
-
classmethod
from_df
(values, chrom_lengths=None, **kwargs)[source]¶ Constructs a genomic matrix from the given DataFrame.
-
gloc
¶ Genomic-position indexer.
Used to select rows from the matrix by their genomic position. Interface is the same as for the GenomicDataFrame gloc property (which this method delegates to).
-
tree¶
-
class
genopandas.core.tree.
GenomicIntervalTree
(*args, **kwargs)[source]¶ Data structure for efficiently accessing genomic objects by position.
-
classmethod
from_tuples
(tuples)[source]¶ Builds an instance from tuples.
Assumes tuples are sorted by chromosome.
-
classmethod
ngs¶
cnv¶
-
class
genopandas.ngs.cnv.
CnvValueMatrix
(values, sample_data=None, feature_data=None)[source]¶ CnvMatrix containing (segmented) logratio values (positions-by-samples).
-
as_segments
(drop_columns=True)[source]¶ Returns matrix as segments (consecutive stretches with same value).
Assumes that values have already been segmented, i.e. that bins in the same segment have been assigned same numeric value.
Parameters: drop_columns (bool) – Whether to drop chromosome, start, end and sample columns after setting the index.
Returns: GenomicDataFrame describing genomic segments. Indexed by chromosome, start, end and sample. Note that the sample index is included to avoid duplicate index errors when reindexing in cases where samples have identical segments.
Return type: GenomicDataFrame
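The segment idea can be sketched for a single sample with itertools.groupby: consecutive bins on the same chromosome that share a value collapse into one segment. This is an illustration on plain tuples with hypothetical data, not the method's implementation, and it leaves out the per-sample index described above.

```python
from itertools import groupby


def as_segments(bins):
    """Collapse sorted (chromosome, start, end, value) bins into segments."""
    segments = []
    # Consecutive bins with the same chromosome and value form one segment.
    for (chrom, value), run in groupby(bins, key=lambda b: (b[0], b[3])):
        run = list(run)
        segments.append((chrom, run[0][1], run[-1][2], value))
    return segments


bins = [
    ('1', 0, 100, 0.0),
    ('1', 100, 200, 0.0),
    ('1', 200, 300, -1.2),
    ('2', 0, 100, -1.2),
]
segments = as_segments(bins)
```

Grouping on the value only works because segmentation has already assigned identical numbers to bins in the same segment, as the method assumes.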
-
-
class
genopandas.ngs.cnv.
CnvCallMatrix
(values, sample_data=None, feature_data=None)[source]¶ Cnv matrix containing CNV calls (genes-by-samples).
-
mask_with_controls
(column, mask_value=0.0)[source]¶ Masks calls present in control samples.
Calls are retained if (a) no call is present in the matched control sample, (b) the sample call is more extreme than the control call, or (c) the sample and control calls have different signs (loss/gain).
Matched control samples should be indicated by the given column in the sample_data annotation.
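The three retention rules can be written out for a single sample/control pair. The sketch assumes calls are signed numbers where 0 means no call, and that "more extreme" means a larger absolute value. Both are interpretations, and mask_call is a hypothetical helper.

```python
def mask_call(sample_call, control_call, mask_value=0.0):
    """Return the sample call, or mask_value if the control explains it."""
    no_control_call = control_call == 0          # rule (a)
    more_extreme = abs(sample_call) > abs(control_call)  # rule (b)
    opposite_sign = sample_call * control_call < 0       # rule (c)
    if no_control_call or more_extreme or opposite_sign:
        return sample_call
    return mask_value


kept_no_control = mask_call(-1, 0)
kept_more_extreme = mask_call(-2, -1)
kept_opposite_sign = mask_call(1, -1)
masked = mask_call(1, 1)
```

Only the last case is masked, since the control carries a call of the same sign that is at least as strong.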
-
rna¶
-
class
genopandas.ngs.rna.
ExpressionMatrix
(values, sample_data=None, feature_data=None)[source]¶ Matrix containing (gene) expression values (features-by-samples).
-
classmethod
from_subread
(file_path, sample_data=None, sample_mapping=None, **kwargs)[source]¶ Reads expression from a subread output file.
-
normalize
(size_factors=None, log2=False)[source]¶ Normalizes expression counts for sequencing depth.
Normalizes by dividing sample counts using the given (sample) size factors. If no size factors are given, they are calculated using the median-of-ratios approach used by DESeq2.
Parameters: - size_factors (np.array) – Array of size factors, length should be equal to the number of samples.
- log2 (bool) – Whether to also log2-transform the normalized counts.
Returns: ExpressionMatrix containing normalized counts.
Return type:
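For reference, the median-of-ratios approach can be sketched in plain Python. Each sample's size factor is the median, over genes, of its count divided by the gene's geometric mean across samples. Genes with a zero count are skipped here, and the helper names and example counts are hypothetical. The library's own implementation may differ in details.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes-by-samples list of rows."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # Genes with a zero count have no finite geometric mean; skip them.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide each sample's counts by its size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Second sample sequenced at twice the depth of the first (hypothetical data).
counts = [[10, 20], [100, 200], [0, 5], [30, 60]]
factors = size_factors(counts)
normalized = normalize(counts, factors)
```

After dividing by the size factors, genes that differ only because of sequencing depth end up with equal values in both samples.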
-
classmethod