Usage

Datasets

The main interface of pybiomart is provided by the Dataset class. A Dataset instance can be constructed directly if the name of the dataset and the URL of the host are known:

>>> from pybiomart import Dataset
>>> dataset = Dataset(name='hsapiens_gene_ensembl',
...                   host='http://www.ensembl.org')

Querying

Dataset instances can be used to query the biomart server through their query method. This method takes an optional attributes argument, which specifies the attributes to be retrieved:

>>> dataset.query(attributes=['ensembl_gene_id', 'external_gene_name'])

The query method returns a pandas DataFrame containing the requested attributes. If no attributes are given, the default attributes of the dataset are used. These default attributes can be identified using the default_attributes property of the dataset. All available attributes can be obtained from the attributes property. Alternatively, a more convenient overview of all attributes can be obtained in DataFrame format using the list_attributes method.
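
Because a misspelled attribute name is only rejected once the query reaches the server, it can be useful to check names locally first. The following is a minimal sketch, not part of pybiomart: check_attributes is a hypothetical helper, and it assumes only that the attributes property supports membership tests by attribute name.

```python
def check_attributes(dataset, requested):
    """Split requested attribute names into known and unknown ones.

    Hypothetical helper, not part of pybiomart. Assumes that
    `dataset.attributes` supports `name in dataset.attributes`.
    """
    available = dataset.attributes
    known = [name for name in requested if name in available]
    unknown = [name for name in requested if name not in available]
    return known, unknown
```

The known names can then be passed to dataset.query, while the unknown ones can be reported to the user before any request is made.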

Filtering

Dataset queries can be filtered to avoid fetching unneeded data from the server, thereby reducing the size of the result (and the required bandwidth):

>>> dataset.query(attributes=['ensembl_gene_id', 'external_gene_name'],
...               filters={'chromosome_name': ['1', '2']})

The available filters depend on the dataset. All available filters can be accessed using the filters property or the list_filters method, the latter of which returns an overview of available filters in a DataFrame format. The type of a filter describes what kind of values can be provided for a filter. For example, boolean filters require a boolean value, string filters require a string value, whilst list filters can take a list of values.
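
The relationship between filter types and accepted values can be sketched as a small check. This is an illustration of the rule described above, not pybiomart code: value_matches_filter_type is a hypothetical helper, and the type labels 'boolean', 'string' and 'list' are assumptions taken from the prose; the labels actually reported by a given server may differ.

```python
def value_matches_filter_type(filter_type, value):
    """Return True if `value` is acceptable for a filter of `filter_type`.

    Hypothetical helper. The type labels are assumptions based on the
    description above, not names guaranteed by pybiomart or the server.
    """
    if filter_type == 'boolean':
        return isinstance(value, bool)
    if filter_type == 'string':
        return isinstance(value, str)
    if filter_type == 'list':
        # A list filter accepts several values; a single string is
        # treated here as a one-element list.
        return isinstance(value, (str, list, tuple))
    raise ValueError('unknown filter type: {!r}'.format(filter_type))
```

For example, the value ['1', '2'] is acceptable for a list filter but would be rejected for a string filter.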

Servers and Marts

If the exact dataset is not known, the Server and Mart classes can be used to explore the available marts and datasets on a biomart server. A Server instance can be constructed using an optional host URL (the URL http://www.biomart.org is used by default). This instance can then be used to identify all available marts, either via the marts property or the list_marts method:

>>> from pybiomart import Server
>>> server = Server(host='http://www.ensembl.org')
>>> server.list_marts()

Marts can be accessed by using the mart name as an index for the marts property, or directly as an index on the server instance. The mart instance can then similarly be used to identify the datasets available in the mart, using the mart's datasets property or its list_datasets method:

>>> mart = server['ENSEMBL_MART_ENSEMBL']
>>> mart.list_datasets()

Datasets can be retrieved from a mart instance by using the dataset name as an index on the mart object, or alternatively as an index for its datasets property:

>>> dataset = mart['hsapiens_gene_ensembl']
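
The exploration steps above can be combined into a simple search over all marts on a server. The following is a minimal sketch: find_datasets is a hypothetical helper, not part of pybiomart, and it assumes that the marts and datasets properties behave like mappings keyed by name, as the indexing described above suggests.

```python
def find_datasets(server, keyword):
    """Return (mart_name, dataset_name) pairs whose dataset name
    contains `keyword`.

    Hypothetical helper, not part of pybiomart. Assumes `server.marts`
    and `mart.datasets` are mappings keyed by name.
    """
    matches = []
    for mart_name in server.marts:
        mart = server.marts[mart_name]
        for dataset_name in mart.datasets:
            if keyword in dataset_name:
                matches.append((mart_name, dataset_name))
    return matches
```

Calling find_datasets(server, 'hsapiens') would list every human dataset together with the mart that provides it. Against a real server this is likely to issue one request per mart, so it may be slow on hosts with many marts.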