Diff for "DataProviders"

Differences between revisions 14 and 15
Revision 14 as of 2014-01-29 17:18:52
Size: 7682
Editor: CarlEberhard
Revision 15 as of 2014-01-29 17:41:50
Size: 8324
Editor: CarlEberhard

This is a work in progress.

DataProviders are a framework for easily controlling how data can be provided from a source (generally, a dataset's file contents). They are meant to be:

  1. Simple to declare, configure, and associate with some source of data.
  2. Composable: simplicity is maintained by allowing piping, that is, sending one provider through one or more others until the final, desired format/query is provided.
  3. Fast and efficient, by allowing narrow queries that provide only specified amounts of data from specified locations.

They are not meant to be:

  1. Replacements for the tools and workflows through which Galaxy provides reproducibility.
  2. Writable in any sense. They can't alter the original data in any permanent fashion.

Essentially, they are meant to be a 'view' of your data and not the data themselves.

Currently, data providers are only available for the file contents of a dataset.

How to get data using DataProviders

Currently, there are two entry points to data providers for datasets:

  • Programmatically (typically in a visualization template or other Python script) by calling a datatype's dataprovider method and passing in (at a minimum) the dataset and the name of a particular format/dataprovider

  • Via the datasets API (which itself calls the dataprovider method)


If a datatype does not have the provider assigned to the given name, an error is raised. If it does, any additional parameters to the method are parsed and a python generator is returned. This generator will yield individual data based on the type of provider (and the additional arguments).

For example, given dataset contents in a tabular file called 'dataset1':

# yet another data format
1   10  11  110
2   20  22  220

3   30  33  330

within a visualization template or python script, one could get each line as an array of columnar data by calling:

    for array in dataset1.datatype.dataprovider( dataset1, 'column' ):
        print( array )
    # [ "1", "10", "11", "110" ]
    # [ "2", "20", "22", "220" ]
    # [ "3", "30", "33", "330" ]

Note: When using text file datatypes, both comments (lines beginning with '#' - although this is configurable) and blank lines are stripped from the output.

Pass in additional arguments to filter or configure output by adding keyword arguments to the provider. For example, to limit the above to only two lines and offset by one line:

    for array in dataset1.datatype.dataprovider( dataset1, 'column', limit=2, offset=1 ):
        print( array )
    # [ "2", "20", "22", "220" ]
    # [ "3", "30", "33", "330" ]
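The effect of limit and offset can be pictured with a small stand-in generator. This is only a sketch written for Python 3: column_rows is a hypothetical helper, not Galaxy's implementation. It assumes that comments and blank lines are dropped before offset and limit are counted (which is consistent with the output above) and, for simplicity, it splits on any whitespace.

```python
from itertools import islice


def column_rows(lines, limit=None, offset=0, comment_char='#'):
    """Hypothetical stand-in for a 'column' provider (not Galaxy code)."""
    def rows():
        for line in lines:
            line = line.strip()
            # assumption: blank and comment lines are dropped before counting
            if not line or line.startswith(comment_char):
                continue
            yield line.split()
    # skip `offset` rows, then stop after `limit` rows (if a limit is given)
    stop = None if limit is None else offset + limit
    return islice(rows(), offset, stop)


contents = [
    '# yet another data format',
    '1   10  11  110',
    '2   20  22  220',
    '',
    '3   30  33  330',
]
result = list(column_rows(contents, limit=2, offset=1))
print(result)
```

Because the helper returns a lazy iterator, rows beyond the requested window are never parsed, which is the point of narrow queries.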

The datasets API

You can access data providers for a dataset via the datasets API by passing the provider name as an argument (for more information on how to use the API see Learn/API).

    curl 'http://localhost:8080/api/datasets/86cf1d3beeec9f1c?data_type=raw_data&provider=column&limit=2&offset=1&api_key=cf8245802b54146014108216e815d6e4'
    {
        "data": [
            [
                "2",
                "20",
                "22",
                "220"
            ],
            [
                "3",
                "30",
                "33",
                "330"
            ]
        ]
    }

Note that the API returns a JSON-formatted object and that the array of data is an attribute of that object named 'data'. This allows the API to send additional information (such as the number of datapoints, metadata, aggregate information, etc.) as other attributes if needed.
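For a Python client, the same request and response could be handled roughly as follows. This is a sketch for Python 3 using only the standard library; the response body is pasted in as a string (copied from the curl output above) so that no running server is needed, and the API key is left out of the query.

```python
import json
from urllib.parse import urlencode

# Query parameters taken from the curl example above (API key omitted here).
params = {
    'data_type': 'raw_data',
    'provider': 'column',
    'limit': 2,
    'offset': 1,
}
url = 'http://localhost:8080/api/datasets/86cf1d3beeec9f1c?' + urlencode(params)

# The response body shown above, as a string instead of a live request.
body = '{"data": [["2", "20", "22", "220"], ["3", "30", "33", "330"]]}'
response = json.loads(body)

# The rows live under the 'data' attribute of the returned object.
rows = response['data']
for row in rows:
    print(row)
```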

Commonly, the API will be accessed by a JavaScript client (e.g. the browser window in a visualization). For example, getting data through the API with jQuery:

    var xhr = jQuery.getJSON( "/api/datasets/86cf1d3beeec9f1c", {
        data_type : 'raw_data',
        provider  : 'column',
        limit     : 2,
        offset    : 1
    });
    xhr.done( function( response ){
        console.debug( response.data );
        // [["2", "20", "22", "220"], ["3", "30", "33", "330"]]
        // ...do something with data
    });

There are many data providers included in Galaxy already.

For all formats:

  • base: reads directly from the file without any formatting or filtering
  • chunk: allows breaking file contents into sections of chunk_size bytes and offsetting using chunk_index

  • chunk64: as chunk, but encodes each section using base64 encoding
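The chunk and chunk64 providers can be pictured with the sketch below. read_chunk is a hypothetical helper, not Galaxy code, and it assumes that chunk_index is a 0-based section number (so the byte offset is chunk_index * chunk_size); the real providers may define the offset differently.

```python
import base64


def read_chunk(data, chunk_size, chunk_index=0, encode64=False):
    """Hypothetical helper: return section number chunk_index of data."""
    # assumption: sections are numbered from 0 and are chunk_size bytes long
    start = chunk_index * chunk_size
    chunk = data[start:start + chunk_size]
    # chunk64-style output: the same bytes, base64 encoded
    return base64.b64encode(chunk) if encode64 else chunk


data = b'1\t10\t11\t110\n2\t20\t22\t220\n'
first = read_chunk(data, chunk_size=8, chunk_index=0)
second = read_chunk(data, chunk_size=8, chunk_index=1)
encoded = read_chunk(data, chunk_size=8, chunk_index=1, encode64=True)
print(first, second, encoded)
```

Base64 output is useful when the bytes have to travel inside a JSON response, since arbitrary binary data cannot be embedded in JSON strings directly.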

For text based formats:

  • line: reads each line from a file, allowing limit, offset, and filter functions (in Python). By default it removes blank lines, removes commented lines, and strips whitespace from the beginning and end of each line (all configurable).
  • regex-line: as the 'line' provider above, but also accepts regex_list, a list of Python regex strings. A line is output if it matches one or more of the expressions in regex_list. The match can be inverted using invert.
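The regex_list and invert options can be sketched as a generator. regex_lines is a hypothetical stand-in, not the Galaxy provider, and it assumes that patterns are applied with re.search; the sample lines are made up for illustration.

```python
import re


def regex_lines(lines, regex_list, invert=False):
    """Hypothetical stand-in for a 'regex-line' provider (not Galaxy code)."""
    patterns = [re.compile(regex) for regex in regex_list]
    for line in lines:
        # a line counts as matched if any one of the patterns matches it
        matched = any(p.search(line) for p in patterns)
        # emit matching lines, or only the non-matching ones when invert is set
        if matched != invert:
            yield line


lines = ['chr1\t100\t200', 'chr10\t180404\t300577', 'chrX\t5\t50']
regex_list = [r'^chr10\b', r'^chrX\b']
kept = list(regex_lines(lines, regex_list))
dropped = list(regex_lines(lines, regex_list, invert=True))
print(kept, dropped)
```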

For tabular formats:

  • column: lines are returned as arrays of column data (as above). Many options are available for this provider including:
    • indeces: return only the columns specified by a 0-based, comma separated list of integers (e.g. '0,2,5')
    • column_count: return only the first N columns from each line
    • deliminator: defaults to the tab character but can be used to parse comma separated data as well
    • column_types: a CSV string of python primitive names used to parse each column (e.g. 'str,int,float,bool'). Can work in tandem with indeces to parse only those columns requested.
  • dict: return each line as a dictionary. Keys used should be sent as column_names, a CSV list of strings (e.g. column_names='id,start,end'). Based on the column provider and allows all options used there.
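The way indeces, column_types, and column_names combine, and the way the dict provider builds on the column provider, can be sketched as two chained generators. Both functions are hypothetical illustrations rather than Galaxy code; they take Python lists instead of CSV strings, handle only str, int, and float, and the fourth column in the sample lines is made up.

```python
def parse_columns(lines, indeces=None, column_types=None, deliminator='\t'):
    """Hypothetical stand-in for a 'column' provider (not Galaxy code)."""
    parsers = {'str': str, 'int': int, 'float': float}
    for line in lines:
        columns = line.split(deliminator)
        if indeces is not None:
            # keep only the requested (0-based) columns, in the order given
            columns = [columns[i] for i in indeces]
        if column_types is not None:
            # parse each remaining column with the matching primitive type
            columns = [parsers[t](c) for t, c in zip(column_types, columns)]
        yield columns


def dict_rows(lines, column_names, **column_options):
    """Hypothetical stand-in for a 'dict' provider built on parse_columns."""
    # piping: this generator consumes the output of the column generator
    for columns in parse_columns(lines, **column_options):
        yield dict(zip(column_names, columns))


lines = ['chr10\t180404\t300577\ta', 'chr10\t180423\t295729\tb']
regions = list(dict_rows(lines, ['chrom', 'start', 'end'],
                         indeces=[0, 1, 2],
                         column_types=['str', 'int', 'int']))
print(regions)
```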

Dataset specific providers for text and tabular formats:

  • You'll often see these used in the built-in providers (e.g. those found in datatypes/tabular.py). They attempt to infer the proper column settings for other providers by using a dataset's metadata (column names, types, etc.)

  • dataset-column: as the column provider, but infers column_types from metadata (for easier parsing)

  • dataset-dict: as the dict provider, but infers column_names from metadata. Names can be overridden by still passing in column_names.

For interval datatypes:

  • genomic-region and genomic-region-dict: parse and return chromosome, start, and end data as arrays or dictionaries respectively.
  • interval and interval-dict: as genomic-region, but also return strand and name if set in the metadata.

Other providers can be found within the datatype class definitions for datatypes included in Galaxy.

How to filter and format data using DataProviders

Although still a work in progress, you can use several aspects of existing data providers to easily filter data either with Python or through the API.

Python is currently the more powerful option. For example, in a visualization template one could pass a filter function to a 'genomic-region-dict' provider:

    def filter_chr10( datum ):
        # the filter receives each line; return it to keep it, or None to drop it
        chrom = datum.split( '\t' )[0]
        if chrom == 'chr10':
            return datum
        return None
    for region in hda.datatype.dataprovider( hda, 'genomic-region-dict', limit=2, offset=1, filter_fn=filter_chr10 ):
        print( region )
    # {'start': 180404, 'end': 300577, 'chrom': 'chr10'}
    # {'start': 180423, 'end': 295729, 'chrom': 'chr10'}
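The filter function contract used above (return the datum to keep it, return None to drop it) can be tried outside Galaxy with a plain generator. filtered_lines is a hypothetical stand-in for the provider, and the sample lines are made up except for the two chr10 regions shown above.

```python
def filtered_lines(lines, filter_fn=None):
    """Hypothetical stand-in showing one way a filter_fn could be applied."""
    for line in lines:
        datum = filter_fn(line) if filter_fn is not None else line
        # a filter returns the datum to keep it, or None to drop it
        if datum is not None:
            yield datum


def filter_chr10(datum):
    chrom = datum.split('\t')[0]
    return datum if chrom == 'chr10' else None


lines = [
    'chr1\t100\t200',
    'chr10\t180404\t300577',
    'chr2\t5\t50',
    'chr10\t180423\t295729',
]
result = list(filtered_lines(lines, filter_fn=filter_chr10))
print(result)
```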

Through the API, (currently) the easiest way is to use the regex_list argument:

    var xhr = jQuery.getJSON( "/api/datasets/86cf1d3beeec9f1c", {
        data_type  : 'raw_data',
        provider   : 'genomic-region-dict',
        limit      : 2,
        offset     : 1,
        regex_list : '^chr10\\b'
    });
    xhr.done( function( response ){
        console.debug( response.data );
        // [Object { chrom="chr10", end=300577, start=180404}, Object { chrom="chr10", end=295729, start=180423}]
    });

(Note the double backslash in '\\b': escaping the backslash lets us send the regex with a proper, final '\b' (a word boundary) rather than the ASCII backspace character that '\b' denotes inside a JavaScript string.)
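The same pitfall exists in Python string literals, so it can be checked locally. The sketch below (Python 3, standard library only) shows that an unescaped '\b' is the backspace character, while '\\b' or a raw string reaches the regex engine as a word boundary; the second test line is made up.

```python
import re

# In a normal Python string literal, '\b' is the backspace character (0x08).
assert '\b' == '\x08'

# Escaping the backslash (or using a raw string) keeps the two characters
# backslash + b, which the regex engine reads as a word boundary.
pattern = '^chr10\\b'
assert pattern == r'^chr10\b'

# 'chr10' followed by a tab ends at a word boundary; 'chr100' does not.
matches = bool(re.search(pattern, 'chr10\t180404\t300577'))
no_match = bool(re.search(pattern, 'chr100\t1\t2'))
print(matches, no_match)
```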

How to define a new DataProvider