Diff for "DataProviders"

Differences between revisions 15 and 16
Revision 15 as of 2014-01-29 17:41:50
Size: 8324
Editor: CarlEberhard
Comment:
Revision 16 as of 2014-01-29 19:57:53
Size: 727
Editor: CarlEberhard
Comment:
Line 1: Line 1:
{{{#!wiki red/solid
This is a work in progress.
}}}
class ColumnarDataProvider( line.RegexLineDataProvider ):
    """
    Data provider that provides a list of columns from the lines of its source.
Line 5: Line 5:
DataProviders are a framework for easily controlling how data can be provided from a source
(generally, a dataset's file contents). They are meant to be:
    Columns are returned in the order given in indeces, so this provider can
    re-arrange columns.
Line 8: Line 8:
 1. Simple to declare, configure, and associate with some source of data.
 2. Simple to combine, by allowing piping: sending one provider through one or more others until a final, desired format/query is provided.
 3. Fast and efficient, by allowing narrow queries that provide only specified amounts of data from specified locations.

They are ''not'' meant to be:

 1. Replacements for the tools and workflows through which Galaxy provides reproducibility.
 2. Writable in any sense. They can't alter the original data in any permanent fashion.

Essentially, they are meant to be a 'view' of your data and not the data themselves.

Currently, data providers are only available for the file contents of a dataset.

----
== How to get data using DataProviders ==

Currently, there are two entry points to data providers for datasets:
 * Programmatically (typically in a visualization template or other Python script) by calling a datatype's `dataprovider`
 method and passing in (at a minimum) the dataset and the name of a particular format/dataprovider
 * Via the datasets API (which itself calls the `dataprovider` method)


==== datatype.dataprovider ====

If the datatype has no provider assigned to the given name, an error is raised. If it does, any additional
parameters to the method are parsed and a Python generator is returned. This generator yields one datum at a time;
the form of each datum depends on the type of provider (and on the additional arguments).

For example, given dataset contents in a tabular file called 'dataset1':
{{{
# yet another data format
1 10 11 110
2 20 22 220

3 30 33 330
}}}

Within a visualization template or Python script, one could get each line as an array of column values by calling:
{{{#!highlight python
for array in dataset1.datatype.dataprovider( dataset1, 'column' ):
    print( array )
# [ "1", "10", "11", "110" ]
# [ "2", "20", "22", "220" ]
# [ "3", "30", "33", "330" ]
}}}
Note: When using text file datatypes, both comments (lines beginning with '#' - although this is configurable) and
blank lines are stripped from the output.

Pass in additional arguments to filter or configure output by adding keyword arguments to the provider. For example,
to limit the above to only two lines and offset by one line:
{{{#!highlight python
for array in dataset1.datatype.dataprovider( dataset1, 'column', limit=2, offset=1 ):
    print( array )
# [ "2", "20", "22", "220" ]
# [ "3", "30", "33", "330" ]
}}}
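
The behaviour shown in the two examples above (comments and blank lines dropped, then `offset` and `limit` applied to
what remains) can be imitated with plain Python generators. This is only an illustrative sketch, not Galaxy's
implementation: `provide_lines` and `provide_columns` are hypothetical names, and the sketch assumes tab-separated
columns and that `offset` and `limit` count lines ''after'' filtering, as the outputs above suggest.
{{{#!highlight python
from itertools import islice

def provide_lines(source, comment_char='#', limit=None, offset=0):
    """Return an iterator over stripped, non-blank, non-comment lines."""
    stripped = (line.strip() for line in source)
    kept = (line for line in stripped if line and not line.startswith(comment_char))
    # offset and limit are applied to the filtered lines, not the raw ones
    stop = None if limit is None else offset + limit
    return islice(kept, offset, stop)

def provide_columns(source, delimiter='\t', **kwargs):
    """Pipe the line provider into a column splitter."""
    for line in provide_lines(source, **kwargs):
        yield line.split(delimiter)

# stand-in for the contents of 'dataset1' (hypothetical in-memory data)
contents = [
    "# yet another data format\n",
    "1\t10\t11\t110\n",
    "2\t20\t22\t220\n",
    "\n",
    "3\t30\t33\t330\n",
]
for array in provide_columns(contents, limit=2, offset=1):
    print(array)
# ['2', '20', '22', '220']
# ['3', '30', '33', '330']
}}}
Because each stage is a generator, no line is read or split until the consumer asks for it, which is what keeps
narrow queries cheap.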


==== The datasets API ====

You can access data providers for a dataset via the datasets API by passing the provider name as an argument (for more
information on how to use the API see [[Learn/API]]).

{{{#!highlight bash
curl 'http://localhost:8080/api/datasets/86cf1d3beeec9f1c?data_type=raw_data&provider=column&limit=2&offset=1&api_key=cf8245802b54146014108216e815d6e4'
{
    "data": [
        [
            "2",
            "20",
            "22",
            "220"
        ],
        [
            "3",
            "30",
            "33",
            "330"
        ]
    ]
}
}}}
Note that the API returns a [[http://www.json.org/|JSON]] formatted object and the array of data is an attribute of
that object named 'data'. This allows the API to send additional information (such as number of datapoints, metadata,
aggregate information, etc.) as other attributes if needed.

Commonly, the API will be accessed by a JavaScript client (e.g. the browser window in a visualization). For example,
getting data through the API with [[http://jquery.com/|jQuery]]:

{{{#!highlight javascript
var xhr = jQuery.getJSON( "/api/datasets/86cf1d3beeec9f1c", {
    data_type : 'raw_data',
    provider : 'column',
    limit : 2,
    offset : 1
});
xhr.done( function( response ){
    console.debug( response.data );
    // [["2", "20", "22", "220"], ["3", "30", "33", "330"]]
    // ...do something with data
});
}}}
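
The same request can also be assembled from a Python script. The sketch below (Python 3, standard library only) builds
the query URL used in the curl example and pulls the 'data' attribute out of a JSON response. The helper names
`dataset_provider_url` and `extract_data` are hypothetical, the API key is deliberately left out (it would be added
as `api_key`, as in the curl call), and no request is actually sent.
{{{#!highlight python
import json
from urllib.parse import urlencode

def dataset_provider_url(base_url, dataset_id, provider, **options):
    """Build a datasets API URL that asks for a named data provider."""
    query = {'data_type': 'raw_data', 'provider': provider}
    query.update(options)
    # sort the parameters only to make the resulting URL deterministic
    return '%s/api/datasets/%s?%s' % (
        base_url.rstrip('/'), dataset_id, urlencode(sorted(query.items())))

def extract_data(response_text):
    """Return the 'data' attribute of the JSON object sent by the API."""
    return json.loads(response_text)['data']

url = dataset_provider_url('http://localhost:8080', '86cf1d3beeec9f1c',
                           'column', limit=2, offset=1)
print(url)
# http://localhost:8080/api/datasets/86cf1d3beeec9f1c?data_type=raw_data&limit=2&offset=1&provider=column

# response body copied from the curl example above
sample = '{"data": [["2", "20", "22", "220"], ["3", "30", "33", "330"]]}'
print(extract_data(sample))
# [['2', '20', '22', '220'], ['3', '30', '33', '330']]
}}}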


There are many data providers included in Galaxy already.

For all formats:
 * base: reads directly from the file without any formatting or filtering
 * chunk: allows breaking file contents into sections of `chunk_size` bytes and offsetting using `chunk_index`
 * chunk64: as chunk, but encodes each section using base64 encoding

For text based formats:
 * line: reads each line from a file. Supports limit, offset, and filter functions (in Python); by default it removes
 blank lines, removes commented lines, and strips whitespace from the beginning and end of lines (all configurable)
 * regex-line: as the 'line' provider above, but also accepts `regex_list`, a list of Python regex strings. If a line
 matches one or more of the regexes in `regex_list`, the line is output. The match can be inverted using
 `invert`.

For tabular formats:
 * column: lines are returned as arrays of column data (as above). Many options are available for this provider
 including:
    * indeces: return only the columns specified by a 0-based, comma-separated list of integers (e.g. '0,2,5')
    * column_count: return only the first N columns from each line
    * deliminator: the column separator; defaults to the tab character but can be set to parse comma-separated data as well
    * column_types: a CSV string of Python primitive names used to parse each column (e.g. 'str,int,float,bool').
    Can work in tandem with indeces to parse only those columns requested.
 * dict: return each line as a dictionary. Keys used should be sent as `column_names`, a CSV list of strings
 (e.g. column_names='id,start,end'). Based on the `column` provider and allows all options used there.

Dataset specific providers for text and tabular formats:
 * You'll often see these used in the built-in providers (e.g. those found in `datatypes/tabular.py`). They attempt
 to infer the proper column settings for other providers by using a dataset's metadata (column names, types, etc.)
 * dataset-column: as the column provider, but infers `column_types` from metadata (for easier parsing)
 * dataset-dict: as the dict provider, but infers `column_names` from metadata. Names can be overridden by
 still passing in `column_names`.

For interval datatypes:
 * genomic-region and genomic-region-dict: parse and return chromosome, start, and end data as arrays or dictionaries
 respectively.
 * interval and interval-dict: as genomic-region, but also returns strand and name if set in the metadata.

Other providers can be found within the datatype class definitions for datatypes included in Galaxy.
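
To make the `column_types` and `column_names` options listed above more concrete, here is a small sketch of what they
amount to for a single line. It is not Galaxy's code: `parse_value`, `parse_columns` and `as_dict` are hypothetical
helpers, and the rule used for 'bool' is an assumption.
{{{#!highlight python
def parse_value(type_name, text):
    """Convert a column string using a primitive type name such as 'int'."""
    parsers = {
        'str': str,
        'int': int,
        'float': float,
        # assumption: a few common spellings count as true; Galaxy's rule may differ
        'bool': lambda s: s.strip().lower() in ('true', 'yes', '1'),
    }
    return parsers[type_name](text)

def parse_columns(line, column_types, delimiter='\t'):
    """Split a line and parse each column with the matching type name."""
    types = column_types.split(',')
    columns = line.split(delimiter)
    return [parse_value(t, c) for t, c in zip(types, columns)]

def as_dict(line, column_names, column_types, delimiter='\t'):
    """Pair parsed columns with names, as a 'dict'-style provider would."""
    names = column_names.split(',')
    return dict(zip(names, parse_columns(line, column_types, delimiter)))

print(parse_columns('chr10\t180404\t300577', 'str,int,int'))
# ['chr10', 180404, 300577]
print(as_dict('chr10,180404,300577', 'chrom,start,end', 'str,int,int', delimiter=','))
# {'chrom': 'chr10', 'start': 180404, 'end': 300577}
}}}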


----
== How to filter and format data using DataProviders ==

Although still a work in progress, you can use several aspects of existing data providers to easily filter data either
with Python or through the API.

Python is currently the more powerful option. For example, in a visualization template one could pass a filter
function to a 'genomic-region-dict' provider:
{{{#!highlight python
# Keep only lines whose first tab-separated column is 'chr10'.
# The filter returns the datum to keep it, or None to drop it.
def filter_chr10( datum ):
    chrom = datum.split( '\t' )[0]
    if chrom == 'chr10':
        return datum
    return None

for region in hda.datatype.dataprovider( hda, 'genomic-region-dict', limit=2, offset=1, filter_fn=filter_chr10 ):
    print( region )
# {'start': 180404, 'end': 300577, 'chrom': 'chr10'}
# {'start': 180423, 'end': 295729, 'chrom': 'chr10'}
}}}

Through the API, the easiest way (currently) is to use the `regex_list` argument:
{{{#!highlight javascript
var xhr = jQuery.getJSON( "/api/datasets/86cf1d3beeec9f1c", {
    data_type : 'raw_data',
    provider : 'genomic-region-dict',
    limit : 2,
    offset : 1,
    regex_list : '^chr10\\b'
});
xhr.done( function( response ){
    console.debug( response.data );
    // [Object { chrom="chr10", end=300577, start=180404}, Object { chrom="chr10", end=295729, start=180423}]
});
}}}
(Note the double-backslash escaping of '\\b', which allows us to send the regex with a proper, final '\b' and not the
ASCII backspace character)
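
The same escaping pitfall exists when the regex is written in Python: in an ordinary string literal '\b' is the
backspace character, so a raw string is the usual choice. The sketch below also imitates the `regex_list` and
`invert` options on a few made-up lines. `regex_line_filter` is a hypothetical helper, and the use of `search` (rather
than `match`) is an assumption, not necessarily how Galaxy applies the list.
{{{#!highlight python
import re

def regex_line_filter(lines, regex_list, invert=False):
    """Yield lines matching at least one regex (or none of them, if invert)."""
    compiled = [re.compile(regex) for regex in regex_list]
    for line in lines:
        matched = any(regex.search(line) for regex in compiled)
        if matched != invert:
            yield line

# made-up interval lines for illustration
lines = ['chr1\t10\t20', 'chr10\t180404\t300577', 'chr100\t5\t9']

# the raw string keeps '\b' as a regex word boundary
print(list(regex_line_filter(lines, [r'^chr10\b'])))
# ['chr10\t180404\t300577']
print(list(regex_line_filter(lines, [r'^chr10\b'], invert=True)))
# ['chr1\t10\t20', 'chr100\t5\t9']

# without the raw prefix, '\b' is a single backspace character (0x08)
print('^chr10\b' == '^chr10\x08')
# True
}}}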

----
== How to define a new DataProvider ==



----
== Troubleshooting ==
    If any desired index is outside the actual number of columns
    in the source, this provider will None-pad the output and you are guaranteed
    the same number of columns as the number of indeces asked for (even if they
    are filled with None).
    """
    settings = {
        'indeces' : 'list:int',
        'column_count' : 'int',
        'column_types' : 'list:str',
        'parse_columns' : 'bool',
        'deliminator' : 'str'
    }

class ColumnarDataProvider( line.RegexLineDataProvider ):
    """
    Data provider that provides a list of columns from the lines of its source.

    Columns are returned in the order given in indeces, so this provider can
    re-arrange columns.

    If any desired index is outside the actual number of columns
    in the source, this provider will None-pad the output and you are guaranteed
    the same number of columns as the number of indeces asked for (even if they
    are filled with None).
    """
    settings = {
        'indeces' : 'list:int',
        'column_count' : 'int',
        'column_types' : 'list:str',
        'parse_columns' : 'bool',
        'deliminator' : 'str'
    }
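
The re-ordering and None-padding described in the docstring can be illustrated with a few lines of standalone Python.
This is a sketch of the described behaviour only: `select_columns` is a hypothetical function, not a method of
`ColumnarDataProvider`.
{{{#!highlight python
def select_columns(columns, indeces):
    """Return columns in the order given by indeces, None-padding missing ones."""
    return [columns[i] if 0 <= i < len(columns) else None for i in indeces]

row = ['chr10', '180404', '300577']

# columns come back in the requested order, so they can be re-arranged
print(select_columns(row, [2, 0]))
# ['300577', 'chr10']

# index 5 does not exist in the row, so its slot is filled with None
print(select_columns(row, [0, 5]))
# ['chr10', None]
}}}
Either way the result has exactly as many entries as indeces were requested, which is the guarantee the docstring
states.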