DataProviders

This is a work in progress.

DataProviders are a framework for easily controlling how data can be provided from a source (generally, a dataset's file contents). They are meant to be:

  1. Simple to declare, configure, and associate with some source of data.
  2. Composable: one provider can be piped through one or more others until a final, desired format/query is produced (see the conceptual sketch after this list).
  3. Fast and efficient: narrow queries can return only specified amounts of data from specified locations.
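
Piping is easiest to see with plain python generators. The following is a conceptual sketch only (the function names below are illustrative, not Galaxy's actual provider classes): each 'provider' wraps the one before it and narrows or reshapes its output.

def line_provider( source, comment_char='#' ):
    # yield stripped, non-blank, non-comment lines from an iterable of strings
    for line in source:
        line = line.strip()
        if line and not line.startswith( comment_char ):
            yield line

def limit_offset_provider( provider, limit=None, offset=0 ):
    # skip `offset` items, then yield at most `limit` items
    for i, item in enumerate( provider ):
        if i < offset:
            continue
        if limit is not None and i >= offset + limit:
            break
        yield item

def column_provider( provider, deliminator='\t' ):
    # split each line into a list of column strings
    for line in provider:
        yield line.split( deliminator )

raw = [ '# yet another data format', '1\t10\t11\t110', '2\t20\t22\t220', '', '3\t30\t33\t330' ]
piped = column_provider( limit_offset_provider( line_provider( raw ), limit=2, offset=1 ) )
print( list( piped ) )
# [['2', '20', '22', '220'], ['3', '30', '33', '330']]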

They are not meant to be:

  1. Replacements for the tools and workflows through which Galaxy provides reproducibility.
  2. Writable in any sense. They can't alter the original data in any permanent fashion.

Essentially, they are meant to be a 'view' of your data and not the data themselves.

Currently, data providers are only available for the file contents of a dataset.


How to get data using DataProviders

Currently, there are two entry points to data providers for datasets:

  • Programmatically (typically in a visualization template or other python script) by calling a datatype's dataprovider method and passing in (at a minimum) the dataset and the name of a particular format/dataprovider

  • Via the datasets API (which itself calls the dataprovider method)

datatype.dataprovider

If the datatype does not have a provider assigned to the given name, an error is raised. If it does, any additional parameters to the method are parsed and a python generator is returned. This generator yields individual pieces of data whose form depends on the type of provider (and on the additional arguments).

For example, given dataset contents in a tabular file called 'dataset1':

# yet another data format
1   10  11  110
2   20  22  220

3   30  33  330

Within a visualization template or python script, one could get each line as an array of columnar data by calling:

for array in dataset1.datatype.dataprovider( dataset1, 'column' ):
    print array
# [ "1", "10", "11", "110" ]
# [ "2", "20", "22", "220" ]
# [ "3", "30", "33", "330" ]

Note: When using text file datatypes, both comments (lines beginning with '#' - although this is configurable) and blank lines are stripped from the output.

Pass in additional arguments to filter or configure output by adding keyword arguments to the provider. For example, to limit the above to only two lines and offset by one line:

for array in dataset1.datatype.dataprovider( dataset1, 'column', limit=2, offset=1 ):
    print array
# [ "2", "20", "22", "220" ]
# [ "3", "30", "33", "330" ]
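
Other provider options are passed the same way. As a sketch only (the `comment_char` keyword name below is an assumption based on the note above, not a parameter documented on this page), changing which prefix marks a comment line might look like:

# 'comment_char' is a hypothetical keyword name for the configurable comment prefix
for array in dataset1.datatype.dataprovider( dataset1, 'column', comment_char=';' ):
    print array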

The datasets API

You can access data providers for a dataset via the datasets API by passing the provider name as an argument (for more information on how to use the API see Learn/API).

curl 'http://localhost:8080/api/datasets/86cf1d3beeec9f1c?data_type=raw_data&provider=column&limit=2&offset=1&api_key=cf8245802b54146014108216e815d6e4'
{
    "data": [
        [
            "2",
            "20",
            "22",
            "220"
        ],
        [
            "3",
            "30",
            "33",
            "330"
        ]
    ]
}

Note: the API returns a JSON formatted object and the array of data is an attribute of that object named 'data'. This allows the API to send additional information (such as the number of datapoints, metadata, aggregate information, etc.) as other attributes if needed.

Commonly, the API will be accessed by a JavaScript client (e.g. the browser window in a visualization). For example, getting data through the API with jQuery:

var xhr = jQuery.getJSON( "/api/datasets/86cf1d3beeec9f1c", {
    data_type : 'raw_data',
    provider  : 'column',
    limit     : 2,
    offset    : 1
});
xhr.done( function( response ){
    console.debug( response.data );
    // [["2", "20", "22", "220"], ["3", "30", "33", "330"]]
    // ...do something with data
});
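
The same call can also be made from a python script. Below is a minimal sketch using the third-party `requests` library (the dataset id and API key are the placeholder values from the curl example above; substitute your own):

import requests

response = requests.get(
    'http://localhost:8080/api/datasets/86cf1d3beeec9f1c',
    params={
        'data_type' : 'raw_data',
        'provider'  : 'column',
        'limit'     : 2,
        'offset'    : 1,
        'api_key'   : 'cf8245802b54146014108216e815d6e4',
    } )
print( response.json()[ 'data' ] )
# [["2", "20", "22", "220"], ["3", "30", "33", "330"]]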

There are many data providers included in Galaxy already.

For all formats:

  • base: reads directly from the file without any formatting or filtering
  • chunk: allows breaking file contents into sections of chunk_size bytes and offsetting using chunk_index (see the sketch after this list)
  • chunk64: as chunk, but encodes each section using base64 encoding
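
For instance, following the calling pattern shown earlier, reading the second 1024-byte chunk might look like the sketch below (the exact shape of what each chunk yields depends on the provider):

# read the second 1024-byte section of the file using the options named above
for chunk in dataset1.datatype.dataprovider( dataset1, 'chunk', chunk_size=1024, chunk_index=1 ):
    print chunk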

For text based formats:

  • line: reads each line from a file; allows limit, offset, and filter functions (in python); removes blank lines, removes commented lines, and strips whitespace from the beginning and end of each line (all configurable)
  • regex-line: as the 'line' provider above, but also allowing regex_list, a list of python regex strings. If a line matches one or more of the regular expressions in regex_list, the line is output. The match can also be inverted using invert (see the sketch after this list).
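
As a sketch of the 'regex-line' provider (assuming regex_list accepts a python list of pattern strings when called programmatically, following the calling pattern shown earlier):

# output only lines that match at least one pattern in regex_list
for line in dataset1.datatype.dataprovider( dataset1, 'regex-line', regex_list=[ '^2', '^3' ] ):
    print line
# given the 'dataset1' contents above, this would output the lines beginning with '2' and '3'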

For tabular formats:

  • column: lines are returned as arrays of column data (as above). Many options are available for this provider including:
    • indeces: return only the columns specified by a 0-based, comma separated list of integers (e.g. '0,2,5')
    • column_count: return only the first N columns from each line
    • deliminator: defaults to the tab character but can be used to parse comma separated data as well
    • column_types: a CSV string of python primitive names used to parse each column (e.g. 'str,int,float,bool'). Can work in tandem with indeces to parse only those columns requested.
  • dict: return each line as a dictionary. The keys to use should be sent as column_names, a CSV list of strings (e.g. column_names='id,start,end'). See the sketch after this list.
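
A sketch combining the tabular options against the 'dataset1' example above (assuming the indeces, column_types, and column_names values are passed as the CSV strings described in the list; the commented output is illustrative):

# return only columns 0 and 3, parsed as integers
for array in dataset1.datatype.dataprovider( dataset1, 'column', indeces='0,3', column_types='int,int' ):
    print array
# e.g. [ 1, 110 ], [ 2, 220 ], [ 3, 330 ]

# return each line as a dictionary keyed by column_names
for row in dataset1.datatype.dataprovider( dataset1, 'dict', column_names='id,start,end,value' ):
    print row
# e.g. { 'id': '1', 'start': '10', 'end': '11', 'value': '110' }, ...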


How to filter and format data using DataProviders


How to define a new DataProvider


Troubleshooting