Defining Data Managers

This page describes how to define a Data Manager.



Data Manager Components

Data Managers are composed of two components:

  • Data Manager configuration (e.g. data_manager_conf.xml)

  • Data Manager Tool

Data Manager Configuration

The Data Manager Configuration (e.g. data_manager_conf.xml) defines the set of available Data Managers using an XML description. Each Data Manager can add entries to one or more Tool Data Tables. For each Tool Data Table, the configuration defines the expected output columns and how the results of the Data Manager Tool are handled.

Data Manager Tool

A Data Manager Tool is a special class of Galaxy Tool. Data Manager Tools do not appear in the standard Tool Panel and can only be accessed by a Galaxy Administrator. When the job starts, the tool's output file already contains a JSON dictionary listing the Tool parameters and Job settings (that is, Data Manager Tools are a type of OutputParameterJSONTool; the same mechanism is available for DataSourceTools). The underlying tool is not required to use these contents, but they offer a convenient way to pass all tool and job parameters without a separate command-line argument for each piece of information.

The primary difference between a standard Galaxy Tool and a Data Manager Tool is that the primary output dataset of a Data Manager Tool must be a file containing a JSON description of the new entries to add to a Tool Data Table. The on-disk content to be referenced by the Data Manager Tool, if any, is stored within the extra_files_path of the output dataset created by the tool.
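As a rough illustration of the output contract described above, the following Python 3 sketch writes a JSON description of new entries to an output file. The helper name write_data_manager_output and the demo file are hypothetical; only the "data_tables" layout comes from the example output shown later on this page.

```python
import json
import os
import tempfile


def write_data_manager_output(out_file, table_name, entries):
    """Overwrite out_file with a JSON description of new Data Table entries.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {"data_tables": {table_name: list(entries)}}
    with open(out_file, "w") as handle:
        json.dump(payload, handle)
    return payload


# A throwaway file stands in for the tool's primary output dataset.
demo_dir = tempfile.mkdtemp()
demo_out_file = os.path.join(demo_dir, "out_file.json")
demo_entry = {
    "value": "sacCer2",
    "dbkey": "sacCer2",
    "name": "sacCer2",
    "path": "sacCer2.fa",
}
write_data_manager_output(demo_out_file, "all_fasta", [demo_entry])
```

A real tool would receive the output path on its command line rather than creating a temporary file.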


Data Manager Server Configuration Options

In your "galaxy.ini" file these settings exist in the [app:main] section:

# Data manager configuration options
enable_data_manager_user_view = True
data_manager_config_file = data_manager_conf.xml
shed_data_manager_config_file = shed_data_manager_conf.xml
galaxy_data_manager_data_path = tool-data

enable_data_manager_user_view allows non-admin users to view the data that has been managed.

data_manager_config_file names the local XML file from which the configurations of locally defined Data Managers are loaded.

shed_data_manager_config_file names the local XML file used for saving and loading the configurations of Data Managers installed from a Tool Shed.

galaxy_data_manager_data_path sets the location where files created by Data Managers are stored. When not configured, it defaults to the value of tool_data_path.
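The fallback for galaxy_data_manager_data_path can be sketched in Python 3 with the standard configparser module. This is not how Galaxy itself reads its configuration; the function name is hypothetical, and the "tool-data" default used when tool_data_path is absent is an assumption made for the example.

```python
import configparser

EXAMPLE_INI = """
[app:main]
tool_data_path = tool-data
data_manager_config_file = data_manager_conf.xml
"""


def resolve_data_manager_data_path(ini_text):
    """Return galaxy_data_manager_data_path, falling back to tool_data_path."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["app:main"]
    # Assumption for this sketch: "tool-data" when tool_data_path is unset.
    tool_data_path = section.get("tool_data_path", fallback="tool-data")
    return section.get("galaxy_data_manager_data_path", fallback=tool_data_path)


resolved = resolve_data_manager_data_path(EXAMPLE_INI)
```

Here the option is not set in EXAMPLE_INI, so the function returns the value of tool_data_path.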

An example single entry data_manager_config_file

<?xml version="1.0"?>
<data_managers> <!-- The root element -->
    <data_manager tool_file="data_manager/fetch_genome_all_fasta.xml" id="fetch_genome_all_fasta"> <!-- Defines a single Data Manager Tool that can update one or more Data Tables -->
        <data_table name="all_fasta"> <!-- Defines a Data Table to be modified. -->
            <output> <!-- Handle the output of the Data Manager Tool -->
                <column name="value" /> <!-- columns that are going to be specified by the Data Manager Tool -->
                <column name="dbkey" />
                <column name="name" />
                <column name="path" output_ref="out_file" >  <!-- The value of this column will be modified based upon data in "out_file". example value "phiX.fa" -->
                    <move type="file"> <!-- Moving a file from the extra files path of "out_file" -->
                        <source>${path}</source> <!-- File name within the extra files path -->
                        <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/seq/${path}</target> <!-- Target Location to store the file, directories are created as needed -->
                    </move>
                    <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/seq/${path}</value_translation> <!-- Store this value in the final Data Table -->
                </column>
            </output>
        </data_table>
        <!-- additional data_tables can be configured from a single Data Manager -->
    </data_manager>
</data_managers>
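To make the move and value_translation templates above more concrete, the Python 3 sketch below expands them for one entry. Galaxy performs this substitution itself; string.Template is used here only as a stand-in, and the function name resolve_path_column is hypothetical.

```python
import posixpath
from string import Template


def resolve_path_column(entry, data_manager_data_path):
    """Expand the templates from the example configuration for one entry.

    Returns (source, target, value): the file name inside the extra files
    path, the location the file is moved to, and the final column value.
    """
    variables = dict(entry, GALAXY_DATA_MANAGER_DATA_PATH=data_manager_data_path)
    source = Template("${path}").substitute(variables)
    target = posixpath.join(
        Template("${GALAXY_DATA_MANAGER_DATA_PATH}").substitute(variables),
        Template("${dbkey}/seq/${path}").substitute(variables),
    )
    value = Template(
        "${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/seq/${path}"
    ).substitute(variables)
    return source, target, value


example_entry = {"value": "sacCer2", "dbkey": "sacCer2", "name": "sacCer2", "path": "sacCer2.fa"}
example_result = resolve_path_column(example_entry, "/Users/dan/galaxy-central/tool-data")
```

With these inputs, the target and the translated value are the same path, which is why the final Data Table entry points at the moved file.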

An example data_manager/fetch_genome_all_fasta.xml

This Tool Config calls a Python script, data_manager_fetch_genome_all_fasta.py, and passes it two things: the path of a single output file, out_file, and the display text of the entry selected in the dbkey dropdown menu.

The initial contents of out_file are a JSON document from Galaxy describing the tool, including the input parameter values; the Data Manager tool is expected to be able to parse it. The tool then overwrites this file with the JSON description of its results. Any additional files to be moved can be placed in the extra_files_path of out_file.

<tool id="data_manager_fetch_genome_all_fasta" name="Reference Genome" version="0.0.1" tool_type="manage_data">
    <description>fetching</description>
    <command interpreter="python">data_manager_fetch_genome_all_fasta.py "${out_file}" --dbkey_description ${ dbkey.get_display_text() }</command>
    <inputs>
        <param name="dbkey" type="genomebuild" label="DBKEY to assign to data" />
        <param type="text" name="sequence_name" value="" label="Name of sequence" />
        <param type="text" name="sequence_desc" value="" label="Description of sequence" />
        <param type="text" name="sequence_id" value="" label="ID for sequence" />
        <conditional name="reference_source">
          <param name="reference_source_selector" type="select" label="Choose the source for the reference genome">
            <option value="ucsc">UCSC</option>
            <option value="ncbi">NCBI</option>
            <option value="url">URL</option>
            <option value="history">History</option>
            <option value="directory">Directory on Server</option>
          </param>
          <when value="ucsc">
            <param type="text" name="requested_dbkey" value="" label="UCSC's DBKEY for source FASTA" optional="False" />
          </when>
          <when value="ncbi">
            <param type="text" name="requested_identifier" value="" label="NCBI identifier" optional="False" />
          </when>
          <when value="url">
            <param type="text" area="True" name="user_url" value="http://" label="URLs" optional="False" />
          </when>
          <when value="history">
            <param name="input_fasta" type="data" format="fasta" label="FASTA File" multiple="False" optional="False" />
          </when>
          <when value="directory">
            <param type="text" name="fasta_filename" value="" label="Full path to FASTA File on disk" optional="False" />
            <param type="boolean" name="create_symlink" truevalue="create_symlink" falsevalue="copy_file" label="Create symlink to original data instead of copying" checked="False" />
          </when>
        </conditional>
    </inputs>
    <outputs>
        <data name="out_file" format="data_manager_json"/>
    </outputs>
    <!-- 
    <tests>
        <test>
            DON'T FORGET TO DEFINE SOME TOOL TESTS
        </test>
    </tests>
    -->
    <help>
**What it does**

Fetches a reference genome from various sources (UCSC, NCBI, URL, Galaxy History, or a server directory) and populates the "all_fasta" data table.

------



.. class:: infomark

**Notice:** If you leave name, description, or id blank, it will be generated automatically. 

    </help>
</tool>

An example data_manager_fetch_genome_all_fasta.py

#!/usr/bin/env python
#Dan Blankenberg
# Note: this script targets Python 2 (it imports urllib2).

import sys
import os
import tempfile
import shutil
import optparse
import urllib2
from ftplib import FTP
import tarfile

from galaxy.util.json import from_json_string, to_json_string


CHUNK_SIZE = 2**20 #1mb

def cleanup_before_exit( tmp_dir ):
    if tmp_dir and os.path.exists( tmp_dir ):
        shutil.rmtree( tmp_dir )

def stop_err(msg):
    sys.stderr.write(msg)
    sys.exit(1)

def get_dbkey_id_name( params, dbkey_description=None):
    dbkey = params['param_dict']['dbkey']
    #TODO: ensure sequence_id is unique and does not already appear in location file
    sequence_id = params['param_dict']['sequence_id']
    if not sequence_id:
        sequence_id = dbkey #uuid.uuid4() generate and use an uuid instead?

    sequence_name = params['param_dict']['sequence_name']
    if not sequence_name:
        sequence_name = dbkey_description
        if not sequence_name:
            sequence_name = dbkey
    return dbkey, sequence_id, sequence_name

def download_from_ucsc( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name ):
    UCSC_FTP_SERVER = 'hgdownload.cse.ucsc.edu'
    UCSC_CHROM_FA_FILENAME = 'chromFa.tar.gz' #FIXME: this file is actually variable...
    UCSC_DOWNLOAD_PATH = '/goldenPath/%s/bigZips/' + UCSC_CHROM_FA_FILENAME
    COMPRESSED_EXTENSIONS = [ '.tar.gz', '.tar.bz2', '.zip', '.fa.gz', '.fa.bz2' ]

    email = params['param_dict']['__user_email__']
    if not email:
        email = 'anonymous@example.com'

    ucsc_dbkey = params['param_dict']['reference_source']['requested_dbkey'] or dbkey
    ftp = FTP( UCSC_FTP_SERVER )
    ftp.login( 'anonymous', email )
    ucsc_file_name = UCSC_DOWNLOAD_PATH % ucsc_dbkey

    tmp_dir = tempfile.mkdtemp( prefix='tmp-data-manager-ucsc-' )
    ucsc_fasta_filename = os.path.join( tmp_dir, UCSC_CHROM_FA_FILENAME )

    tmp_extract_dir = os.path.join ( tmp_dir, 'extracted_fasta' )
    os.mkdir( tmp_extract_dir )

    #Download the archive into a temporary file, then rewind it for reading
    tmp_fasta = open( ucsc_fasta_filename, 'wb+' )

    ftp.retrbinary( 'RETR %s' % ucsc_file_name, tmp_fasta.write )

    tmp_fasta.seek( 0 )
    fasta_tar = tarfile.open( fileobj=tmp_fasta, mode='r:*' )

    #extractfile() returns None for non-file members (e.g. directories), so skip them
    fasta_reader = [ fasta_tar.extractfile( member ) for member in fasta_tar.getmembers() if member.isfile() ]

    #The output FASTA file is opened and written by _stream_fasta_to_file
    data_table_entry = _stream_fasta_to_file( fasta_reader, target_directory, dbkey, sequence_id, sequence_name )
    _add_data_table_entry( data_manager_dict, data_table_entry )

    fasta_tar.close()
    tmp_fasta.close()
    cleanup_before_exit( tmp_dir )

def download_from_ncbi( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name ):
    NCBI_DOWNLOAD_URL = 'http://togows.dbcls.jp/entry/ncbi-nucleotide/%s.fasta' #FIXME: taken from dave's genome manager...why some japan site?

    requested_identifier = params['param_dict']['reference_source']['requested_identifier']
    url = NCBI_DOWNLOAD_URL % requested_identifier
    fasta_reader = urllib2.urlopen( url )

    data_table_entry = _stream_fasta_to_file( fasta_reader, target_directory, dbkey, sequence_id, sequence_name )
    _add_data_table_entry( data_manager_dict, data_table_entry )

def download_from_url( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name ):
    #One URL per line; blank lines are ignored
    urls = filter( bool, map( lambda x: x.strip(), params['param_dict']['reference_source']['user_url'].split( '\n' ) ) )
    fasta_reader = [ urllib2.urlopen( url ) for url in urls ]

    data_table_entry = _stream_fasta_to_file( fasta_reader, target_directory, dbkey, sequence_id, sequence_name )
    _add_data_table_entry( data_manager_dict, data_table_entry )

def download_from_history( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name ):
    #TODO: allow multiple FASTA input files
    input_filename = params['param_dict']['reference_source']['input_fasta']
    if isinstance( input_filename, list ):
        fasta_reader = [ open( filename, 'rb' ) for filename in input_filename ]
    else:
        fasta_reader = open( input_filename, 'rb' )

    data_table_entry = _stream_fasta_to_file( fasta_reader, target_directory, dbkey, sequence_id, sequence_name )
    _add_data_table_entry( data_manager_dict, data_table_entry )

def copy_from_directory( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name ):
    input_filename = params['param_dict']['reference_source']['fasta_filename']
    create_symlink = params['param_dict']['reference_source']['create_symlink'] == 'create_symlink'
    if create_symlink:
        data_table_entry = _create_symlink( input_filename, target_directory, dbkey, sequence_id, sequence_name )
    else:
        if isinstance( input_filename, list ):
            fasta_reader = [ open( filename, 'rb' ) for filename in input_filename ]
        else:
            fasta_reader = open( input_filename, 'rb' )
        data_table_entry = _stream_fasta_to_file( fasta_reader, target_directory, dbkey, sequence_id, sequence_name )
    _add_data_table_entry( data_manager_dict, data_table_entry )

def _add_data_table_entry( data_manager_dict, data_table_entry ):
    #Append the entry to the 'all_fasta' list, creating the nested structure if needed
    data_manager_dict['data_tables'] = data_manager_dict.get( 'data_tables', {} )
    data_manager_dict['data_tables']['all_fasta'] = data_manager_dict['data_tables'].get( 'all_fasta', [] )
    data_manager_dict['data_tables']['all_fasta'].append( data_table_entry )
    return data_manager_dict

def _stream_fasta_to_file( fasta_stream, target_directory, dbkey, sequence_id, sequence_name, close_stream=True ):
    fasta_base_filename = "%s.fa" % sequence_id
    fasta_filename = os.path.join( target_directory, fasta_base_filename )
    fasta_writer = open( fasta_filename, 'wb+' )

    if isinstance( fasta_stream, list ) and len( fasta_stream ) == 1:
        fasta_stream = fasta_stream[0]

    if isinstance( fasta_stream, list ):
        #Concatenate several streams, inserting a newline between them when needed
        last_char = None
        for fh in fasta_stream:
            if last_char not in [ None, '\n', '\r' ]:
                fasta_writer.write( '\n' )
            while True:
                data = fh.read( CHUNK_SIZE )
                if data:
                    fasta_writer.write( data )
                    last_char = data[-1]
                else:
                    break
            if close_stream:
                fh.close()
    else:
        while True:
            data = fasta_stream.read( CHUNK_SIZE )
            if data:
                fasta_writer.write( data )
            else:
                break
        if close_stream:
            fasta_stream.close()

    fasta_writer.close()

    return dict( value=sequence_id, dbkey=dbkey, name=sequence_name, path=fasta_base_filename )

def _create_symlink( input_filename, target_directory, dbkey, sequence_id, sequence_name ):
    fasta_base_filename = "%s.fa" % sequence_id
    fasta_filename = os.path.join( target_directory, fasta_base_filename )
    os.symlink( input_filename, fasta_filename )
    return dict( value=sequence_id, dbkey=dbkey, name=sequence_name, path=fasta_base_filename )

REFERENCE_SOURCE_TO_DOWNLOAD = dict( ucsc=download_from_ucsc, ncbi=download_from_ncbi, url=download_from_url, history=download_from_history, directory=copy_from_directory )


def main():
    #Parse Command Line
    parser = optparse.OptionParser()
    parser.add_option( '-d', '--dbkey_description', dest='dbkey_description', action='store', type="string", default=None, help='dbkey_description' )
    (options, args) = parser.parse_args()

    filename = args[0]

    #Read the tool and job parameters that Galaxy wrote to the output file
    params = from_json_string( open( filename ).read() )
    target_directory = params[ 'output_data' ][0]['extra_files_path']
    os.mkdir( target_directory )
    data_manager_dict = {}

    dbkey, sequence_id, sequence_name = get_dbkey_id_name( params, dbkey_description=options.dbkey_description )

    if dbkey in [ None, '', '?' ]:
        raise Exception( '"%s" is not a valid dbkey. You must specify a valid dbkey.' % ( dbkey ) )

    #Fetch the FASTA
    REFERENCE_SOURCE_TO_DOWNLOAD[ params['param_dict']['reference_source']['reference_source_selector'] ]( data_manager_dict, params, target_directory, dbkey, sequence_id, sequence_name )

    #save info to json file
    open( filename, 'wb' ).write( to_json_string( data_manager_dict ) )

if __name__ == "__main__":
    main()
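The bookkeeping done by _add_data_table_entry in the script above can be written more compactly with dict.setdefault. The sketch below is a possible Python 3 variant, not part of the original script; the function name and the table_name parameter are illustrative additions.

```python
def add_data_table_entry(data_manager_dict, table_name, entry):
    """Append entry to data_manager_dict['data_tables'][table_name].

    Missing intermediate containers are created on first use.
    """
    tables = data_manager_dict.setdefault("data_tables", {})
    tables.setdefault(table_name, []).append(entry)
    return data_manager_dict


results = {}
add_data_table_entry(results, "all_fasta", {"value": "sacCer2", "path": "sacCer2.fa"})
add_data_table_entry(results, "all_fasta", {"value": "phiX", "path": "phiX.fa"})
```

Taking the table name as a parameter lets one script feed several Data Tables, which the configuration format allows.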

Example JSON input to tool, dbkey is sacCer2

{
   "param_dict":{
      "__datatypes_config__":"/Users/dan/galaxy-central/database/tmp/tmphyQRH3",
      "__get_data_table_entry__":"<function get_data_table_entry at 0x10d435b90>",
      "userId":"1",
      "userEmail":"dan@bx.psu.edu",
      "dbkey":"sacCer2",
      "sequence_desc":"",
      "GALAXY_DATA_INDEX_DIR":"/Users/dan/galaxy-central/tool-data",
      "__admin_users__":"dan@bx.psu.edu,dan+2@bx.psu.edu",
      "__app__":"galaxy.app:UniverseApplication",
      "__user_email__":"dan@bx.psu.edu",
      "sequence_name":"",
      "GALAXY_DATATYPES_CONF_FILE":"/Users/dan/galaxy-central/database/tmp/tmphyQRH3",
      "__user_name__":"danb",
      "sequence_id":"",
      "reference_source":{
         "reference_source_selector":"ncbi",
         "requested_identifier":"sacCer2",
         "__current_case__":"1"
      },
      "__new_file_path__":"/Users/dan/galaxy-central/database/tmp",
      "__user_id__":"1",
      "out_file":"/Users/dan/galaxy-central/database/files/000/dataset_200.dat",
      "GALAXY_ROOT_DIR":"/Users/dan/galaxy-central",
      "__tool_data_path__":"/Users/dan/galaxy-central/tool-data",
      "__root_dir__":"/Users/dan/galaxy-central",
      "chromInfo":"/Users/dan/galaxy-central/tool-data/shared/ucsc/chrom/sacCer2.len"
   },
   "output_data":[
      {
         "extra_files_path":"/Users/dan/galaxy-central/database/job_working_directory/000/202/dataset_200_files",
         "file_name":"/Users/dan/galaxy-central/database/files/000/dataset_200.dat",
         "ext":"data_manager_json",
         "out_data_name":"out_file",
         "hda_id":201,
         "dataset_id":200
      }
   ],
   "job_config":{
      "GALAXY_ROOT_DIR":"/Users/dan/galaxy-central",
      "GALAXY_DATATYPES_CONF_FILE":"/Users/dan/galaxy-central/database/tmp/tmphyQRH3",
      "TOOL_PROVIDED_JOB_METADATA_FILE":"galaxy.json"
   }
}
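The fields a Data Manager tool typically needs from this input can be pulled out with the standard json module. The Python 3 sketch below uses a trimmed copy of the example above; the function name summarize_job_params and the keys of the returned dictionary are hypothetical.

```python
import json

# Trimmed copy of the example input; only the keys used below are kept.
EXAMPLE_INPUT = """
{
   "param_dict": {
      "dbkey": "sacCer2",
      "sequence_id": "",
      "sequence_name": "",
      "reference_source": {
         "reference_source_selector": "ncbi",
         "requested_identifier": "sacCer2"
      }
   },
   "output_data": [
      {
         "extra_files_path": "/Users/dan/galaxy-central/database/job_working_directory/000/202/dataset_200_files",
         "out_data_name": "out_file"
      }
   ]
}
"""


def summarize_job_params(params):
    """Collect the values needed to choose a source and a target directory."""
    param_dict = params["param_dict"]
    return {
        "dbkey": param_dict["dbkey"],
        "source": param_dict["reference_source"]["reference_source_selector"],
        # Files written here can be moved into place by the Data Manager.
        "target_directory": params["output_data"][0]["extra_files_path"],
    }


summary = summarize_job_params(json.loads(EXAMPLE_INPUT))
```

The "source" value is the key that the example script uses to pick a download function.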

Example JSON Output from tool to galaxy, dbkey is sacCer2

{
   "data_tables":{
      "all_fasta":[
         {
            "path":"sacCer2.fa",
            "dbkey":"sacCer2",
            "name":"S. cerevisiae June 2008 (SGD/sacCer2) (sacCer2)",
            "value":"sacCer2"
         }
      ]
   }
}

New Entry in Data Table, dbkey is sacCer2

#<unique_build_id>      <dbkey>         <display_name>  <file_path>
sacCer2 sacCer2 S. cerevisiae June 2008 (SGD/sacCer2) (sacCer2) /Users/dan/galaxy-central/tool-data/sacCer2/seq/sacCer2.fa
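How one output entry maps onto a row like the one above can be sketched in Python 3 as follows. Galaxy writes the row itself; the function name format_table_row is hypothetical, and tab separation is an assumption based on the column layout of the header line.

```python
# Column order taken from the <output> section of the example configuration.
COLUMNS = ["value", "dbkey", "name", "path"]


def format_table_row(entry, columns=COLUMNS):
    """Join the entry's values in column order, separated by tabs."""
    return "\t".join(str(entry[column]) for column in columns)


translated_entry = {
    "value": "sacCer2",
    "dbkey": "sacCer2",
    "name": "S. cerevisiae June 2008 (SGD/sacCer2) (sacCer2)",
    # The path is shown after value_translation has been applied.
    "path": "/Users/dan/galaxy-central/tool-data/sacCer2/seq/sacCer2.fa",
}
row = format_table_row(translated_entry)
```

Because the display name contains spaces, a tab separator keeps the four columns unambiguous.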

