The process of adding a completely new datatype is not significantly different from the process of adding a subclassed datatype, as covered in the other tutorial. It is highly recommended that you read that page first to gain a good understanding of how to add subclassed datatypes. Since Galaxy already ships many datatypes, it's very likely that one of them will provide a good starting point for developing your own.
In this real-life example, we'll add a datatype named GenBank to support GenBank files.
First, we'll create a file at lib/galaxy/datatypes/csequence.py. This file could contain some of the standard sequence types, though we'll only implement GenBank.
    """
    Classes for all common sequence formats
    """

    from galaxy.datatypes import data
    from galaxy.datatypes.metadata import MetadataElement

    import os
    import logging

    log = logging.getLogger(__name__)

    class GenBank( data.Text ):
        """
        Class describing GenBank sequence files
        """
        file_ext = "genbank"
This is all you need to get started with a datatype. Now, load it into your datatypes_conf.xml by adding the following line:
<datatype extension="genbank" type="galaxy.datatypes.csequence:GenBank" display_in_upload="True" />
and start up your server. Were you watching the logs carefully? No? Wondering why your datatype isn't showing up in the upload tool? Well, if you dig through your logs you'll see this message:
    galaxy.datatypes.registry ERROR 2014-07-17 12:43:23,939 Error importing datatype module galaxy.datatypes.csequence: 'module' object has no attribute 'csequence'
    Traceback (most recent call last):
      File "/home/hxr/work/galaxy-central/lib/galaxy/datatypes/registry.py", line 208, in load_datatypes
        module = getattr( module, mod )
    AttributeError: 'module' object has no attribute 'csequence'
This error comes as a result of the module not being imported by registry.py. You'll need to add your module as an import to the top of registry.py:
    import csequence
Once you've done this, restart your server and the datatype will be available. Note that this problem can be avoided by using the Tool Shed to distribute your datatypes, since Galaxy handles imports of Tool Shed-installed datatypes differently from those of locally installed ones.
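The AttributeError in the log above follows from how Python treats packages: a submodule only becomes an attribute of its parent package once something has imported it. The sketch below reproduces that behaviour outside Galaxy, using a throwaway package with the hypothetical name demo_pkg as a stand-in for galaxy.datatypes. It is an illustration of the mechanism only, not of how registry.py is written.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg/__init__.py and demo_pkg/csequence.py.
# "demo_pkg" is a hypothetical stand-in for galaxy.datatypes.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as handle:
    handle.write("")
with open(os.path.join(pkg_dir, "csequence.py"), "w") as handle:
    handle.write("class GenBank(object):\n    file_ext = 'genbank'\n")

sys.path.insert(0, root)
package = importlib.import_module("demo_pkg")

# The file exists on disk, but nothing has imported it yet, so the package
# has no "csequence" attribute and getattr( package, "csequence" ) would fail.
before_import = hasattr(package, "csequence")

# Importing the submodule binds it as an attribute of its parent package.
importlib.import_module("demo_pkg.csequence")
after_import = hasattr(package, "csequence")

# before_import is False, after_import is True
print(before_import)
print(after_import)
```

This is why adding the import to registry.py makes the getattr lookup succeed.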
Adding a Sniffer
Datatypes can be "sniffed": their formats can be automatically detected from their contents. For GenBank files that's extremely easy to do, since the first five characters will be LOCUS, according to section 3.4.4 of the specification.
To implement this in our datatype, we first have to add the relevant sniffing code to our GenBank class in csequence.py (the sniff method in the complete listing at the end of this page).
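As a rough standalone sketch of that check (plain Python, not Galaxy's API): the function below, with the hypothetical name looks_like_genbank, reads the first five characters of a file and compares them to LOCUS. Inside the GenBank class the same logic would live in a sniff( self, filename ) method, as in the complete listing at the end of this page. The sample file contents are made up for the demo and are not valid GenBank records.

```python
import os
import tempfile


def looks_like_genbank(filename):
    """Return True if the file starts with the five characters 'LOCUS'."""
    with open(filename) as handle:
        header = handle.read(5)
    return header == "LOCUS"


def write_temp(text):
    """Write text to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


# Made-up sample contents, just enough to exercise the check.
genbank_path = write_temp("LOCUS       DEMO\n//\n")
fasta_path = write_temp(">seq1\nACGT\n")

print(looks_like_genbank(genbank_path))  # True
print(looks_like_genbank(fasta_path))    # False
```

Reading only five characters keeps the check cheap, which matters because a sniffer may be run against every uploaded file.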
We then have to register the sniffer in datatypes_conf.xml by adding a sniffer entry for the class.
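A sniffer registration might look like the following fragment. It assumes the same module:Class type string used in the datatype line earlier and that the entry goes inside the existing sniffers section of datatypes_conf.xml; check the file shipped with your Galaxy version for the exact layout and for where in the list the entry should go.

```xml
<sniffers>
    <!-- existing sniffer entries stay as they are -->
    <sniffer type="galaxy.datatypes.csequence:GenBank"/>
</sniffers>
```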
Once that's done, restart your server and try uploading a genbank file. You'll notice that the filetype is automatically detected as genbank once the upload is done.
One of the useful things your datatype can do is provide metadata. This is done by declaring metadata entries inside your class with MetadataElement; the complete listing at the end of this page declares one named number_of_sequences.
That MetadataElement is accessible as dataset.metadata.number_of_sequences in any method that takes a dataset parameter. There are a couple of relevant functions you'll want to override here:
set_peek( self, dataset, is_multi_byte=False )
set_meta( self, dataset, **kwd )
The set_peek function determines the blurb of text that appears to users above the preview (the file peek, the first 5 lines of the file), informing them about the metadata of a dataset. For GenBank files, we're probably interested in how many genomes/records are contained within a file. To find out, we need to count the number of times the word LOCUS appears as the first five characters of a line. We'll write a function named _count_genbank_sequences,
which we'll call in our set_meta function, since we're setting metadata about the file.
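The counting step can be sketched as a standalone function, shown here outside the class under the hypothetical name count_genbank_records. In the datatype itself this is the body of _count_genbank_sequences, and set_meta assigns its result to dataset.metadata.number_of_sequences, as in the complete listing at the end of this page. The sample records are made up for the demo.

```python
import os
import tempfile


def count_genbank_records(filename):
    """Count the lines whose first five characters are 'LOCUS'."""
    count = 0
    with open(filename) as gbk:
        for line in gbk:
            if line[0:5] == "LOCUS":
                count += 1
    return count


# Made-up sample: two record headers plus a line that mentions LOCUS
# mid-line, which must not be counted.
fd, path = tempfile.mkstemp(suffix=".gbk")
with os.fdopen(fd, "w") as handle:
    handle.write("LOCUS       DEMO_A\n")
    handle.write("COMMENT     the word LOCUS appears here too\n")
    handle.write("//\n")
    handle.write("LOCUS       DEMO_B\n")
    handle.write("//\n")

record_count = count_genbank_records(path)
print(record_count)  # 2
```

Iterating over the file line by line avoids loading a large GenBank file into memory at once.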
Now we'll need to make use of this in our set_peek override:
    def set_peek( self, dataset, is_multi_byte=False ):
        if not dataset.dataset.purged:
            # Add our blurb
            if (dataset.metadata.number_of_sequences == 1):
                dataset.blurb = "1 sequence"
            else:
                dataset.blurb = "%s sequences" % dataset.metadata.number_of_sequences
            # Get standard text peek from dataset
            dataset.peek = data.get_file_peek( dataset.file_name, is_multi_byte=is_multi_byte )
        else:
            dataset.peek = 'file does not exist'
            dataset.blurb = 'file purged from disk'
This function will be called during metadata setting. Try uploading a multi-record GenBank file and testing it out. If you don't have a multi-record GenBank file, simply concatenate a single-record file with itself a couple of times and upload that.
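If you'd rather script the concatenation, a small helper along these lines would do. The name repeat_file and the file names are hypothetical, and the single-record input here is a made-up stub; in practice you would point it at a real GenBank file.

```python
import os
import shutil
import tempfile


def repeat_file(source, destination, copies):
    """Write `copies` back-to-back copies of source into destination."""
    with open(destination, "wb") as out:
        for _ in range(copies):
            with open(source, "rb") as src:
                shutil.copyfileobj(src, out)


workdir = tempfile.mkdtemp()
single = os.path.join(workdir, "single.gbk")
multi = os.path.join(workdir, "multi.gbk")

# Made-up single-record stub standing in for a real GenBank file.
with open(single, "w") as handle:
    handle.write("LOCUS       DEMO\n//\n")

repeat_file(single, multi, copies=3)

# Each copy contributes one line starting with LOCUS.
with open(multi) as handle:
    locus_lines = [line for line in handle if line.startswith("LOCUS")]
print(len(locus_lines))  # 3
```

The source file needs to end with a newline; otherwise the LOCUS of the next copy would not start a line and would not be counted.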
By now you should have a complete GenBank datatype in csequence.py that looks roughly like the following:
    from galaxy.datatypes import data
    from galaxy.datatypes.metadata import MetadataElement
    import logging
    log = logging.getLogger(__name__)


    class GenBank( data.Text ):
        file_ext = "genbank"

        MetadataElement( name="number_of_sequences", default=0, desc="Number of sequences", readonly=True, visible=True, optional=True, no_value=0 )

        def set_peek( self, dataset, is_multi_byte=False ):
            if not dataset.dataset.purged:
                # Add our blurb
                if (dataset.metadata.number_of_sequences == 1):
                    dataset.blurb = "1 sequence"
                else:
                    dataset.blurb = "%s sequences" % dataset.metadata.number_of_sequences
                # Get standard text peek from dataset
                dataset.peek = data.get_file_peek( dataset.file_name, is_multi_byte=is_multi_byte )
            else:
                dataset.peek = 'file does not exist'
                dataset.blurb = 'file purged from disk'

        def get_mime(self):
            return 'text/plain'

        def sniff( self, filename ):
            # A GenBank file starts with the five characters LOCUS
            with open( filename ) as handle:
                header = handle.read(5)
            return header == 'LOCUS'

        def set_meta( self, dataset, **kwd ):
            """
            Set the number of sequences in dataset.
            """
            dataset.metadata.number_of_sequences = self._count_genbank_sequences( dataset.file_name )

        def _count_genbank_sequences( self, filename ):
            """
            This is not a perfect definition, but should suffice for general usage. It fails to detect any
            errors that would result in parsing errors like incomplete files.
            """
            # Specification for the genbank file format can be found in
            # ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
            # in section 3.4.4 LOCUS Format
            count = 0
            with open( filename ) as gbk:
                for line in gbk:
                    if line[0:5] == 'LOCUS':
                        count += 1
            return count