Custom Genomes

What is a "Custom Reference Genome"?

A reference genome contains the nucleotide sequence of the chromosomes, scaffolds, or contigs for a single species, representative of a specific genome build or release.

In Galaxy, a custom reference genome is a FASTA formatted dataset that can be used in place of a native reference genome with most tools.

  • custom: a dataset from the history loaded by users

  • native: local or cached by administrators (see Admin/DataPreparation)

Overview

There are three basic steps to using a Custom Reference Genome:

  • obtain a FASTA copy of the target genome

  • FTP the genome to Galaxy and load into a history as a dataset

  • set a tool form's options to use a custom reference genome from the history and select the loaded genome


Screencasts & Tutorials

  • Screencast: NGS101-8 Mapping to YOUR Reference
    Topic: demonstrates BWA-MEM tool form options using a custom genome from the history

  • Screencast: Custom Genomes
    Topic: explains format, usage, and how to load a genome that works with tools


Sources

  • UCSC, Ensembl, NCBI/GenBank
  • Other research projects associated with specific genome projects
  • Internal research projects
  • Selected genomes can be found in "Data Libraries" on Main for use at http://usegalaxy.org. Example: hg19 is available for GATK under that sub-directory.


Format

  • Custom Genomes are required to be in FASTA format

  • The data should be formatted as FASTA prior to upload into Galaxy

  • The dataset's datatype must be set to "fasta" after it is loaded (if not automatically assigned)
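Before uploading, it can help to check a genome file locally for the most common deviations from FASTA format. The following is a minimal Python sketch, not a Galaxy tool: the function name `check_fasta` and the specific checks (empty lines, duplicate identifiers, spaces in sequence lines, uneven line wrapping) are assumptions chosen for illustration, and individual tools may be stricter or more lenient.

```python
def check_fasta(lines):
    """Return a list of (line_number, message) for basic FASTA format problems.

    Hypothetical helper: the checks are illustrative, not an official
    definition of what Galaxy or any downstream tool accepts.
    """
    problems = []
    seen = set()
    width = None        # length of the first sequence line in the current record
    short_seen = False  # True once a shorter (presumably final) line was seen
    in_record = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.strip() == "":
            problems.append((number, "empty line"))
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            if not identifier:
                problems.append((number, "missing identifier after '>'"))
            elif identifier in seen:
                problems.append((number, "duplicate identifier " + identifier))
            seen.add(identifier)
            in_record, width, short_seen = True, None, False
            continue
        if not in_record:
            problems.append((number, "sequence before first '>' line"))
        if " " in line:
            problems.append((number, "extra spaces in sequence line"))
        if width is None:
            width = len(line)
        elif short_seen or len(line) > width:
            # Only the last line of a record may be shorter than the others.
            problems.append((number, "inconsistent line wrapping"))
        if len(line) < width:
            short_seen = True
    return problems


# Usage with a local file (the file name is a placeholder):
# with open("my_genome.fa") as handle:
#     for number, message in check_fasta(handle):
#         print(number, message)
```

An empty result only means that none of these particular checks failed; it does not guarantee that every tool will accept the file.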


Custom Builds

Some tools and functions require that the 'database' attribute is assigned or that a Custom Reference Genome is set up as a Custom Build prior to use. Examples are the tool Extract Genomic DNA, certain Picard tools, and the function Visualization.

Once created, a Custom Build is added to the list Database/Build: on the dataset 'Edit Attributes' and 'Upload File' tool forms and is available for 'Visualizations'. These can be assigned or used just like any other reference genome.

  • Start with an existing fasta Custom Reference Genome in your history
  • Go to the top "User" menu and select "Custom Builds"
  • Enter in the labels (no spaces and no special characters other than "_")
  • Select the fasta Custom Reference Genome
  • Submit and wait for the build to finish loading before assigning to a dataset or using to start a new Visualization
  • Note: It is fine to navigate away from this form and come back later to check the status. The larger the fasta file and the busier the Galaxy instance, the longer the processing will take.
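The label rule in the steps above (no spaces and no special characters other than "_") can be checked ahead of time. This is a small sketch under an assumption: it treats letters, digits, and "_" as the allowed characters, which is one reading of the rule and not a statement of what the Galaxy form actually enforces. The name `is_valid_build_label` is hypothetical.

```python
import re

# Assumption: "no special characters other than _" is read as
# letters, digits, and underscore only.
_LABEL_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_build_label(label):
    """Return True if the whole label uses only the assumed allowed characters."""
    return _LABEL_PATTERN.fullmatch(label) is not None
```

For example, a label such as `my_genome_v2` passes this check, while `my genome` and `my-genome` do not.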


Sorting

Many tools expect reference genomes to be sorted in lexicographical order. These tools are often downstream of the initial mapping tools, which means that a large investment in a project has already been made (e.g. a long mapping process) before a sorting problem surfaces in the downstream analysis tools. No one likes to start over!

How to avoid? Always sort your FASTA reference genome dataset at the beginning of a project. Many sources only provide sorted genomes, but double checking is your own responsibility, and super easy in Galaxy. So easy that there isn't even a shared workflow, just a recipe (but feel free to make your own):

quick lexicographical sort recipe:

1. Convert Formats -> FASTA-to-Tabular
2. Filter and Sort -> Sort
       on column: c1 
       with flavor: Alphabetical
       everything in: Ascending order
3. Convert Formats -> Tabular-to-FASTA

The above sorting method works for most tools, but not all. In particular, GATK tools have a tool-specific sort order requirement; see the Broad Institute FAQ for input format instructions.
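For a genome that is being prepared locally before upload, the lexicographical sort recipe can be approximated outside Galaxy. The sketch below is a rough Python equivalent, not the Galaxy tools themselves: `read_fasta` and `sort_fasta` are hypothetical names, the sort key is assumed to be the whole title line (as when FASTA-to-Tabular keeps the title in column c1), and Python's string ordering may differ from Galaxy's Sort tool in edge cases such as mixed-case identifiers.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs; title is the '>' line without the '>'."""
    title, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def sort_fasta(lines, width=60):
    """Return FASTA lines with records sorted by title, rewrapped to `width`."""
    records = sorted(read_fasta(lines), key=lambda record: record[0])
    out = []
    for title, sequence in records:
        out.append(">" + title)
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return out


# Usage with local files (file names are placeholders):
# with open("my_genome.fa") as src, open("my_genome.sorted.fa", "w") as dst:
#     dst.write("\n".join(sort_fasta(src)) + "\n")
```

Note that in lexicographical order "chr10" comes before "chr2". That is expected for this kind of sort and is different from a natural (numeric) chromosome order. The sketch holds the whole genome in memory, which may not suit very large genomes.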

Troubleshooting

If a custom genome dataset is producing errors, clicking on the green bug icon will often provide a description of the problem. This does not automatically submit a bug report, and it is not always necessary to do so, but it is a good way to get more information about why a job is failing.

Common problems and solutions:

1. Custom genome not assigned as FASTA format

  • Symptoms: Dataset not included in the custom genome pull-down menu on tool forms

  • Tests: Check the datatype assigned to the dataset

  • Solution: Click on the dataset's pencil icon to reach the "Edit Attributes" page, and in the datatypes section, type in "fasta", and save

2. Incomplete file load

  • Symptoms: Sometimes none if all steps are run in Galaxy, or the problem shows up only downstream as inconsistencies in the data analysis. Errors can appear if some steps (such as Tophat) are run outside of Galaxy, but later steps (such as Cufflinks) are run in Galaxy.

  • Tests: Use Text Manipulation → Select last lines from a dataset to check the last 10 lines and see if the file is truncated

  • Solution: Reload the file (switch to FTP if not already using it). If FTP was used for the prior load, check your FTP client logs.

3. Extra spaces, extra lines, inconsistent line wrapping, any deviation from strict FASTA format

  • Symptoms: RNA-seq tools (Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff, but not Tophat) fail with the error "Error: sequence lines in a FASTA record must have the same length!"

  • Tests: Test and correct the file locally then re-upload, or test/fix within Galaxy, then re-run

  • Solution: Start with FASTA manipulation → FASTA Width formatter with a value between 40-80 (60 is common) to reformat wrapping. Next, use Filter and Sort → Select with ">" to examine identifiers. Use a combination of Convert Formats → FASTA-to-Tabular, Text Manipulation tools, then Tabular-to-FASTA to correct. Finally, use Filter and Sort → Select with "^$" to search for empty lines (use "NOT matching" to remove).

4. Inconsistent line wrapping, common if merging chromosomes from various Genbank records (e.g. primary chroms with mito)

  • Symptoms: Tools (SAMTools, Extract Genomic DNA, but rarely alignment tools) may complain about unexpected line lengths or missing identifiers

  • Tests: Test and correct the file locally then re-upload, or test/fix within Galaxy, then re-run

  • Solution: Use FASTA manipulation → FASTA Width formatter with a value between 40-80 (60 is common) to reformat and re-run

5. Unsorted genome

  • Symptoms: Tools such as Extract Genomic DNA report problems with sequence lengths, or GATK tools end with an error

  • Tests: First try sorting in Galaxy and re-run. If the problem persists, test and correct the file locally then re-upload, or test/fix as for #3 above

  • Solution: To sort, follow the instructions in the Sorting section above

6. Identifier and Description in ">" lines used inconsistently by tools in the same analysis

  • Symptoms: Will generally manifest as a false genome-mismatch problem. Solution is to get rid of the description content and re-run the workflow.

  • Tests: Double check that the same reference genome was used for all steps and that the 'identifiers' are a match

  • Solution: To drop the description, run Convert Formats → FASTA-to-Tabular, splitting the identifier line into 2 columns, then run Convert Formats → Tabular-to-FASTA, omitting the column with the description (c2)

7. Unassigned database

  • Symptoms: Tools report that no build is available for the supplied reference genome

  • Tests: This occurs with tools that require an assigned database attribute

  • Solution: Create a Custom Build and assign it to the dataset
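For files being corrected locally before re-upload, the fixes for problems 3, 4, and 6 (empty lines, inconsistent wrapping, and description text on ">" lines) can be combined in one pass. The following Python sketch is an illustration only and is not what the Galaxy tools run: `clean_fasta` is a hypothetical name, the default width of 60 follows the "60 is common" guidance, and removing spaces inside sequence lines is an assumption about what "extra spaces" means.

```python
def clean_fasta(lines, width=60):
    """Return FASTA lines with descriptions dropped, empty lines removed,
    spaces stripped from sequence lines, and uniform wrapping at `width`.
    """
    out = []
    sequence = []

    def flush():
        # Emit the buffered sequence of the current record, rewrapped.
        joined = "".join(sequence)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        sequence.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop empty lines
        if line.startswith(">"):
            flush()
            parts = line[1:].split()
            # Keep only the identifier (text up to the first whitespace).
            out.append(">" + (parts[0] if parts else ""))
        else:
            sequence.append(line.replace(" ", ""))
    flush()
    return out


# Usage with local files (file names are placeholders):
# with open("my_genome.fa") as src, open("my_genome.clean.fa", "w") as dst:
#     dst.write("\n".join(clean_fasta(src)) + "\n")
```

Dropping descriptions changes the ">" lines, so the cleaned genome should be used for every step of the analysis, not swapped in partway through.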



A problem or not a problem? Certain job errors with RNA-seq tools can at first look like a format problem with a custom reference genome, but are actually a bit more complicated:

  • Cufflinks, Cuffmerge, or Cuffdiff reports a missing or problematic transcripts.gtf file. This generally indicates a mismatch in the chromosome identifiers between the reference genome used for the original (Tophat) alignment, the reference annotation GTF data, and the reference genome used for the current step.
  • The problem can sometimes be corrected by altering the chromosome identifiers in the GTF file or the reference genome (see the RNA-seq FAQ: http://main.g2.bx.psu.edu/u/jeremy/p/transcriptome-analysis-faq).

  • A quick solution is to not use the GTF file and/or to turn off the bias correction option on the tool form.

  • The best solution is to use the same exact reference genome for all steps in the same analysis pipeline. Alignment tools (BWA, Bowtie, Tophat) are generally tolerant of minor formatting problems with reference genomes. However, downstream tools tend to have more stringent format requirements. To avoid having to reprocess, a best practice is to verify that the formatting is correct before any steps are started.
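A chromosome identifier mismatch between a reference genome and a GTF annotation can be spotted before any jobs are run by comparing the two sets of names. The sketch below is a minimal Python illustration with hypothetical function names; it assumes the FASTA identifier is the text after ">" up to the first whitespace, and that the GTF is tab-separated with the sequence name in the first column and comment lines starting with "#".

```python
def fasta_identifiers(lines):
    """Collect identifiers (text after '>' up to the first whitespace)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def gtf_identifiers(lines):
    """Collect sequence names from the first tab-separated GTF column."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        ids.add(line.rstrip("\r\n").split("\t", 1)[0])
    return ids


def compare_identifiers(fasta_lines, gtf_lines):
    """Return (names only in the GTF, names only in the genome), both sorted."""
    genome = fasta_identifiers(fasta_lines)
    annotation = gtf_identifiers(gtf_lines)
    return sorted(annotation - genome), sorted(genome - annotation)


# Usage with local files (file names are placeholders):
# with open("my_genome.fa") as fa, open("annotation.gtf") as gtf:
#     only_gtf, only_genome = compare_identifiers(fa, gtf)
```

Names that appear only in the GTF (for example "1" when the genome uses "chr1") are the ones most likely to cause the errors described above. Names that appear only in the genome may simply have no annotation.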