Learn/Uploading and Analyzing Genotype Data in Plink Format


Analysing SNP genotype data

Galaxy has a set of tools for single nucleotide polymorphism (SNP) data in Plink-compatible formats. Many of the tools run the Plink package directly; others include Eigenstrat and Galaxy-specific implementations of Manhattan plots and GRR. There are methods for evaluating and cleaning genotype data, including QC reporting, ancestry PCA outlier detection and cryptic relatedness detection, together with statistical methods for case-control, GLM and TDT analyses. Before using these tools on your own data, you will need to format your data correctly and then upload it into a Galaxy history.

Data format requirements

As with all Galaxy tools, data must be correctly formatted before upload.

The Rgenetics SNP tools take genotype, pedigree and map data in one of the two data formats that Plink uses. The first is text files in linkage pedigree format: two separate files that must always go together, eg mygeno.ped and mygeno.map. Plink also has an internal compressed format that is far more space efficient and requires three separate files that must always go together, eg mygeno.bim, mygeno.bed and mygeno.fam. Please read the descriptions at the Plink site carefully and make sure your data is correct before proceeding any further. Badly formatted data will not work properly and is by far the most common problem we see.
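A quick consistency check before upload can catch the most common formatting problem in linkage pedigree files. The sketch below is illustrative only and is not part of Galaxy or Plink: the function name check_lped is hypothetical, and it assumes the standard whitespace-delimited layout in which each .ped line has 6 leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two allele columns per SNP, with one SNP per non-blank line of the .map file. Files that use other Plink options (for example compound genotypes) would need a different check.

```python
def check_lped(ped_path, map_path):
    """Return a list of problems found in a ped/map pair (empty if none).

    Assumes the standard layout: 6 leading columns in each .ped line,
    then 2 allele columns for every SNP listed in the .map file.
    """
    problems = []

    # One non-blank line per SNP in the map file.
    with open(map_path) as fh:
        n_snps = sum(1 for line in fh if line.strip())

    expected = 6 + 2 * n_snps
    with open(ped_path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != expected:
                problems.append(
                    f"{ped_path} line {lineno}: {len(fields)} columns, "
                    f"expected {expected} for {n_snps} SNPs"
                )
    return problems


if __name__ == "__main__":
    import sys

    for problem in check_lped(sys.argv[1], sys.argv[2]):
        print(problem)
```

Run it as a script with the .ped path followed by the .map path; no output means every line has the expected number of columns.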

To use the Rgenetics Galaxy tools, you first need to get your data into a Galaxy history, and there is one very important step when using the upload tool for genotype data. Normally, the upload tool will "guess" the kind of data you have, but the plink/rgenetics lped and pbed (compressed) formats are special 'composite' Galaxy datatypes, because the map and pedigree/genotype files need to be kept together correctly inside Galaxy. As a result, the upload tool requires that you explicitly set the file type so that all of the components can be properly uploaded and stored together.

For example, to upload pbed data from your local desktop, choose 'Upload file' from the Get Data tools.

When the upload form appears, you must change the default ('Autodetect') that appears in the filetype select box of the upload form to the specific rgenetics datatype: either 'pbed' if you have bim/bed/fam compressed plink data, or 'lped' if you have uncompressed ped/map plink genotype data. Type the first few letters (eg 'lp' or 'pb') into the first box, and select the right one from the list that appears.

Once this is done, the upload tool form will change dramatically. Instead of one upload box, there will be three separate file upload inputs for pbed (one each for the plink xxx.bim, xxx.bed and xxx.fam files, where xxx is the name you set when you ran plink to create the files), or two separate file upload inputs for uncompressed linkage format (the .ped and .map files).

For each of the two or three boxes, you can browse for the appropriate file on your local machine. Be careful not to mix them up, because the upload tool cannot tell them apart. If you have uploaded your data using FTP (see below), select the files from your FTP uploads instead.
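Because the upload form cannot detect a mixed-up file, it may help to confirm locally that all three components of a compressed fileset are present and that the .bed file really is a Plink binary genotype file. The sketch below is a hypothetical helper (check_pbed is not a Galaxy or Plink function). It relies on the fact that Plink binary .bed files begin with the two bytes 0x6C 0x1B; everything else about the files is left unchecked.

```python
from pathlib import Path

# Plink binary genotype (.bed) files start with these two magic bytes.
PBED_MAGIC = bytes([0x6C, 0x1B])


def check_pbed(basename):
    """Return a list of problems for basename.bed/.bim/.fam (empty if none)."""
    problems = []

    # All three components must exist and share the same basename.
    for ext in (".bed", ".bim", ".fam"):
        if not Path(basename + ext).is_file():
            problems.append(f"missing {basename}{ext}")

    # A genomic-interval BED file or a text file misnamed as .bed
    # will not start with the Plink magic bytes.
    bed = Path(basename + ".bed")
    if bed.is_file():
        with open(bed, "rb") as fh:
            if fh.read(2) != PBED_MAGIC:
                problems.append(f"{bed} does not start with the Plink magic bytes")
    return problems
```

For example, check_pbed("mygeno") looks for mygeno.bed, mygeno.bim and mygeno.fam in the current directory and returns an empty list when nothing looks wrong.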

At the bottom of the form, change the genome build to the appropriate one (eg hg18 or hg19) for your SNP data.

Finally, change the 'metadata value for basename' (which will become the new dataset name in your history at the end of the upload step) to something that will remind you what the data are - something more meaningful than the default 'rgenetics'.

Click 'execute' to upload the data and create the new dataset in your history. Compressed (pbed) format is preferred because it takes less space and the upload is far quicker.
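If your data are currently in ped/map format, Plink itself can produce the compressed files before you upload. The sketch below wraps the standard Plink invocation (--file, --make-bed and --out are real Plink options) in two hypothetical Python helpers. It assumes an executable named plink is on your PATH and that mygeno.ped and mygeno.map sit in the working directory; adjust the names to suit your own data.

```python
import shutil
import subprocess


def make_bed_command(basename, out=None):
    """Build the plink command that converts basename.ped/.map to .bed/.bim/.fam."""
    return ["plink", "--file", basename, "--make-bed", "--out", out or basename]


def convert_to_pbed(basename, out=None):
    """Run the conversion; raises if plink is missing or exits with an error."""
    if shutil.which("plink") is None:
        raise RuntimeError("plink executable not found on PATH")
    subprocess.run(make_bed_command(basename, out), check=True)
```

Calling convert_to_pbed("mygeno") would run plink on mygeno.ped and mygeno.map and, if it succeeds, leave mygeno.bed, mygeno.bim and mygeno.fam ready to upload as a 'pbed' dataset.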

Note that some tools will autoconvert between lped and pbed, so there may be a delay the first time some tools are run on a new dataset. There are also built-in converters (use the pencil icon) if you need them.

FTP upload

Note that the FTP upload method is preferred and is the only way to upload files > 2GB. For small (<2GB) data sets, uploading directly from your local computer to the Galaxy server will work and is a little less complicated, so that is the method the instructions above describe. If you use FTP, substitute your FTP uploads for your local file system wherever those instructions ask you to browse for a file.
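To decide which route a given fileset needs, you can compare the component file sizes against the 2GB limit. This is a minimal sketch with a hypothetical function name (needs_ftp); it assumes the limit means 2 x 1024^3 bytes, which may differ slightly from how a particular Galaxy server measures it.

```python
import os

# Assumption: "2GB" is taken to mean 2 * 1024**3 bytes.
LIMIT_BYTES = 2 * 1024**3


def needs_ftp(paths, limit=LIMIT_BYTES):
    """Return True if any of the given files is larger than the limit."""
    return any(os.path.getsize(path) > limit for path in paths)
```

For example, needs_ftp(["mygeno.bed", "mygeno.bim", "mygeno.fam"]) returns True if any one of the three files is too large for a direct browser upload.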