Events/GCC2013/Abstracts



Abstracts



Talk Abstracts

Contents

  1. 1 July: Session 1: Reproducible science
    1. Computational Reproducibility is Crucial for Scientific Software Platforms
    2. Reproducible research and the 90/10 rule: Improving the ratio of light script to dark script matter in your Galaxy
    3. Reproducible and automated processing in high-throughput NGS facilities
  2. 1 July: Session 2: Genome analysis
    1. A Galaxy of learning: Bioinformatics tutorials based on Galaxy
    2. The Genomic HyperBrowser: a Galaxy-based web server for analysis of genomic tracks
    3. modENCODE Galaxy: Uniform ChIP-Seq Processing Tools for modENCODE and ENCODE Data
    4. The Galaxy SlipStream Appliance: Galaxy Made Easy
  3. 1 July: Session 3: Application-specific workflow
    1. Single-cell genomics pipeline: from raw reads to phylogenomics using Galaxy
    2. A layered genotyping-by-sequencing pipeline using Galaxy
    3. The Linked2Safety's Galaxy Based Data Analysis Space
    4. Galaxy as an Integration and Workflow Platform for a Cloud Enabled Bio-medical Image Analysis and Image Processing Toolkit
    5. Representation and statistical analysis of 3D chromatin data in a Galaxy framework
  4. 1 July: Session 4
    1. Ion Torrent Semiconductor Sequencing Update
  5. 2 July: Session 5: Interacting with Galaxy
    1. State of the Galaxy
    2. BioBlend - automating bioinformatics with Galaxy and CloudMan
    3. Extension of Galaxy to Utilize Web Services and A Semantic Suggestion Engine
    4. GTrack 1.0: Unified data format providing customizable representation and high-speed analysis performance within Galaxy
  6. 2 July: Session 6: Extending Galaxy
    1. Globus Genomics - An Integrated End to End Sequencing Analysis Platform Powered by Globus Online and Galaxy
    2. Galaxy-P: Beyond Proteomics
    3. DevOps Ignition to reach Galaxy continuous integration
    4. The Clinical Galaxy: A validated platform initiative
  7. 2 July: Session 7: Exploiting Galaxy
    1. Enhancing the Galaxy Tool Shed
    2. How to Create Your Own Web-based, Interactive Visualizations for Galaxy
    3. Managing Galaxy's Built-in Data
    4. Contributing to Galaxy

1 July: Session 1: Reproducible science

Computational Reproducibility is Crucial for Scientific Software Platforms

Victoria Stodden

Slides, Vimeo

Abstract:

It is now well accepted that computation is emerging as central to the scientific enterprise. With this transformation, it is essential to make available the data and code associated with published results, and the means to replicate and rerun the computational experiments. Software platforms that enable the management of scientific workflow and dissemination of reproducible results form a central part of the future of computational science. In this talk I describe the "reproducible research movement," a grassroots effort across many fields, and discuss the rapidly changing federal policies requiring public access to scientific data and results. Finally, I will present challenges facing the reproducible research movement, including cyberinfrastructure design and funding, reward mechanisms, and accessibility.

Biography:

Victoria is an assistant professor of Statistics at Columbia University, and is affiliated with the Columbia University Institute for Data Sciences and Engineering. She completed her PhD in statistics and her law degree at Stanford University. Her research centers on the multifaceted problem of enabling reproducibility in computational science. This includes studying adequacy and robustness in replicated results, designing and implementing validation systems, developing standards of openness for data and code sharing, and resolving legal and policy barriers to disseminating reproducible research.

Victoria is also

  • the developer of the Reproducible Research Standard, a suite of open licensing recommendations for the dissemination of computational results.

  • a co-founder of RunMyCode.org, an open platform for disseminating the code and data associated with published results, and enabling independent and public cloud-based verification of methods and findings.

  • creator and curator of SparseLab, a collaborative platform for reproducible computational research in underdetermined systems.


Reproducible research and the 90/10 rule: Improving the ratio of light script to dark script matter in your Galaxy

Ross Lazarus

Ross Lazarus1, Antony Kaspi1, Mark Ziemann1 and The Galaxy Team 2.

Slides, Vimeo

Scientific progress in biology relies on valid experimental results that can be independently reproduced. This fundamental requirement for good science presents a tough challenge for automated bioinformatic analysis services. Experimental complexities vary widely, and rapid rates of change in molecular technologies and in popular tools mean that even well-established, reproducible automated workflows inevitably require tinkering to get them to work for some data. In typical academic laboratories, well-run automated analysis systems might hope to support at best about 80-90% of routine analyses, but not all.

Sometimes a new framework capability is needed, but even if writing a quick script to perform the task is relatively trivial, manually writing and installing the wrapper code to integrate a new framework tool is not. Small scripts may be written and run to transform data outside the framework to complete an analysis if time is constrained. Quick manual fixes are unlikely to be documented well or to reach a source code repository. Functional but hidden, we refer to these as bioinformatic dark script matter (DSM), which, in contrast to automated, documented, secure and visible light script matter, may soon be forgotten or lost. Analyses involving DSM may not be reliably reproducible.

An installable Galaxy tool will be demonstrated that reproducibly runs user-supplied scripts in popular scripting languages. It optionally generates a complete, new Galaxy tool in Tool Shed shareable form, wrapping the supplied script. The importance and implications of these integrated Galaxy components for minimizing DSM in reproducible research will be briefly reviewed.


Reproducible and automated processing in high-throughput NGS facilities

Luca Pireddu (http://www.crs4.it/crs4/peopledetails/people/148/Luca_Pireddu)

Gianmauro Cuccuru1, Giorgio Fotia1, Josh Moore2, Luca Lianas1, Luca Pireddu1, Jason Swedlow2, Gianluigi Zanetti1

  • 1 CRS4, Pula, Sardegna, Italy
    2 Wellcome Trust Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dundee, Scotland, UK

Slides, Vimeo

As the rate of samples to process in high-throughput sequencing facilities increases, performing and tracking the center’s operations becomes increasingly difficult, costly and error-prone, while processing the massive amounts of data poses significant computational challenges.

We present our ongoing work to accelerate, automate and track all data-related procedures at the CRS4 Sequencing Platform by integrating Galaxy with other state-of-the-art processing technologies, such as Hadoop, OMERO and iRODS.

In our model, data processing pipelines are implemented as one or more Galaxy workflows. Through our integration work, in addition to conventional tools, Galaxy is able to drive high-performance Hadoop-based processing tools. With all workflows, Galaxy tracks the processing steps applied to data through its histories; as data sets are generated, these histories are extracted and stored in our OMERO.biobank, thus documenting the data and ensuring reproducibility. The data itself, on the other hand, is committed to iRODS, which provides a single file repository independent of the storage infrastructure. A custom “automator” daemon is the final component required to drive the system. It launches and monitors Galaxy workflows, links workflows to each other (e.g., executing sample-based workflows after a flowcell-based workflow), and passes information between components (i.e., saving data sets and histories in OMERO.biobank, committing files to iRODS, etc.).

Currently, the system is in its testing phase and is on schedule to be in production at CRS4 by May 2013. In addition, future extensions will allow it to be used to process data from other sources, such as mass spectrometers and digital microscopes.


1 July: Session 2: Genome analysis

A Galaxy of learning: Bioinformatics tutorials based on Galaxy

Simon Gladman

Simon Gladman1,2, Mahtab Mirmomeni1, Andrew Lonie1

  • 1 Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative, Melbourne, Victoria, Australia.
    2 Victorian Bioinformatics Consortium, Monash University, Victoria, Australia.

Slides, Vimeo

The Australian government has funded the development of a Genomics Virtual Laboratory (GVL): a set of analysis & visualisation platforms (currently Galaxy and UCSC genome browser) implemented on the Australian Research Cloud infrastructure (http://www.nectar.org.au/research-cloud), and community resources for best practice genomics, including protocols, workflows and tutorials for common genomics tasks. 

We have written a number of tutorials for common bioinformatic tasks using Galaxy as the delivery platform. The areas covered include de novo assembly, variant calling (basic and advanced), DGE analysis and others. The tutorials use real data and best practice tools to teach users the concepts of the analyses without the hassle of having to learn the command line at the same time.

The process of developing these tutorials involved: designing the workflow; wrapping the relevant tools into Galaxy with their associated scripts and tool/repository dependencies; making the data sets available via published Galaxy histories; making the tools available/installable via the toolshed; and producing the tutorial documentation. The tutorials were then extensively tested before being presented to the Australian genomics research community.

This talk will be about our experiences in the development of the tutorials, including the incorporation of the tools into Galaxy, the use of the toolshed and the feedback received from giving the tutorials.

The tutorials and other associated resources are freely available at http://www.genome.edu.au.


The Genomic HyperBrowser: a Galaxy-based web server for analysis of genomic tracks

Geir Kjetil Sandve, Eivind Hovig

Geir K Sandve1,2, Sveinung Gundersen3, Morten Johansen4, Ingrid K Glad5, Krishanthi Gunathasan6, Lars Holden7, Marit Holden7, Knut Liestøl1,2, Ståle Nygård8, Vegard Nygaard4, Jonas Paulsen1,4, Halfdan Rydbeck1,3,7, Kai Trengereid1, Trevor Clancy3, Finn Drabløs9, Egil Ferkingstad7, Matúš Kalaš10,11, Tonje Lien5, Morten B Rye9, Arnoldo Frigessi7,12 and Eivind Hovig1,3,4,7

  • 1 Department of Informatics, University of Oslo, Norway
    2 Centre for Cancer Biomedicine, University of Oslo, Norway
    3 Department of Tumor Biology, Institute for Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, Po Box 4950 Nydalen, 0424 Oslo, Norway
    4 Institute for Medical Informatics, The Norwegian Radium Hospital, Oslo University Hospital, Norway
    5 Department of Mathematics, University of Oslo, Norway
    6 Department of Medical Biology, Faculty of Health Science, University of Tromsø, Norway
    7 Statistics For Innovation, Norwegian Computing Center, Norway
    8 Bioinformatics core facility, Oslo University Hospital and University of Oslo, Norway
    9 Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Norway
    10 Department of Informatics, University of Bergen, Norway
    11 Computational Biology Unit, Uni Computing, Norway
    12 Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Norway

Slides, Vimeo

The immense increase in availability of genomic scale data sets, such as those provided by the ENCODE and Roadmap Epigenomics projects, allows individual researchers to analyze and query relations between genomic tracks at an unprecedented level.

The Genomic HyperBrowser (http://hyperbrowser.uio.no/test) is an open-ended, Galaxy-based web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser opens up a range of genomic investigations related to, e.g., gene regulation, disease association or epigenetic modifications of the genome. A main tool offers a set of 56 descriptive statistics and 20 hypothesis tests on properties of individual tracks and relations between tracks. The HyperBrowser hosts a further 40 purpose-built tools for the broader analysis setting, including tools that support typical needs for generating and customizing genomic track data prior to analysis, as well as tools for visualization and more specialized analysis of genomic tracks.


modENCODE Galaxy: Uniform ChIP-Seq Processing Tools for modENCODE and ENCODE Data

Quang M. Trinh

Quang M. Trinh1, Fei-Yang (Arthur) Jen1, Ziru Zhou1, Kar Ming Chu1, Marc D. Perry1, Ellen Kephart1, Sergio Contrino2, Peter Ruzanov1, Lincoln D. Stein1,3

  • 1 Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, ON, Canada M5G 0A3.
    2 Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK.
    3 Department of Molecular Genetics, University of Toronto, 1 Kings College Circle, Toronto, ON Canada M5S 1A8.

PDF, PPTX, Vimeo

Funded by the National Institutes of Health, the aim of the modENCODE project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for the model organisms C. elegans and D. melanogaster. With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data have already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately, this can be a time-consuming and logistically challenging proposition.

In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy, on the public Amazon Cloud, and on the private Bionimbus Cloud for genomic research. In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies. These resources provide a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to spend more of their time using modENCODE data, and less time moving it around.


The Galaxy SlipStream Appliance: Galaxy Made Easy

Anushka Brownley

Slides, Vimeo

IT infrastructure and support can be a bottleneck when trying to analyze data. The Galaxy SlipStream Appliance is designed to reduce the IT and administrative burden of running a production instance of the Galaxy software package and its underlying tools. SlipStream integrates a powerful computational infrastructure and storage system in a desktop server to provide a dedicated resource to quickly analyze data with Galaxy.


1 July: Session 3: Application-specific workflow

Single-cell genomics pipeline: from raw reads to phylogenomics using Galaxy

Lionel Guy

Lionel Guy and Thijs Ettema

Slides, Vimeo

Only a few percent of prokaryotes are cultivable: the vast majority of them remains to be discovered, but recent technological developments in single-cell genomics now grant us access to this unknown diversity. To answer questions about microbial diversity and even how complex life forms emerged, we sample novel, deep-rooting taxa from a wide variety of environments.

Cells extracted from samples are sorted in 384-well plates using fluorescence-activated cell sorting (FACS). They are lysed, and their genome is amplified with multiple displacement amplification (MDA), yielding a single-cell amplified genome (SAG). SAGs representing cells of interest are multiplexed and sequenced with next-generation sequencing. Up to 100 SAGs can be sequenced on one lane of an Illumina HiSeq2500, and we plan to sequence several thousand SAGs every year.

The Galaxy platform was chosen to handle the vast amount of data generated in the single-cell genomics pipeline. The platform will be integrated in our LIMS system, keeping track of all samples and SAGs present in our lab. Three pipelines are being designed: the first takes raw reads from the sequencing runs as input, performs quality and contamination checks, assembles SAGs, and outputs scaffolds. The second is dedicated to physical and functional annotation of genomes. The third will perform comparative genomics, assemble sets of orthologous genes from our SAGs and published genomes, and perform phylogenomics on the aligned sequences, yielding high-quality phylogenetic trees.

By pairing single-cell genomics with advanced bioinformatics, we aim to shed some light on the deep roots of the Tree of Life.


A layered genotyping-by-sequencing pipeline using Galaxy

Simon Guest

Rudiger Brauning, Simon Guest, Alan McCulloch, Russell Smithies

Slides, Vimeo

At AgResearch, around 300 scientists perform data-intensive biological research in the areas of animal and plant performance. A recent milestone was processing our billionth NZ sheep genotype. Genotyping-by-sequencing (GBS) is becoming an increasingly important technique underpinning both applied and pure research. Potential practical benefits include rapid genetic improvement and adaptation in agricultural species, while scientific advances such as novel variation discovery and the geographic, cultural and historical mapping of populations via extremely dense molecular markers are made possible. GBS pipelines are both scientifically and computationally challenging, involving both high throughput and many sequential steps, some of which (such as alignment of error-prone short reads against high-diversity, large-genome species) require the use of High Performance Computing (HPC) resources and approaches such as parallelisation. These HPC approaches necessarily involve the context-sensitive fragmentation of data into smaller packets for distribution across an HPC cluster, and re-assembly of output files before progressing to the next step. Yet it is important that utilisation of HPC resources by GBS workflows is transparent to end users, so that proposed, in-progress and completed workflows that are to be shared and reviewed by project scientists present a clear, auditable end-to-end view of the data pipeline, without low-level HPC clutter. We present a GBS pipeline we have developed using Galaxy workflows, which uses a layered approach, including low-level HPC conditioning of tool commands in such a way that HPC is transparent to users. We outline advantages and disadvantages of our approach and discuss ideas for future improvement.
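The fragment, distribute and re-assemble pattern described in this abstract can be illustrated with a minimal Python sketch. It is not the AgResearch pipeline: the function names, the use of FASTQ input, and the thread pool standing in for an HPC cluster are all assumptions made for illustration, and process_chunk is a hypothetical placeholder for a real per-chunk tool such as a read aligner.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

LINES_PER_RECORD = 4  # a FASTQ record is four lines: header, sequence, '+', qualities


def split_records(lines, records_per_chunk):
    """Yield lists of lines, cutting only on whole-record boundaries."""
    it = iter(lines)
    size = LINES_PER_RECORD * records_per_chunk
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Hypothetical stand-in for a per-chunk tool.

    It only reports the read name and sequence length of each record.
    """
    out = []
    for i in range(0, len(chunk), LINES_PER_RECORD):
        name = chunk[i].lstrip("@").strip()
        seq = chunk[i + 1].strip()
        out.append(f"{name}\t{len(seq)}")
    return out


def run_layered(lines, records_per_chunk=2, workers=4):
    """The user-facing layer: one input, one output, no chunk bookkeeping."""
    lines = list(lines)
    if len(lines) % LINES_PER_RECORD:
        raise ValueError("input does not contain whole FASTQ records")
    chunks = list(split_records(lines, records_per_chunk))
    # The thread pool is an assumption standing in for cluster job submission.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so re-assembly is a flat concat.
        results = pool.map(process_chunk, chunks)
        return [row for part in results for row in part]
```

The point of the layering is that a caller of run_layered never sees chunks or workers, which mirrors the abstract's goal of keeping low-level HPC detail out of the workflow view.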


The Linked2Safety's Galaxy Based Data Analysis Space

Aristos Aristodimou

Aristos Aristodimou, Athos Antoniades, Constantinos Pattichis

  • Department of Computer Science, University of Cyprus, Nicosia

Slides, Vimeo

Linked2Safety (288328) is an FP7 project funded by the European Commission under the area of ICT for health. The vision of the project is to advance clinical practice and accelerate medical research by providing homogenized access to anonymized, aggregated, distributed EHRs, and the tools for analyzing such data. The datasets provided to Linked2Safety contain genetic, phenotypic, drug and adverse event related information. The proposed data analysis space uses a Galaxy-based platform that allows its users to run analyses on the anonymized, aggregated, distributed EHRs. The Galaxy tools that can be used are grouped in the following categories: i) quality control, ii) feature selection, iii) single hypothesis testing, iv) data mining, and v) visualization. The Linked2Safety data analysis space supports: i) the automated storing of all of the statistically significant associations and association rules from the analyses performed; and ii) knowledge extraction that can be used as an alerting tool to provide early identification of adverse events. The proposed data analysis space will initially test more than 300 hypotheses (based on experts’ knowledge and the current literature) on the showcase datasets (diabetes, breast cancer, psychiatric disorders), and will identify the association rules of statistically significant results.


Galaxy as an Integration and Workflow Platform for a Cloud Enabled Bio-medical Image Analysis and Image Processing Toolkit

Piotr Szul, Dadong Wang, Yulia Arzhaeva, Shiping Chen, Alex Khassapov, Neil Burdett, Timur Gureyev, John Taylor, Tomasz Bednarz

  • CSIRO Commonwealth Scientific and Industrial Research Organisation, Australia

Slides, Vimeo

The Cloud Based Image Analysis and Processing Toolbox project, being carried out by CSIRO, is to run on the Australian National eResearch Collaboration Tools and Resources (NeCTAR) cloud infrastructure. It is designed to give Australian researchers access to biomedical image processing and analysis services, integrated within a workflow platform, via remotely accessible user interfaces.

Galaxy was selected as a workflow and integration platform with CloudMan supporting distributed computational capabilities in the cloud environment. Galaxy was extended to support image data types and a number of tools for 2D and 3D image analysis and processing were developed based on the existing CSIRO software packages for quantifying cell features in microscopy, 3D medical imaging and X-ray Computer Tomography.

The presentation explores the adaptation of Galaxy to the domain of image processing and visualization, and showcases a Galaxy installation in the NeCTAR cloud.


Representation and statistical analysis of 3D chromatin data in a Galaxy framework

Jonas Paulsen

Jonas Paulsen1, Tonje G. Lien2, Geir Kjetil Sandve3,4, Lars Holden5, Ørnulf Borgan2, Ingrid K. Glad2 and Eivind Hovig1,3,6

  • 1 Oslo University Hospital, Section for Medical Informatics, The Norwegian Radium Hospital, P.O. Box 4950, Nydalen, N-0424 Oslo, Norway.
    2 University of Oslo, Department of Mathematics, P.O. Box 1053, Blindern, 0316 Oslo, Norway.
    3 University of Oslo, Department of Informatics, P.O. Box 1080, Blindern, 0316 Oslo, Norway.
    4 Centre for Cancer Biomedicine, Faculty of Medicine, University of Oslo, P.O. Box 4950, Nydalen, 0424 Oslo, Norway.
    5 Statistics for Innovation, Norwegian Computing Center, 0314 Oslo, Norway.
    6 Oslo University Hospital, Institute for Cancer Research, Department of Tumor Biology, The Norwegian Radium Hospital, P.O. Box 4950, Nydalen, N-0424 Oslo, Norway.

Slides, Vimeo

The study of chromatin 3D structure has recently gained much focus due to novel techniques, such as Hi-C and ChIA-PET, for detecting genome-wide chromatin contacts utilizing next-generation sequencing. Both the representation and analysis of such data are complex, and appropriate tools are presently lacking.

We are developing user-friendly tools for statistical analysis of 3D interaction data in a Galaxy framework, building on existing software components of the Genomic HyperBrowser. Our main focus has been on developing hypothesis tests and descriptive statistics where the user can ask specific questions concerning the spatial arrangement of genomic elements in three dimensions. 

We show examples of spatial co-localization of chromatin states and fusion transcripts, and show how visualization and descriptive statistics can accompany hypothesis testing to gain biological knowledge of the 3D organization of chromatin.


1 July: Session 4

Ion Torrent Semiconductor Sequencing Update

Mike Lelivelt1

Ion Torrent has invented the first device, a new semiconductor chip, capable of directly translating chemical signals into digital information. The Ion Personal Genome Machine™ Sequencer, launched in December of 2010, delivered 1000X scalability improvements in its first year of commercial availability. The PGM can now deliver over 2 GB of data using the 318 chip with 400 bp read lengths. Ion Torrent released the Ion Proton™ Sequencer in late 2012. The P1 chip routinely generates 12 GB of data across its 165 million microwells with 200 bp read lengths. Both sequencers generate data for a wide variety of applications, including gene panel sequencing, exome analysis, transcript analysis (including whole message, small message, and targeted message), copy number analysis, 16S analysis, and de novo assembly. A review of software development resources will be provided so any interested developer can integrate into the Ion Torrent analysis pipeline.


2 July: Session 5: Interacting with Galaxy

State of the Galaxy

Anton Nekrutenko1 and James Taylor2

Slides, Vimeo

An overview of where the Galaxy Project is and where it is going.


BioBlend - automating bioinformatics with Galaxy and CloudMan

Clare Sloggett1, Nuwan Goonasekera1,2,4, Enis Afgan1,3,4

  • 1 Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne
    2 Victorian eResearch Strategic Initiative (VeRSI), University of Melbourne, Melbourne, Australia
    3 Center for Informatics and Computing (CIR), Ruđer Bošković Institute (RBI)
    4 Galaxy Project (http://galaxyproject.org)

Slides, Vimeo

The Galaxy API allows users and administrators to access a rapidly expanding set of Galaxy functionality via REST commands. CloudMan is a cloud-based job runtime platform, which allows researchers to easily provision scalable 'virtual clusters' to run Galaxy and other applications in a cloud computing environment, and which provides its own REST-based API. 

As a part of Australia’s Genomics Virtual Laboratory project, we created the BioBlend library, a unified API in a high-level language (Python) that wraps the functionality of both Galaxy and CloudMan APIs. BioBlend encapsulates the underlying REST API of the two applications in a format that is more suitable for programming and thus makes it easier for bioinformaticians to automate end-to-end large-data analysis from scratch. Because the end result of a data analysis is still available in the Galaxy environment, the resulting pipeline is highly accessible to collaborators. In combination with CloudMan, it is possible both to provision the required infrastructure and to automate complex analyses over large data sets on an as-needed basis.

The library is easily installable via PyPI and comes with detailed documentation and example scripts. BioBlend is released under the MIT license. Documentation and installation instructions can be found at http://bioblend.readthedocs.org/, and the source code is available at https://github.com/afgane/bioblend/.
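As a rough illustration of the kind of automation the abstract describes, the sketch below uses BioBlend's Galaxy client to create a history, upload a file and list history names. The server URL, API key, history name and file path are placeholders, not values from the talk, and the exact API may differ between BioBlend releases, so treat this as a sketch to check against the BioBlend documentation.

```python
def list_history_names(gi):
    """Return the names of the histories visible to the client `gi`.

    `gi` is expected to behave like bioblend.galaxy.GalaxyInstance.
    """
    return [h["name"] for h in gi.histories.get_histories()]


def demo(url, key, path):
    """Create a history on a Galaxy server, upload one file, list histories.

    All three arguments are supplied by the caller; nothing here is specific
    to any particular Galaxy instance.
    """
    # Imported here so list_history_names can be used without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=url, key=key)
    history = gi.histories.create_history(name="bioblend-demo")
    gi.tools.upload_file(path, history["id"])
    return list_history_names(gi)
```

Calling demo("https://galaxy.example.org", "YOUR_API_KEY", "reads.fastq") with a real server URL and key would run the three steps against that server; the example URL, key and file name shown here are hypothetical.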


Extension of Galaxy to Utilize Web Services and A Semantic Suggestion Engine

Jessica Kissinger

Alok Dhamanaskar1, Akshay Choche1, Michael E. Cotterell1, Jie Zheng5, Christian Stoeckert Jr.5, Jessica C. Kissinger1,2,3,4 and John A. Miller1,2

1 Department of Computer Science, University of Georgia, Athens, GA 30602
2 Institute of Bioinformatics, University of Georgia, Athens, GA 30602
3 Department of Genetics, University of Georgia, Athens, GA 30602
4 Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA 30602
5 Penn Center for Bioinformatics and Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104

PDF, PPTX, Vimeo

Local installations of Galaxy often make extensive use of Galaxy’s workflow tools but are limited to the use of tools provided by the local Galaxy instance. We have created a community tool that allows users to make use of Web services thus freeing them to run applications or access data provided outside of the local installation. Users can link multiple Web services together with existing Galaxy tools to form workflows for complex bioinformatics tasks. However, this process requires that users select appropriate Web service operations from a multitude of available Web services and then link them together in a way that is input-output compatible. To help Galaxy users navigate these issues, we have developed and deployed a REST service called the Service Suggestion Engine (SSE). The SSE makes use of semantically annotated Web service description documents (WSDL) for SOAP Web Services to help users select suitable operations during the workflow construction process. The SSE provides suggestions for steps in either direction. As a proof of concept we have semantically annotated dozens of Web services and used the SSE to construct workflows. To complete this task, we added numerous terms to the Ontology for Biomedical Investigations (OBI) to create OBI-WS, a bioinformatics Web service ontology. 


GTrack 1.0: Unified data format providing customizable representation and high-speed analysis performance within Galaxy

Sveinung Gundersen

Sveinung Gundersen1, Matúš Kalaš2,3, Osman Abul4, Arnoldo Frigessi5,6, Eivind Hovig1,7,8, and Geir Kjetil Sandve8

  • 1 Department of Tumor Biology, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
    2 Computational Biology Unit, Uni Computing, Thormøhlensgate 55, 5008 Bergen, Norway.
    3 Department of Informatics, University of Bergen, Thormøhlensgate 55, 5008 Bergen, Norway.
    4 TOBB University of Economics and Technology, Ankara, Turkey.
    5 Statistics For Innovation, Norwegian Computing Center, 0314 Oslo, Norway.
    6 Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Blindern, 0317 Oslo, Norway.
    7 Institute for Medical Informatics, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
    8 Department of Informatics, University of Oslo, Blindern, 0316 Oslo, Norway.

Slides, Vimeo

A host of alternative formats for representing whole-genome datasets, such as WIG, BED, GFF, and BedGraph, are currently in use, complicating analysis and tool development. The need for different formats is driven partly by the need for extra columns for specific content, and partly by differences in the underlying models of the data. We have delineated fifteen different "track types", representing different intrinsic data models, starting from simple types such as "points" and "segments" to more complex types.

GTrack 1.0 (www.gtrack.no) is a recently defined tabular format that can handle data of all fifteen different track types. It supports customizable specification of columns, customizable value types, as well as graph-type data with weights, improving on the built-in "interval" data format in Galaxy. GTrack can represent the same information as standard formats, in addition to supporting extensions and subtype specifications without the need to rewrite parsers. In addition, GTrack can be used for 3D-type datasets, such as Hi-C data, for which no standard formats exist.

GTrack is fully supported by The Genomic HyperBrowser (hyperbrowser.uio.no). The parsers and underlying binary storage scheme of the HyperBrowser system have now been extracted to a separate Python library. The library makes use of a vectorized storage scheme based on NumPy objects, which allows C-type analysis performance using the Python language. The library also supports most other common data formats, including conversions between them. We believe GTrack and the associated binary library are ideal for use within Galaxy tools as a backbone for high-speed analysis.


2 July: Session 6: Extending Galaxy

Globus Genomics - An Integrated End to End Sequencing Analysis Platform Powered by Globus Online and Galaxy

Ravi Madduri

Paul Dave,1 Ravi Madduri,2 Dina Sulakhe,1 Alex Rodriguez1

  • 1 University of Chicago
    2 Argonne National Laboratory

Slides, Vimeo

In this talk, we will present Globus Genomics, a robust, scalable, and flexible solution that provides end-to-end research data management for Next Generation Sequencing Analysis powered by Galaxy, Globus Online and Amazon Web Services. Globus Genomics integrates data management capabilities of Globus Online to complement the flexible Galaxy workflow environment and allows users to run this integrated solution at scale on cloud-based elastic computational infrastructure. We will describe some of the challenge areas that were targeted with this approach and discuss various successful implementations by leading researchers at the University of Chicago, University of Washington and Washington University in St. Louis. Qualitative and quantitative benefits will be highlighted along with proposed future directions of the integrated Globus Genomics platform.


Galaxy-P: Beyond Proteomics

John Chilton

John Chilton1, James Johnson1, Getiria Onsongo1, Ebbing de Jong1, Pratik Jagtap1, Timothy Griffin2

  • 1 University of Minnesota Supercomputing Institute, Minneapolis, Minnesota, USA
    2 University of Minnesota, Minneapolis, Minnesota, USA

Slides, Prezi, Vimeo

Leveraging the Galaxy framework, the Galaxy-P project has resulted in the creation of many novel tools, workflows, and visualization options for mass spectrometry based data analysis - for applications ranging from standard protein identification to emerging fields (e.g. metabolomics, metaproteomics, and proteogenomics). These developments will be outlined; however, the core of the talk will be about the inverse. In addressing the specific challenges of protein informatics, we have moved the framework forward in ways that can and have benefited Galaxy applications outside of proteomics. This talk will cover two of these core challenges in depth, namely cross-platform job execution and effectively dealing with large collections of files.

Many of the most powerful proteomics applications are Windows only, which posed a real problem with respect to Galaxy integration. Our effort to address this resulted in the LWR - a cross-platform server application actively used by several institutions. We will discuss deploying the LWR, its architecture, and emerging uses.

Galaxy has traditionally been geared toward interacting with a small, fixed number of files concurrently; this contrasts poorly with proteomics, where a biological sample may correspond to any number of peak files. We will discuss core framework contributions for a multiple file selection tool widget and improved batch file submissions to workflows, as well as JGalaxy (a rich client for batch downloading files from Galaxy) and multiple file datasets (which allow standard tools and workflows to operate over variable numbers of inputs and sample tracking throughout complicated analyses).


DevOps Ignition to reach Galaxy continuous integration

Olivier Inizan

Michael Loaec, Olivier Inizan, Jonathan Kreplak, Hadi Quesneville

  • Unité de Recherches en Génomique-Info (UR INRA 1164), INRA, Centre de recherche de Versailles

Slides, Vimeo

DevOps is a software development movement that stresses a close relationship between software developers and netsys admins. The goal is to enhance and speed up the cycle of software production from creation to delivery to final users, with a special focus on quick resolution of users’ issues. We applied this method to the production process for our Galaxy instance.

We started from an initial situation with 2 teams (software developers and netsys admins), one methodology (Agile) and an infrastructure composed of an HPC cluster, a Galaxy server and a suite of homemade tools. From this initial situation, we merged our 2 teams into one with extended skills in new tools and technologies.

This experience was not straightforward. We met unexpected situations that we will discuss in this presentation:

  • technical issues: HPC cluster and Galaxy virtual machine: communications, configurations and dependencies,
  • organisational issues: the developers' environment changed as we moved from isolated personal machines to a collective virtual machine, and developers had to acquire system administration skills.

Galaxy was no longer an application built by developers and installed by a system admin; it was now almost an appliance. We had to change our way of thinking about Galaxy and break down the fence between developers and netsys admins.

The experience gave us new perspectives for improving our development and production processes. Hence, we plan to implement practices and concepts such as continuous integration and the software factory. We will present them as applied to Galaxy instances.


The Clinical Galaxy: A validated platform initiative

Sanjay Joshi

Sanjay Joshi

Slides, Vimeo

With respect to disease, we humans are the manifestation of our rare variants.

As the "clinical effect chasm" engulfs the efforts to understand the "N of 1" trials moving forward, there is a growing need for the re-evaluation of sample sizes in whole genome sequencing and related methods like RNA-seq, ChIP-seq and their upstream validation.

We will present an overview of the requirements to move Galaxy into the Clinical realm.


2 July: Session 7: Exploiting Galaxy

Enhancing the Galaxy Tool Shed

Greg Von Kuster

Greg von Kuster, Daniel Blankenberg

  • Penn State / Galaxy Team

Slides, Vimeo

The Galaxy Tool Shed enables sharing of Galaxy tools, proprietary datatypes, exported Galaxy workflows, data, and more (collectively: “Galaxy Utilities”) across the research community with ease. Tools can be automatically discovered and installed into a local Galaxy environment in real time, and they can easily be deactivated or uninstalled when they are no longer needed. Here, we demonstrate newly developed features of the Galaxy Tool Shed.

Although the Tool Shed has allowed multiple tool versions to be installed at a single time, the Tool Shed now simplifies the process of ensuring that underlying 3rd party tool dependencies are met by providing the option of automated download and installation of underlying dependencies.

A complex repository dependency system has also been implemented. This system allows a Tool Shed repository to depend on any number of other Tool Shed repositories. This powerful feature has formed the basis for the continuing development of best practices for designing Galaxy Tool Shed Utilities.
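A dependency system of this kind implies that repositories must be installed in an order that respects their dependencies. The sketch below shows one way to compute such an order with a topological sort from the Python standard library (Python 3.9 or later). It is only an illustration of the idea, not the Tool Shed's implementation, and the repository names and the `install_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical repositories, each mapped to the repositories it depends on.
dependencies = {
    "bwa_wrappers": {"package_bwa", "package_samtools"},
    "package_samtools": {"package_zlib"},
    "package_bwa": {"package_zlib"},
    "package_zlib": set(),
}


def install_order(deps):
    """Return repository names so that every dependency precedes its dependents."""
    # TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for name in install_order(dependencies):
        print(name)
```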

While the GUI for installing Tool Shed utilities continues to be improved, a RESTful API has been developed to allow automatic scripted installation, greatly simplifying the process of bootstrapping a new or existing Galaxy instance with many individual tool suites.

To ensure the quality of the Tool Shed and the available contributed Utilities, new testing frameworks have been developed for not only testing the Tool Shed feature-set, but also automatic functional verification and testing of community contributions. Additionally, the Intergalactic Utilities Commission (IUC) has been established to provide expert feedback on community contributions.


How to Create Your Own Web-based, Interactive Visualizations for Galaxy

Carl Eberhard

Carl Eberhard1, Jeremy Goecks1, The Galaxy Team1,2, Anton Nekrutenko2, and James Taylor1

  • 1 Emory University
    2 Penn State University

Slides, Vimeo

Visualization plays an integral role in scientific investigation. Visualization is useful for viewing large amounts of data simultaneously, observing patterns and outliers amongst data, and communicating findings to others. To make visualization easier and more powerful in Galaxy, we have created a framework for integrating Web-based visualizations into Galaxy. Just as tools can be easily added to Galaxy, visualizations can now be added as well. Visualizations can be easily accessed via an icon in the history panel’s stored datasets. Galaxy visualizations have many advantages: (i) they can be made highly interactive and customizable; (ii) they require no software or data downloads; and (iii) they can be saved, shared via URL, and included in Galaxy Pages, where shared and included visualizations remain fully interactive.

In this talk, we describe how to create your own visualization for Galaxy. We provide an overview of how to query datasets in Galaxy for both aggregate data as well as individual data points and how to add a data provider for your own data type. Galaxy includes data providers for SAM/BAM, BED, Interval, GFF/GTF, VCF, BedGraph, Wiggle, and BigWig/BigBed. We also discuss Galaxy JavaScript libraries that can be used to create Web-based visualizations. These libraries include support for creating and saving visualizations, for querying and caching data from Galaxy datasets, and for working with Galaxy tools and genome data. Finally, we introduce an XML data format for configuring a visualization to work with Galaxy.


Managing Galaxy's Built-in Data

Dan

Daniel Blankenberg

Slides, Vimeo

Many Galaxy tools rely on having built-in data (e.g. genomic sequences, aligner indexes, etc.) available. Although most tools can alternatively make use of data from a user’s history (e.g. a FASTA dataset), doing so often results in a decrease in performance, as e.g. one-off indexes need to be built by the Galaxy tool each time the dataset is used as a source. Unfortunately, until now, the steps required for generating new built-in data and informing a Galaxy server of its availability have been an error-prone manual process. Here, we demonstrate new Galaxy features that simplify and automate this process.

A new class of Galaxy Utilities, known as Data Managers, has been developed. Data Managers allow an administrator to use the familiar Galaxy tool interface to download or generate the underlying data and automatically populate Galaxy’s internal built-in data registries (i.e. data tables / *.loc files). When a Data Manager finishes processing, new entries are updated (and persisted) in real-time without requiring a server restart. By using a Data Manager, not only does an administrator avoid the common pitfalls associated with manual curation of built-in data, but they also gain the same reproducibility and transparency associated with Galaxy tools.
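For readers unfamiliar with *.loc files, the sketch below shows the kind of bookkeeping a Data Manager takes over. It assumes the common convention of tab-separated lines with `#` comments; the column names, the sample entries and the helper functions are hypothetical and are not Galaxy's own code.

```python
# Hypothetical column layout for a FASTA data table; real tables define their own.
COLUMNS = ("value", "dbkey", "name", "path")

# Hypothetical .loc content: comment line, blank line, then one entry.
SAMPLE_LOC = "# value\tdbkey\tname\tpath\n\nhg19\thg19\tHuman (hg19)\t/data/hg19.fa\n"


def read_loc(text, columns=COLUMNS):
    """Parse .loc-style text into a list of dicts, skipping comments and blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


def add_entry(rows, entry, key="value"):
    """Append an entry unless one with the same key exists; return True if added."""
    if any(row[key] == entry[key] for row in rows):
        return False
    rows.append(entry)
    return True


def format_entry(entry, columns=COLUMNS):
    """Serialize one entry back into a tab-separated .loc line."""
    return "\t".join(entry[c] for c in columns)


if __name__ == "__main__":
    table = read_loc(SAMPLE_LOC)
    add_entry(table, {"value": "mm10", "dbkey": "mm10",
                      "name": "Mouse (mm10)", "path": "/data/mm10.fa"})
    for row in table:
        print(format_entry(row))
```

Doing this by hand for every genome and every index type is where the manual process tends to go wrong, for example through duplicate keys or misplaced tabs.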

Data Managers can be defined locally or installed automatically from a Tool Shed; the framework is flexible and is not restricted to genomic data. Administrators can access them interactively, within Workflows, and via the API. Just as with Galaxy tools, Data Manager jobs can be dispatched across existing compute resources.


Contributing to Galaxy

Dannon Baker

Dannon Baker1, Nate Coraor2, John Chilton3

  • 1 Emory University
    2 Penn State
    3 University of Minnesota Supercomputing Institute

Slides, Vimeo

Galaxy is an open platform for data intensive biomedical research, utilized in many diverse environments. The core team has lots of hands-on experience with the instance at usegalaxy.org, a very large public-facing resource, but that is only one particular environment, and local administrators can have a significantly different set of requirements to address to satisfy local users. As the Galaxy Project grows, contributions from the community will be an increasingly important resource for helping continue to move forward and innovate while remaining a stable platform that users can count on. Anyone can get a copy of Galaxy and modify the framework to suit their needs; if those changes enhance the utility of the Galaxy framework for the community at large, whether they’re a bugfix or a new feature, then it’s incredibly valuable to be able to pull the changes back into the core Galaxy framework to share with others. The primary mechanism for doing this is by issuing a ‘pull request’ in Bitbucket that allows the team to review and merge your changes, and we’ll discuss how and why to create them. It is important to make sure that changes to the framework are both useful and usable to the community as a whole, and to realize that not everything is suitable for inclusion.

In this talk we’ll cover why community contributions are important, highlight significant contributions from past years, and discuss how to get involved with Galaxy development and contribute back to the core framework.


Poster Abstracts

Contents

  1. P1: Towards Large-Scale Language Analysis in the Cloud
  2. P2: Cloud-based Image Analysis and Processing Toolbox
  3. P3: BioBlend - automating bioinformatics with Galaxy and CloudMan
  4. P4: Comparing R-based methods and Cuffdiff2 for analysis of RNA-seq data in Galaxy
  5. P5: Comparison of short read aligners with Galaxy
  6. P6: GigaGalaxy: A GigaSolution for reproducible and sustainable genomic data publication and analysis
  7. P7: Engaging Galaxy in Microbiology
  8. P8: Microbiome profiling on a Galaxy-based framework for Microbiology
  9. P9: Control Free Tumour Analysis with Galaxy
  10. P10: Identification and Epidemiological Surveillance of Bacteria: Web System Development and Evaluation of Intelligent Methods
  11. P11: Running on HPC Galaxy-based workflows for predictive biomarkers from RNA-Seq clinical data
  12. P12: Developing a Web-Based Tool for Analysing Cell Type-Specificity of Genomic Variation Data
  13. P13: Gene Regulatory Network Inference and Analysis using Galaxy
  14. P14: Tools for Genome-wide Analyses of Genomic Divergence
  15. P15: Validation setup for cost-efficient RNA-sequencing of pooled samples
  16. P16: Using Frequent Itemset Mining to Find Sets of Co-Occurring Genomic Tracks
  17. P17: CRAC: A new software based on a combinatorial and integrated approach to analyse RNA-seq reads
  18. P18: Detection of Copy Number Alterations (CNAs) in Paired Exome Sequence Data Sets of Acute Myeloid Leukemia (AML) Patients Using Galaxy
  19. P19: Development of a Moroccan Database for Cancer Care (MD2C)
  20. P20: Toward a French cyber-galaxy?
  21. P21: The Galaxy service pilot in CSIRO – a collaboration between science and IT
  22. P22: Andromeda: NBIC Galaxy at Surfsara's HPC cloud
  23. P23: Implementing next generation web server in Galaxy
  24. P24: Leveraging Canadian Bioinformatics with Galaxy VZ in a HPC center
  25. P25: LiSIs: a Galaxy-based platform for Life Science Informatics Research
  26. P26: LifePortal – the Galaxy based portal for life science at University of Oslo

Odd-numbered abstracts will be presented on Monday, 1 July from 14:55 to 16:10.

Even-numbered abstracts will be presented on Tuesday, 2 July from 14:35 to 15:50.

P1: Towards Large-Scale Language Analysis in the Cloud

Emanuele Lapponi1, Erik Velldal1, Nikolay Vasov2, Stephan Oepen2

  • 1 University of Oslo
    2 USIT

Poster

The Language Analysis Portal (LAP) is a Galaxy-based system that is currently being developed in the context of CLARINO, the Norwegian chapter of the pan-European CLARIN initiative. CLARIN aims at establishing a shared research infrastructure for language technology (LT) that ensures easy access to persistent and interoperable resources and services. Although LAP aims to reach out to a diverse set of user groups, it particularly will facilitate use of language analysis in the social sciences, humanities, and other fields without strong computational traditions. While the development of the portal is still in its early stages, this poster presentation documents ongoing work towards an already operable pilot, providing an overview of the challenges of adapting Galaxy to another domain in terms of UI, interchange formats, tool-adaptation and scalability. The work is carried out at the University of Oslo (UiO) as a joint effort by the Language Technology Group (LTG) and the Research Computing group at the University Center for Information Technology (USIT).


P2: Cloud-based Image Analysis and Processing Toolbox

Tomasz Bednarz, Yulia Arzhaeva, Piotr Szul, Alex Khassapov, Neil Burdett, Dadong Wang, Shiping Chen, Darren Thompson, Tim Gureyev, John Taylor 

Poster

The Cloud-based Image Analysis and Processing Toolbox project runs on the Australian National eResearch Collaboration Tools and Resources (NeCTAR) cloud infrastructure and gives researchers access to biomedical image processing and analysis services via remotely accessible user interfaces. The toolbox is based on software packages and libraries developed over the last 10-15 years by CSIRO scientists and software engineers and includes functionality for: (a) automating the process of quantifying cell features in microscopy images; (b) a 3D medical imaging analysis and visualisation platform popular with researchers and medical specialists working with MRI and PET; and (c) advanced X-ray image analysis and Computed Tomography. Galaxy is used as glue to link the various imaging functions into a fully functional Virtual Laboratory. By providing user-friendly access to cloud computing resources and new workflow-based interfaces, our solution will enable researchers to carry out various challenging image analysis and reconstruction tasks that are currently impossible or impractical due to the limitations of the existing interfaces. Several case studies will be presented at the conference.



P3: BioBlend - automating bioinformatics with Galaxy and CloudMan

Clare Sloggett1, Nuwan Goonasekera1,2,4, Enis Afgan1,3,4

  • 1 Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne
    2 Victorian eResearch Strategic Initiative (VeRSI), University of Melbourne, Melbourne, Australia
    3 Center for Informatics and Computing (CIR), Ruđer Bošković Institute (RBI)
    4 Galaxy Project (http://galaxyproject.org)

Poster

The Galaxy API allows users and administrators to access a rapidly expanding set of Galaxy functionality via REST commands. CloudMan is a cloud-based job runtime platform, which allows researchers to easily provision scalable 'virtual clusters' to run Galaxy and other applications in a cloud computing environment, and which provides its own REST-based API. 

As a part of Australia’s Genomics Virtual Laboratory project, we created the BioBlend library, a unified API in a high-level language (Python) that wraps the functionality of both the Galaxy and CloudMan APIs. BioBlend encapsulates the underlying REST API of the two applications in a format that is more suitable for programming and thus makes it easier for bioinformaticians to automate end-to-end large-data analysis from scratch. Because the end result of a data analysis is still available in the Galaxy environment, the resulting pipeline is highly accessible to collaborators. In combination with CloudMan, it is possible both to provision the required infrastructure and to automate complex analyses over large data sets on an as-needed basis.

The library is easily installable via PyPI and comes with detailed documentation and example scripts. BioBlend is released under the MIT license. Documentation and installation instructions can be found at http://bioblend.readthedocs.org/, and the source code is available at https://github.com/afgane/bioblend/.


P4: Comparing R-based methods and Cuffdiff2 for analysis of RNA-seq data in Galaxy

René Böttcher1,4, Saskia Hiltemann1,2, Bram Stoker2, A. Marije Hoogland3, Leon Mei5, G.J.L.H. van Leenders3, Peter Beyerlein4, Andrew Stubbs2, Guido Jenster1

  • 1 Dept. of Urology, Josephine Nefkens Institute, Erasmus MC, Rotterdam, The Netherlands
    2 Dept. of Bioinformatics, Josephine Nefkens Institute, Erasmus MC, Rotterdam, The Netherlands
    3 Dept. of Pathology, Josephine Nefkens Institute, Erasmus MC, Rotterdam, The Netherlands
    4 Dept. of Bioinformatics, Technical University of Applied Sciences Wildau, Wildau, Germany
    5 Bioassist, Netherlands Bioinformatics Center (NBIC), Nijmegen, The Netherlands

Poster

Background:

Differential expression (DE) and differential exon usage (DEU) in RNA-seq data are commonly investigated by Cufflinks and Cuffdiff at the moment. However, previous work demonstrated that Cuffdiff, prior to version 2, does not capture the biological variation between groups containing many replicates. Therefore, we set out to implement two R-based methods (edgeR and DEXSeq) in Galaxy and to compare their performance with Cuffdiff2.

Methods:

We implemented two workflows based on HTSeq-count (v. 0.5.4p1) as well as edgeR (v. 3.0.4) and DEXSeq (v. 1.4) in our Galaxy environment. After conducting a DE and DEU analysis using default settings for a prostate cancer data set with 9 samples per condition, we evaluated the results of both R-based methods and Cuffdiff (v. 2.0.2).

Results:

We observed that Cuffdiff version 2.0.2 shows a distribution of p-values that depends on the number of samples per condition. When using 9 biological replicates per condition, Cuffdiff does not report any significant genes. In contrast, edgeR and DEXSeq are both able to model increased variance and provide significant results (e.g. 230 genes DE, FDR < 0.05 and 8 genes with DEU, adj. P-value < 0.1) that can be validated subsequently.
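The FDR and adjusted P-value thresholds quoted above refer to p-values corrected for multiple testing. The abstract does not state which correction was applied; as an illustration only, the following sketch implements the Benjamini-Hochberg procedure, a widely used FDR correction. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-gene p-values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f}  adjusted={q:.3f}  significant at FDR 0.05: {q < 0.05}")
```

A gene is then called significant when its adjusted value falls below the chosen threshold, such as 0.05 for DE or 0.1 for DEU in the figures quoted above.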

Conclusion:

Our Galaxy implementations of edgeR and DEXSeq workflows provide accurate high-throughput analysis and enable performance comparisons of different RNA-seq tools in Galaxy. Since Cuffdiff is under active development, we expect an improved release targeting the issues described above. Until then, we recommend adapting the RNA-seq workflow depending on the number of biological replicates per group.


P5: Comparison of short read aligners with Galaxy

Subazini Thankaswamy Kosalai, Jens Nielsen, Intawat Nookaew

  • Systems and Synthetic Biology, Department of Chemical and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden 41296.

Poster

The emergence of next generation sequencing (NGS) technology has led to the rapid production of large-scale data, demanding increased storage resources and computational power. The essential step in NGS analysis is read alignment, or mapping, against a reference genome to determine the desired DNA sequence. The genetic differences between strains revealed by mapping can also be used in variant detection and annotation. It is difficult to determine the position of short reads by mapping, especially in repetitive regions. Many tools developed for short read alignment are publicly available, but most are command-line only. End users, on the other hand, find tools more convenient when they have a user interface. Galaxy is an integrated framework that can help resolve these computational issues by allowing tools to be deployed in the cloud through Galaxy CloudMan. It also allows users to create a well-defined user interface for command-line tools in XML. In this work, we have deployed mappers, or aligners, based on different algorithms in Galaxy CloudMan and compared them for sensitivity and speed with allowed mismatches. XML wrapper files were written to create user-defined interfaces for the command-line mappers, which were deployed in Galaxy so that they can be used for constructing workflows. The challenge is to select a mapping tool with the fundamental priorities of speed, sensitivity and minimal memory usage. We defined criteria for setting parameters suited to a researcher’s project and evaluated the aligners on mapping speed, RAM occupancy, sensitivity and accuracy using short read simulators and some real data.
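When reads come from a simulator, the true origin of each read is known, so sensitivity and accuracy can be computed directly. The abstract does not define these metrics; the sketch below assumes sensitivity is the fraction of all simulated reads placed correctly and accuracy is the fraction of mapped reads placed correctly. The function name, the `tolerance` parameter and the data layout are hypothetical.

```python
def evaluate_alignments(truth, mapped, tolerance=0):
    """Compare aligner output with simulated ground truth.

    truth:  dict of read_id -> (chrom, pos) for every simulated read.
    mapped: dict of read_id -> (chrom, pos) for reads the aligner placed.
    Returns (sensitivity, accuracy) under the definitions assumed above.
    """
    correct = 0
    for read_id, (chrom, pos) in mapped.items():
        true_chrom, true_pos = truth[read_id]
        # A read counts as correct if it lands within `tolerance` bases of its origin.
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
    sensitivity = correct / len(truth) if truth else 0.0
    accuracy = correct / len(mapped) if mapped else 0.0
    return sensitivity, accuracy


if __name__ == "__main__":
    truth = {"r1": ("chr1", 100), "r2": ("chr1", 500),
             "r3": ("chr2", 40), "r4": ("chr2", 900)}
    mapped = {"r1": ("chr1", 100), "r2": ("chr1", 503), "r3": ("chr1", 40)}
    print(evaluate_alignments(truth, mapped, tolerance=5))
```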


P6: GigaGalaxy: A GigaSolution for reproducible and sustainable genomic data publication and analysis

Scott Edmunds1,2, Peter Li1,2, Huayan Gao3,4, Ruibang Luo2, Dennis Chan1, Alex Wong1, Zhang Yong2, Tin-Lap Lee3,4

  • 1 BGI-Hong Kong Ltd., 16 Dai Fu Street, Tai Po Industrial Estate, NT, Hong Kong SAR, China.
    2 BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen, China.
    3 School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
    4 CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.

DOI: 10.6084/m9.figshare.713512, PDF, PPTX

Today's next generation sequencing (NGS) experiments generate substantially more data and are more broadly applicable than previous high-throughput genomic assays. Despite the plummeting costs of sequencing, downstream data processing and analysis create financial and bioinformatics challenges for many biomedical scientists. It is therefore important to make NGS data interpretation as accessible as data generation. GigaGalaxy (http://gigagalaxy.net/) represents an NGS data interpretation solution to the big sequencing data challenge. We have ported the popular Short Oligonucleotide Analysis Package (http://soap.genomics.org.cn) as well as supporting tools such as Contiguator2 (http://contiguator.sourceforge.net) into the Galaxy framework, to provide seamless NGS mapping, de novo assembly, NGS data format conversion and sequence alignment visualization. Our vision is to create an open publication, review and analysis environment by integrating GigaGalaxy into the publication platform at GigaScience and its GigaDB database, which links to more than 17 terabytes of genomic data. We have begun this effort by re-implementing the data procedures described by Luo et al. (GigaScience 1: 18, 2012) as Galaxy workflows so that they can be shared in a manner which can be visualized and executed in GigaGalaxy. We hope to revolutionize the publication model with the aim of executable publications, where data analyses can be reproduced and reused.


P7: Engaging Galaxy in Microbiology

http://www.crs4.it/crs4/peopledetails/people/195/Gianmauro_Cuccuru

Massimiliano Orsini, Gianmauro Cuccuru, Nicola Soranzo, Andrea Pinna, Andrea Sbardellati, Antonella Travaglione, Paolo Uva, Gianluigi Zanetti, Giorgio Fotia

  • CRS4, Pula, Sardegna, Italy

Poster

Next Generation Sequencing is today widely applied in both microbiology and metagenomics for research and diagnostic applications. Setting up the complete workflow to perform downstream analysis requires a significant effort to integrate software and data for each of the post-sequencing steps. While many of the necessary tools are already available in Galaxy, there is currently a lack of a specialized framework in this area. To fill the gap, we developed Orione, a Galaxy-based web server for microbiology. Orione includes all post mapping or assembling steps, from scaffolding to complete annotation pipelines, which have been grouped into appropriate sections to facilitate navigation. We started by selecting the relevant software in the microbiology area, then developed all the necessary tools to integrate them into the Galaxy ecosystem. In addition, we made available several specialized workflows covering major applications such as bacterial resequencing, de novo assembly, scaffolding, bacterial RNA-seq, gene annotation and metagenomics. Orione provides additional capabilities to perform integrative, reproducible and transparent bioinformatic data analysis in microbiology, thus expanding the constellation of specialized Galaxy-based web servers such as Nebula, Cistrome and several others. Orione is available at http://orione.crs4.it


P8: Microbiome profiling on a Galaxy-based framework for Microbiology

http://www.crs4.it/web/bioinformatics/peopledetails/people/198/Nicola_Soranzo

Stefano Onano, Gianmauro Cuccuru, Massimiliano Orsini, Andrea Pinna, Andrea Sbardellati, Nicola Soranzo, Paolo Uva, Giorgio Fotia

  • CRS4, Pula, Sardegna, Italy

Poster

Gut microbiome composition has been strongly related to different health statuses and pathologies, from metabolic disorders to chronic inflammatory syndromes or neoplastic diseases. Currently, NGS approaches allow deep investigation of the microbial community, thus helping to elucidate the role of each microbiome component. Metagenomics downstream analysis plays a central role in this context, where millions of sequences are aligned against thousands of genomes, and different algorithms or settings can lead to different results. In order to create an environment for metagenomics analysis and to allow data and results sharing among collaborators, we exploited Orione, a web-based framework for microbiology developed at CRS4 (http://orione.crs4.it/). Orione integrates several tools and pipelines focusing on different aspects of metagenomics analysis, from pre-processing to read binning and community composition reconstruction. To demonstrate the capabilities of the Orione framework for the management and analysis of metagenomics data, we illustrate a case study in which we compare, in an easy and reliable way, several approaches for the analysis of the human gut microbiome and an artificial microbiome.


P9: Control Free Tumour Analysis with Galaxy

Saskia Hiltemann1,2, Hailing Mei3, Mattias de Hollander3,4, Peter van der Spek2, Guido Jenster1 and Andrew Stubbs2

  • 1 Department of Urology, Josephine Nefkens Institute, Erasmus University Medical Center, Rotterdam, The Netherlands
    2 Department of Bioinformatics, Erasmus University Medical Center, Rotterdam, The Netherlands
    3 Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
    4 Netherlands Institute for Ecology, Wageningen, The Netherlands

Poster

The first step in tumour analysis is typically a correction with a normal sample, taken from healthy tissue of the same individual. The majority of variants (80%-95%) found in a tumour sample are germline mutations also found in the healthy tissue. When such an associated normal sample is not available, a different filtering method must be employed. Because the majority of variants found in an individual are common throughout the population, we have constructed a set of 85 samples from healthy, unrelated individuals, to act as a “virtual normal”.
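The filtering idea described above can be sketched as a set operation: any tumour variant that also appears in the panel of healthy samples is discarded. This is a simplified illustration, not the authors' pipeline; the variant representation, the function name and the `min_samples` threshold are assumptions.

```python
from collections import Counter


def virtual_normal_filter(tumour_variants, normal_panel, min_samples=1):
    """Keep tumour variants seen in fewer than `min_samples` panel samples.

    tumour_variants: iterable of (chrom, pos, ref, alt) tuples.
    normal_panel:    list of iterables, one per healthy individual.
    """
    counts = Counter()
    for sample in normal_panel:
        # Count each variant at most once per individual.
        counts.update(set(sample))
    return {v for v in tumour_variants if counts[v] < min_samples}


if __name__ == "__main__":
    tumour = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")]
    panel = [
        [("chr1", 100, "A", "G")],
        [("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")],
    ]
    # With the default threshold, only the variant absent from every panel sample remains.
    print(virtual_normal_filter(tumour, panel))
```

Raising `min_samples` makes the filter less strict, keeping variants that occur in only a few panel individuals.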

We tested our "virtual normal" somatic variant detection approach on two public breast cancer datasets, and two in-house prostate cancer samples, both sequenced on the Complete Genomics platform. We compared the results of this analysis to a standard tumour/normal analysis to detect somatic variations for both structural variations (SVs) as well as SNVs and small indels and substitutions. In addition, the results of both analyses were filtered for variants present in several databases of human variation, including the 1000 Genomes project, dbSNP, and the Exome Variant Server. We have implemented the tools used for this data analysis in a user-friendly Galaxy instance, which is deployed in a cloud environment to allow for instant scale-up and to provide resources for large experimental studies from translational research scientists.

Our "virtual normal" method was able to remove up to 97% of the variants also filtered out using the tumour/normal approach, as well as remove a large number (approx. 150,000 small variants and 100 SVs) of additional variants which are not removed when using only the matched normal sample and the public variant databases. Our results suggest that this “virtual normal” approach can act as a substitute for an associated normal sample, eliminating the need to sequence a matching normal sample for every tumour sample.


P10: Identification and Epidemiological Surveillance of Bacteria: Web System Development and Evaluation of Intelligent Methods

Mansoldo, Felipe Raposo Passos de; Vellasco, Marley Maria Bernardes Rebuzzi (Advisor)

  • 1 Departamento de Engenharia Elétrica, Pontifícia Universidade Católica do Rio de Janeiro

Poster

We developed a web system called BCIWeb (Bacterial Classification and Identification for Web) that can assist in bacterial identification and provide the technology necessary for the administration and control of clinical specimens coming from hospitals, and for knowledge discovery in the database system through data mining methods, using SOM (Self Organizing Maps) and Multilayer Perceptron (MLP) neural networks for classification and identification of bacteria.

In most laboratories, the administration and control of samples are done manually through many kinds of data sheets, from the moment the samples of biological material are gathered at the hospital up to the final identification at the laboratory. In this context, the organization of the information becomes very limited and it is almost impossible to extract useful knowledge, which could help not only in supporting decisions but also in producing simple statistics.

It is worth mentioning that the system developed is generic and can easily be adapted for use in other areas. It has a web platform, a friendly interface and multi-user support, can be configured for all classes of bacteria, and can be used from any kind of web browser. Access is possible from any type of computer, with various operating systems, as well as from cell phones and tablets.

Following the development of this tool, in the case study the historical data from the UERJ Biology Department were entered into the system. The proposed intelligent methods for classification and identification of bacteria were analysed and showed promising results.


P11: Running on HPC Galaxy-based workflows for predictive biomarkers from RNA-Seq clinical data

Calogero Zarbo

Calogero Zarbo, Marco Chierici, Cesare Furlanello

  • Fondazione Bruno Kessler, Trento, Italy

Poster

We present a Galaxy-based framework for clinical diagnostics on big datasets of RNA deep-sequencing (RNA-Seq) data. The framework implements a complete Data Analysis Plan (DAP), integrating state-of-the-art RNA-Seq analysis pipelines with machine learning methods for predictive biomarker selection. Here we discuss in detail a Galaxy workflow for the identification of predictive biomarkers from RNA-Seq data, including the comparison with paired microarray data. Our solution extends functions from the paramiko v1.7.5 module in order to transport the Galaxy workflow processes through a virtual bash shell, over an SSH data stream connection, to a high performance computing (HPC) system, e.g. a Linux cluster with the SGE queue system. The goal is to achieve parallelization with one workflow, keeping the same flexibility as a direct interaction with SGE. The solution provides functions for importing data into the HPC resource, building the entire SGE call at run time, controlling process status and exporting results (datasets) back to a Galaxy host. In particular, the status control methods are mirrored into native standard communication streams on the Galaxy host, thus enabling the rich functionality already existing in Galaxy, such as job status and bug reports. DAP components (classifiers, feature weighting, feature stability methods, etc.) are tools of the MLPY Python library, and experiments are organized on a 10x 5-fold cross-validation (CV) schema. The workflow runs on the FBK KORE HPC Facility, a Linux cluster consisting of 90 nodes (~1000 cores, 5TB RAM), with tests on different datasets, the largest of 500 samples, within the US FDA-led SEQC international initiative.
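
The 10x 5-fold cross-validation schema mentioned above can be sketched as follows. This is a minimal standard-library illustration and not the MLPY-based implementation used in the workflow; the function name `repeated_kfold`, its parameters and the fixed seed are hypothetical choices made for the example.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices into n_folds nearly equal-sized folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            # Training set: every sample that is not in the held-out fold.
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, sorted(train_idx), sorted(test_idx)

# 20 hypothetical samples -> 10 repeats x 5 folds = 50 train/test splits.
splits = list(repeated_kfold(n_samples=20))
```

Each split would typically feed one classifier training and evaluation run, so the 50 runs are independent and could be dispatched as separate cluster jobs.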


P12: Developing a Web-Based Tool for Analysing Cell Type-Specificity of Genomic Variation Data

Kristoffer Waløen

Poster

The majority of trait-associated variants found in GWAS lie within non-coding sequences. This suggests that a large proportion of variants alter regulatory regions. Certain genomic features have been shown to be useful as marks of cell type-specific activity of genomic regions. Analyzing such genomic features against variant regions may therefore be used to find previously unknown links between trait and cell type. Although several investigations of this type have been carried out, no easily accessible tools for this type of research exist. This makes reproduction of such results difficult and time-consuming, hindering their confirmation and updating.

Such an accessible tool for studying cell-type specificity of genomic regions is presented here, created in a Galaxy-based web interface at the Genomic HyperBrowser server. It allows users to run a selection of analyses on their own genomic variation data against genomic tracks of cell-type specific marks. A table presenting the main results provides a broad overview of the most relevant cell types, while links to further details behind each main result allow for deeper investigations.

The tool presented here allows anyone to run such analyses without deep knowledge of statistics and informatics, as most parameters and variables are set automatically by the system. Combined with the graphical interface in the HyperBrowser, this makes it easy to specify and reproduce analyses.


P13: Gene Regulatory Network Inference and Analysis using Galaxy

Alex Upton1, Theo Arvanitis1, Cristin Print2, Daniel Hurley2

  • 1 The University of Birmingham
    2 The University of Auckland

In this work, we present a joint project between The University of Birmingham and The University of Auckland. The goal of this project was to deliver a tool that allows users with limited computer skills to infer and analyse gene regulatory networks from microarray data. Gene regulatory network inference and analysis is an approach for analysing microarray data that has the potential to highlight key genes, and has already produced a number of significant biological results in several species. However, widespread use of gene regulatory networks to analyse microarray data is hindered by the specialist programming skills that are required, and also by the variability in how these methods are implemented between research groups. Biologists are daunted by the prospect of having to learn programming languages such as Matlab, R, and Python. We present a solution using Galaxy. Gene network inference and analysis tools are hosted on Galaxy, allowing the end user to infer and analyse gene regulatory networks from microarray data using a simple web-based interface. Inference is carried out using the widely implemented WGCNA algorithm, and analysis is performed using a number of graph theory metrics. Enrichment analysis and visualisation options are also implemented. To our knowledge, this is the first time that gene regulatory inference and analysis tools have been implemented using Galaxy, and it is hoped that this will encourage greater use of gene regulatory networks as a method for analysing microarray data.
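
To make the inference step more concrete, the core idea of an unsigned WGCNA-style network can be sketched in a few lines: absolute pairwise correlations between gene expression profiles are raised to a soft-thresholding power, and a simple graph metric (connectivity, the sum of a gene's edge weights) is computed on the result. This is only a sketch under stated assumptions, not the Galaxy tools described above: the function names, the toy expression values and the power `beta=6` are illustrative choices.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expr, beta=6):
    """Weighted adjacency a_ij = |cor(x_i, x_j)| ** beta for every gene pair."""
    genes = sorted(expr)
    adj = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            weight = abs(pearson(expr[g], expr[h])) ** beta
            adj[g][h] = adj[h][g] = weight  # the network is undirected
    return adj

def connectivity(adj):
    """Sum of edge weights per gene, a basic graph-theory metric."""
    return {g: sum(neighbours.values()) for g, neighbours in adj.items()}

# Toy data: geneA and geneB are strongly co-expressed, geneC is not.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
adj = soft_threshold_adjacency(expr)
k = connectivity(adj)
```

Raising correlations to a power keeps strong co-expression links close to their original weight while pushing weak ones towards zero, which is why no hard cut-off is needed.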


P14: Tools for Genome-wide Analyses of Genomic Divergence

Torkil Vederhus

  • University of Oslo

Poster

The recent revolution in genomic sequencing has created new opportunities for exploring the connection between genomic variation and biological traits. By sequencing multiple individual genomes within a species, it is possible to identify genomic regions of divergence between groups of individuals sharing particular phenotypic traits. Such a strategy has been successfully applied in the literature to studies of parallel evolution, but none of these earlier studies have made the underlying methodology and tools readily accessible. It is therefore difficult to reproduce their results or to reuse the methodology for new investigations.

Here we present a general methodology for identifying divergence between two groups of genomic sequences. One method calculates a cluster separation score based on a two-dimensional scaling of the pairwise differences between individuals of the population. The other method uses a Fisher's exact test score for each single-nucleotide polymorphism found. The tools reproduce earlier published results on parallel evolution in freshwater three-spine sticklebacks and a long-term evolution experiment with common fruit flies.
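
As an illustration of the second method, a two-sided Fisher's exact test for a single SNP can be computed from a 2x2 table of allele counts per group. The sketch below is a standard-library version and not the HyperBrowser implementation; the table layout (rows are groups, columns are alleles) and the function name are assumptions made for the example. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical SNP: group 1 carries 8 reference and 2 alternative alleles,
# group 2 carries 1 reference and 5 alternative alleles.
p_divergent = fisher_exact_two_sided(8, 2, 1, 5)
# A SNP with identical allele counts in both groups shows no divergence.
p_equal = fisher_exact_two_sided(5, 5, 5, 5)
```

A small p-value marks a SNP whose allele frequencies differ between the two groups; scanning all SNPs in this way gives a per-position divergence score.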

Both methods are implemented as Galaxy tools in the Genomic HyperBrowser web server. In theory, the tools make it possible for anyone with internet access to perform reproducible analyses identifying regions of genomic divergence between populations. In practice, however, the complexity of the methodology and the non-uniformity of the formats used to represent the relevant genomic data remain a challenge.


P15: Validation setup for cost-efficient RNA-sequencing of pooled samples

Qvist P1,3, Rajkumar AP1-3, Christensen JH1,3, Song H4, Wang Q4, Borglum AD1-3

  • 1 Department of Biomedicine, Aarhus University, Denmark
    2 Center for Psychiatric Research, Aarhus University Hospital, Denmark
    3 The Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH
    4 Beijing Genomics Institute (BGI)

Poster

Introduction: Sequencing pools of individual RNA samples can reduce the cost of RNA sequencing. However, the validity of such a pooling strategy for detecting differentially regulated genes remains uncertain. Hence, we aim to validate an RNA sequencing strategy involving pooling of individual RNA samples, derived from the brains of genetically modified mice and of their wild-type littermate controls.

Material and methods: Brains were obtained from 8 wild-type and 8 genetically modified mice and sectioned manually in a 1 mm coronal mouse brain matrix. Micro-punches containing amygdala were collected from each section. RNA was extracted from tissue using the Maxwell automated system (Promega) and quality was assessed using the Agilent 2100 system.

For each genotype 3 groups were formed:

  1. All individual samples separately
  2. Pool of 8 samples
  3. Pool of 3 samples.

TruSeq libraries were constructed for individual samples and pools following polyA enrichment. Libraries were sequenced on an Illumina HiSeq 2000 platform with 50bp SE sequencing.

Analysis: For all samples

  • Data filtering including removal of adaptors, contamination and low-quality reads from raw reads
  • Assessment of sequencing (statistics of raw reads, sequencing saturation analysis, analysis of the distribution of reads on reference genes)
  • Gene expression annotation

Between genotypes for all groups:

  • Differential gene expression analysis (Screening of differentially expressed genes (DEGs), and experimental repeatability analysis of DEGs)

Between groups:

  • Comparison of DEGs (DEGs detected in Group 1 as reference)


P16: Using Frequent Itemset Mining to Find Sets of Co-Occurring Genomic Tracks

Boris Simovski, Geir Kjetil Sandve

Poster

While immense amounts of genomic data are now publicly available, analyzing the data is a complicated and at times resource-intensive task. A well-established analysis is the computation of pairwise overlap between two genomic tracks. However, in certain situations it is valuable to consider a larger number of genomic tracks and, for example, discover subsets of the tracks that occur together at the same locations along the genome. An example of such a problem is to find combinations of transcription factor (TF) ChIP-seq tracks that occur at the same locations in the genome, either from a set of tracks for different TFs or from a set of tracks for the same TF in different cells/settings.

The problem at hand can be translated into a more general problem within the field of data mining, called frequent itemset mining. According to the itemset mining terminology, we take the genomic tracks to represent items and the base-pair positions of the genome to represent transactions.

We present a Galaxy-based web tool at the Genomic HyperBrowser web server that enables the user to run frequent itemset mining on large sets of genomic tracks. The result is a list of track combinations that occur together on at least a minimum number of base pairs along the genome. We present results for two different approaches, based on the breadth-first Apriori and the depth-first Eclat algorithm. We discuss their advantages and drawbacks, as well as the general usefulness of applying itemset mining to the analysis of genomic tracks.
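
The mapping described above (tracks as items, base-pair positions as transactions) can be sketched with a small breadth-first Apriori implementation. This is an illustrative sketch only, not the HyperBrowser tool: the function names, the half-open `(start, end)` interval convention and the toy tracks are assumptions, and materialising one transaction per base pair is done purely for clarity and would not scale to a real genome.

```python
from itertools import combinations

def tracks_to_transactions(tracks, genome_length):
    """One transaction per base pair: the set of tracks covering it."""
    covered = [set() for _ in range(genome_length)]
    for name, intervals in tracks.items():
        for start, end in intervals:  # half-open intervals [start, end)
            for pos in range(start, min(end, genome_length)):
                covered[pos].add(name)
    return covered

def apriori(transactions, min_support):
    """Return {frozenset(tracks): number of supporting base pairs}."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (the Apriori pruning step).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Three hypothetical TF tracks on a 10 bp toy genome.
tracks = {"TF_A": [(0, 6)], "TF_B": [(2, 8)], "TF_C": [(4, 10)]}
transactions = tracks_to_transactions(tracks, 10)
result = apriori(transactions, min_support=2)
strict = apriori(transactions, min_support=3)
```

With a minimum support of 2 bp all three tracks are reported as co-occurring (on positions 4 and 5), whereas raising the threshold to 3 bp leaves only the pairs TF_A/TF_B and TF_B/TF_C.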


P17: CRAC: A new software based on a combinatorial and integrated approach to analyse RNA-seq reads

Nicolas Philippe1,3,4, Mikael Salson2, Alban Mancheron3,4, Thérèse Commes1,4, Eric Rivals3,4

  • 1 Institut de Recherche en Biothérapie, INSERM U1040, France
    2 Laboratoire d'Informatique Fondamentale de Lille, France
    3 Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, France
    4 Institut de Biologie Computationnelle, Montpellier France

Poster

The comprehensive analysis of expression profiles based on RNA-seq provides accuracy and unprecedented sensitivity for exploring the transcriptome in all its complexity. This method is particularly suited to the discovery of new transcripts (new variants, non-coding RNAs and RNA chimeras). The difficulty in the analysis lies in detecting and rigorously extracting the biological information from RNA-seq data. Indeed, the splicing process, which generates both co-linear and non-co-linear RNAs, together with sequencing errors, somatic mutations, polymorphisms, and rearrangements, makes the reads differ from the reference genome in a variety of ways. This complicates the task of comparing reads with a genome. We have developed a new tool, called CRAC, for exploring the whole transcriptional repertoire (Philippe et al., 2013) based on an innovative algorithm. The main idea is to adopt a k-mer approach that combines genomic positions and local coverage to perform a complex analysis of each read and detect, in a single step, mutations, indels and errors, as well as both normal and chimeric splice junctions. For biological applications, one of the advantages of using CRAC is its ability to characterize the presence of new splice junctions and RNA chimeras in tumors. CRAC is fully operational open source software, which is more efficient than the tools currently used in the field. CRAC is available at http://crac.gforge.inria.fr. The ATGC platform, part of the ReNaBi and France Genomique bioinformatics network, now provides its own new Galaxy service giving access to a range of NGS tools that includes CRAC.


P18: Detection of Copy Number Alterations (CNAs) in Paired Exome Sequence Data Sets of Acute Myeloid Leukemia (AML) Patients Using Galaxy

S. Vosberg1,2, T. Herold2, N. Sandhöfer1,2, G. Göhring3, A. Graf4, S. Krebs4, H. Blum4, B. Schlegelberger3, K. Spiekermann1,2, S.K. Bohlander1,2,5, P.A. Greif1,2

  • 1 Clinical Cooperative Group Leukemia, German Research Centre for Environmental Health, Munich, Germany
    2 Department of Internal Medicine 3, Ludwig-Maximilians-University, Munich, Germany
    3 Institute of Cell and Molecular Pathology, Hannover Medical School, Hannover, Germany
    4 Laboratory for Functional Genome Analysis, Gene Center, Ludwig-Maximilians-University, Munich, Germany
    5 Center for Human Genetics, Philipps University, Marburg, Germany

Poster

Beyond the identification of SNVs and indels, exome sequencing allows the detection of somatic Copy Number Alterations (CNAs) in protein-coding regions of tumor DNA. Using the Galaxy platform, we analyzed the read depth of tumor and normal control exome sequence data sets from acute myeloid leukemia (AML) patients, confirming unbalanced translocations, aneuploidy and complex karyotypes.

Mean exon coverages were determined for both samples and a linear regression model was used to describe the tumor sample coverage as a linear function of the healthy sample coverage. This approach can handle regions of zero coverage and monoallelic deletions, and tolerates outliers. The maximum-likelihood segmentation, defined by an exact algorithm using a Bayesian Information Criterion adapted for segmentation problems, separates regions of equivalent exon coverage between tumor and control samples from regions of aberrant exon coverage.
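
The regression step can be illustrated with a small example. The sketch below fits an ordinary least-squares line, which is a deliberate simplification: unlike the model described above it is not robust to outliers, and the segmentation step is not shown. The function names and the coverage values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def coverage_ratio(slope, intercept, normal_cov, tumor_cov):
    """Observed tumor coverage relative to the coverage the fit predicts."""
    expected = slope * normal_cov + intercept
    return tumor_cov / expected if expected > 0 else float("nan")

# Mean coverage of four hypothetical copy-neutral exons in both samples.
normal = [10.0, 20.0, 30.0, 40.0]
tumor = [21.0, 41.0, 61.0, 81.0]
slope, intercept = fit_line(normal, tumor)

# An exon whose tumor coverage is 1.5 times the predicted value, which is
# what a single-copy gain in a diploid region would be expected to produce.
gain = coverage_ratio(slope, intercept, normal_cov=30.0, tumor_cov=91.5)
```

Ratios close to 1 indicate exons with equivalent coverage in tumor and control, while ratios that stay above or below 1 over many neighbouring exons are candidates for a copy number alteration.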

Tumor samples with trisomy 13 show significantly enriched exon coverage of chromosome 13 compared to the control sample. An MLL/AF9 rearrangement with partial loss of the 3'-part of the MLL gene showed significantly reduced coverage in the tumor sample at exons downstream of the breakpoint in the MLL gene. A complex karyotype including the unbalanced translocations t(1;2), t(5;17), t(8;11) and a monosomy 7 resulted in corresponding CNAs on chromosomes 1, 5, 7, 11 and 17.

Our study demonstrates that somatic CNAs in tumor cells can be identified by exome sequencing of tumor and control samples. Furthermore, this approach might be able to detect novel tumor-specific CNAs in protein coding regions contributing to the onset and progression of AML.


P19: Development of a Moroccan Database for Cancer Care (MD2C)

Oussama Semlali1, Adil El Yamine1, Fadoua Haoudi1, Housna Arrouchi1, Ahmed Moussa2, Azeddine Ibrahimi1

  • 1 MedBiotech (Research Equipe of Medical Biotechnology), Pharmacology and Toxicology Laboratory, Rabat - Faculty of Medecine & Pharmacy of Rabat, UM5S, Morocco
    2 Innovative Technologies Laboratory, ENSAT, Abdelmalek Essaadi University, Tangier, Morocco

In Morocco, breast cancer in women constitutes a major public health problem. According to the Central Cancer Registry RCCR, the disease’s incidence increased over a period of three years to 39.9 new cases per 100,000 women. Breast cancer is a heterogeneous disease with different morphologies, molecular profiles, clinical behavior and disparate response to therapy. However, the increasing understanding of molecular carcinogenesis has begun to change paradigms in oncology from a traditional single-factor strategy to a multi-parameter systematic strategy. The classic therapeutic model for breast cancer treatment has changed from adopting radical surgery, conservative surgery, radiotherapy, chemotherapy and hormonotherapy to a more personalized strategy.

In this paper, we describe the development of the Moroccan Database for Cancer Care (MD2C). As a first step, this platform will integrate all the information relevant to Moroccan breast cancer patients in a database. A query interface is being developed using open source technologies, allowing easy and secure access to the breast cancer database. The second step is to generate expert systems to assist in decision making. Our MD2C database includes all patients’ personal and socio-economic data, family and personal disease history, clinical and paraclinical diagnosis, and genetic and genomic data. Throughout all development phases, this work was done by our bioinformatics team in a multidisciplinary setting including oncologists, pathologists and pharmacists. This database will help Moroccan doctors in making precise decisions concerning risks, diagnosis and the therapeutic protocols to use, and will allow us to extract knowledge to generate the first Moroccan breast cancer therapeutic model.


P20: Toward a French cyber-galaxy?

Cristophe CARON1, Wilfrid CARRE1, Alexandre CORMIER1, Sandra DEROZIER2, Franck GIACOMONI3, Olivier INIZAN4, Gildas LE CORGUILLE1, Alban LERMINE5, Sarah MAMAN6, Pierre PERICARD1 and Franck SAMSON2

  • 1 CNRS, UPMC, FR2424,ABiMS, Station Biologique, 29680, Roscoff, France
    2 INRA, UR1077, MIGALE, Centre de Jouy-en-Josas, 78352, Jouy-en-Josas, France
    3 PFEM, UMR1019 INRA, Centre Clermont-Ferrand-Theix, 63122, Saint Genes Champanelle, France
    4 INRA, UR1164, Route de St Cyr, Versailles, France
    5 Institut Curie, INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, 75248 Paris, France
    6 INRA, UMR444, Laboratoire de Génétique Cellulaire, Centre de Toulouse Auzeville, 24 Chemin de Bordé Rouge, 31320 Auzeville-Tolosane, France

Poster

The success of the open web-based platform “Galaxy” is growing among scientific communities. The French Institute of Bioinformatics (IFB) wishes to initiate a collaborative effort dedicated to scientific workflows and especially to the Galaxy platform. We report here the main items on which future collaborations could be built: (i) software and hardware architecture, (ii) tools integration and (iii) training.

The advent of high-throughput technologies significantly alters analysis behaviour and strategy, mobilizing new infrastructure, new tools and new skills. IFB decided to conduct a cross-cutting action on "workflow" data analysis solutions, and especially on the Galaxy platform. The first item, "software and hardware architecture", addresses the operational issues in production environments, the potential for automating deployment tasks and the monitoring solutions for Galaxy servers.

With the second item, "tools integration", we aim to provide processes facilitating tool interfacing in a Galaxy instance. The priority will be the development of a good practice guide, as well as a technology watch around the methods proposed by the international community. For the third item, training, we want to promote the sharing of training activities at national level (such as the Aviesan Bioinformatics school, January 2013 - http://galaxy-ecole.sb-roscoff.fr/) and ensure a smooth transition to new uses, such as e-learning. A first working group is already active. The items above will be improved in the coming months thanks to a dedicated wiki and the first French Galaxy Workshop this autumn.


P21: The Galaxy service pilot in CSIRO – a collaboration between science and IT

Steve McMahon1, Philippe Moncuquet2, Sean Li2, Ondrej Hlinka2, Josh Bowden1, Sean McWilliam2 and Annette McGrath2

  • 1 Advanced Scientific Computing Team, Information Management & Technology, CSIRO, Canberra, Australia
    2 CSIRO Bioinformatics Core, CSIRO, Canberra, Australia

Poster

A Galaxy service pilot was set up in CSIRO for the benefit of biologists and bioinformaticians within the organisation. The Galaxy service pilot was implemented as a collaboration between CSIRO’s Information Management and Technology staff (IM&T) and the CSIRO Bioinformatics Core.

In CSIRO, biologists had been relying on a limited number of skilled bioinformaticians to carry out their analyses. It was proposed that a service providing easy access to some analysis tools would improve the research throughput of novice bioinformaticians while freeing up the time of experienced bioinformaticians for other work.

This service pilot project was intended to demonstrate how a full Galaxy service might benefit the bioscience community in CSIRO. Lessons learnt from the pilot were intended to guide the design and implementation of a full production Galaxy service. The service pilot delivered over 300 useful bioinformatics tools and focussed on providing a comprehensive set of next-generation sequencing analysis tools to enable users to best evaluate the capabilities of a potential Galaxy production service.

The project was successful in that it showed how CSIRO IT and science staff could work together to achieve project goals. The service pilot was made available to users in September 2012 and there are now over 90 registered users and a number of published workflows. The pilot service is being used extensively by some users and feedback has been extremely positive. With the success of the pilot, management approval is being sought for an ongoing production service.


P22: Andromeda: NBIC Galaxy at Surfsara's HPC cloud

Mattias de Hollander1, David van Enckevort2, Leon Mei2, Marc van Driel2, Rob Hooft2

  • 1 KNAW-NIOO
    2 NBIC

Poster

Andromeda is a public Galaxy server set up by the Netherlands Bioinformatics Center (NBIC) to support genomics research in the Netherlands. Andromeda has been running for over 3 years and was originally intended to be a demonstration server for bioinformatics tools made by NBIC developers. Several application-specific pipelines are installed on Andromeda together with common sequencing analysis tools. Andromeda has been used in several NBIC courses to support practicals and has proven to be an effective platform for knowledge dissemination.

However, the need to process real-scale research datasets on Andromeda was clearly visible from the beginning. This demand has only become more prominent in the past year, as more researchers are able to acquire NGS datasets for their projects but fail to obtain the necessary bioinformatics support within their groups.

To support this growing demand, NBIC, together with the BigGrid project and SURFsara, installed the new Andromeda on a high performance computing cloud system hosted by SURFsara. This HPC cloud consists of 19 fast servers with 608 CPUs and almost 5TB of memory. In order to make the best use of the elastic resources provided by the HPC cloud, the new Andromeda also incorporates the CloudMan script to support dynamically adding and removing virtual machines based on the number of submitted jobs. As of the beginning of 2013, there were about 700 registered users on Andromeda and almost 40,000 jobs had been executed.

In this presentation, we will present the architecture of Andromeda and its installation and maintenance procedure.


P23: Implementing next generation web server in Galaxy

Wai Yi Leung1, Leon Mei2

  • 1 Leiden University Medical Centre, Sequence Analysis Support Core
    2 NBIC / Leiden University Medical Centre, SASC

Poster

A few institutes have made Galaxy servers available to the public, which has helped the Galaxy user community grow. The user base of these public servers has grown very fast, creating new challenges for administrators. Challenges include traffic handling, data storage, computing facilities (cluster), new versions of (optimized) software (Tool Shed) and production-ready deployments of the web server and database server.

The focus on performance has shifted from running all analyses in one instance to running them on a local cluster. Optimizations in the tools are dealt with in every major release (e.g. bwa). The question remains: what about the basics? The Galaxy codebase is Pylons-based and, at the time Galaxy was written, ran exclusively with Paste.

Our interest was to see whether we could push Galaxy to a new limit by serving more requests per second. The reason for this is simple: web requests are (relatively) not CPU-intensive. Web requests are mostly database-connection bound and/or filesystem bound because of the web templates. We expect a gain in the number of web requests served when we replace Paste with a modern WSGI server like Gunicorn, Tornado or uWSGI.

An initial setup with Gunicorn shows a 200% gain in served requests per second and a 70% drop in memory usage. uWSGI shows a comparable profile, though with a much more complex configuration. We aim to provide a solution where minimal change is needed to run Galaxy in an environment optimized for production usage.
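
Swapping the server underneath an application is possible because the servers named above all speak the WSGI interface. The sketch below shows a minimal WSGI callable and calls it in-process the way a server would for each request. It is a toy stand-in and not Galaxy's actual application object; the names `app` and `call_app` are hypothetical.

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """A minimal WSGI application: any WSGI server can call this."""
    body = b"ok\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

def call_app(application, path="/"):
    """Invoke a WSGI app in-process, as a server does once per request."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a plausible request environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body

status, body = call_app(app)
```

Saved as a module, for example a hypothetical `demo.py`, the same callable could be served with `gunicorn demo:app` without changing the application code, which is what makes this kind of server comparison cheap to set up.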


P24: Leveraging Canadian Bioinformatics with Galaxy VZ in a HPC center

David Anderson de Lima Morais1,2, Carol Gauthier1,2, Michel Barrette1, David Bujold2, Maxime Caron2, Alain Veilleux1, Guillaume Bourque1

Poster

Bioinformatics in Canada is a fast-growing science. The need for data analysis and storage has long surpassed what any single lab can accomplish. Moreover, the complexity of some pipelines renders the analysis unfeasible for users not acquainted with programming languages. Using the Mammouth supercomputer, presently the third fastest in Canada, we provided a Galaxy environment for the Canadian scientific community. Our hybrid approach (cloud/HPC) consists of deploying Galaxy on a virtual machine (hosted on the interactive node) in a way that allows for the launching of jobs on Mammouth’s computing nodes, using simple connectors and file system mounts. This approach allows us to use Galaxy in a secure and self-contained environment while benefiting from the full power of the HPC center. Galaxy has also been coupled with our local UCSC browser installation, which allows for fast data integration. We intend not only to provide tools for data analysis but also to serve and maintain a set of common pipelines, which can be easily used by any researcher. We also have a close collaboration with the Integrative Epigenomic Data Coordination Centre (EDCC) at McGill University, which will enable us to share data and pipelines related to epigenomics. Ultimately, we want to extend our model to other Canadian HPC centres and deploy Galaxy pipelines using its API through an external metascheduler.


P25: LiSIs: a Galaxy-based platform for Life Science Informatics Research

Kannas Christos C.1, Antoniou Zinonas1, Achilleos Kleo1, Nicolaou Christos A.1, Pattichis Costantinos S.1, Kalvari Ioanna2, Kirmitzoglou Ioannis2, Promponas Vasilis I.2, Savva Christiana2, Nephytou Christiana2, Contantinou Andreas I.2, Scherf David3, Gerhäuser Clarissa3

  • 1 Department of Computer Science, University of Cyprus, Nicosia, Cyprus
    2 Department of Biological Science, University of Cyprus, Nicosia, Cyprus
    3 Cancer Chemoprevention and Epigenomics Workgroup, German Cancer Research Center, Heidelberg, Germany

Poster

In this presentation we introduce the Life Science Informatics (LiSIs) platform, a new, open Scientific Workflow Management System (SWMS) with several unique features designed to enhance user experience and facilitate user adoption. LiSIs is an online system based on the widely popular Galaxy SWMS. LiSIs provides five tool categories dedicated to small molecule virtual screening, and a selection of native Galaxy tools. The tool categories are: (1) Input Layer, offering tools for chemical and biological data file parsing; (2) Pre-Processing Layer, offering tools for compound fingerprint calculation, chemical structure property calculation, compound fragmentation, conformation generation and protein cleaning; (3) Processing Layer, offering numerous tools for chemical property filtering, compound similarity calculation, predictive modelling for biological properties and docking-pose prediction and scoring; (4) Post-Processing Layer, offering tools for converting chemical file formats and merging binary datasets; (5) Output Layer, offering tools for the preparation of files with the results obtained in SMILES, SDF and tabular format.

LiSIs has been used to implement virtual screening workflows for the selection of compounds that may serve as leads for subsequent cancer chemoprevention research. Typically, several thousand commercially available compounds are supplied as input to a workflow and are subjected to a series of computational filters including, for example, drug likeness, predicted potency via predictive models and predicted binding affinity via docking. The results, shared with expert chemopreventive researchers using the LiSIs platform, demonstrate the potential use of the system by users of varying backgrounds and computational experience to advance drug discovery research.


P26: LifePortal – the Galaxy based portal for life science at University of Oslo

Nikolay Vazov, Katerina Michalickova, George Magklaras

Nikolay Vazov1, Katerina Michalickova1, George Magklaras1,2, Gard Thomassen1, Hans A. Eide1

  • 1 University Center for Information Technology (USIT), University of Oslo
    2 Biotechnology Center of Oslo & Norwegian Center for Molecular Medicine, University of Oslo

Poster

As the demands for simplified and user-centric interfaces to computational resources are increasing, so is the demand for a wider range of applications and tools presented through these interfaces. We selected the Galaxy platform to provide an interface to our high performance computing resources and life sciences software. The production server release for the LifePortal is set for October 1st, 2013. The LifePortal includes services currently provided by a portal for bioinformatics applications - the Bioportal (www.bioportal.uio.no).

Despite successfully hosting several production Galaxy instances on a single server, we had to introduce modifications to the Galaxy distribution to tailor it for our HPC production environment. The adaptations fall into three categories - security, computer cluster job submission and accounting.

The LifePortal will make use of the Norwegian national infrastructure for scientific computing (www.notur.no), specifically the Abel computing cluster at University of Oslo. We are using the Norwegian federated authentication system FEIDE (www.feide.no) to ensure compliance with the terms for usage. We implemented this feature alongside the internal Galaxy user management. Additionally, the Galaxy database has been outsourced to a database hotel using an SSL connection. The LifePortal Galaxy server submits jobs to the Abel compute cluster using the SLURM batch scheduler system (slurm.schedmd.com). This feature provides a user-friendly interface to our high performance computing resource. Since the computing cluster has fixed user quotas, our Galaxy server has to communicate with an external accounting system (www.clusterresources.com/products/gold-allocation-manager.php).



Poster and Talk Abstract Submissions are now closed

Abstracts were submitted electronically. Abstracts should be plain text of 250 words or fewer. Talks and posters on any topics of interest to the Galaxy community are welcome. Areas of interest include, but are not limited to:

  • Best practices for local Galaxy installation and management
  • Integrating tools and/or data sources into the Galaxy framework
  • Deploying Galaxy on different infrastructures
  • Compelling or novel uses of Galaxy for biomedical analysis

There will also be an opportunity for lightning talks, which will be solicited at the meeting. 

Submissions

Please Note: By submitting an abstract you agree to:

  1. Make your slides and/or poster freely available on this web site, no later than 1 August 2013.
  2. Have your talk videotaped and have that videotape be publicly accessible on the web. (We may or may not have sufficient funds to record talks.)

Special GCC2013 and Galaxy series in GigaScience

GigaScience Journal

Accepted talks are eligible to appear in GigaScience, a new journal co-published in collaboration between BGI Shenzhen and BioMed Central focused on studies utilizing large-scale datasets and workflows. 

See this announcement for details.



Poster Guidelines

The maximum poster dimensions are 96 cm across by 138 cm high (38 in. x 54 in).

Information on when posters should be posted, presented, and taken down will appear here before the conference.

Timeline

Date          Event/Deadline
22 February   Talk and Poster Abstract submission opened
12 April      Talk Abstract submission closed
26 April      Authors notified of Talk abstract acceptance status
3 May         Poster Abstract submission closed
16 May        Authors notified of Poster abstract acceptance status
1 August      All conference material made available on the conference web pages.

See Key Dates for a complete timeline.

Questions? Contact the Organizers.