#!/usr/bin/perl

# $Id: bptutorial.pl,v 1.151 2005/01/14 03:55:14 jason Exp $

=head1 NAME

BioPerlTutorial - a tutorial for bioperl

=head1 VERSION

1.5

=head1 AUTHOR

  Written by Peter Schattner

  Copyright Peter Schattner

  Contributions, additions and corrections have been made
  to this document by the following individuals:

  Jason Stajich
  Heikki Lehvaslaiho
  Brian Osborne
  Hilmar Lapp
  Chris Dagdigian
  Elia Stupka
  Ewan Birney

=head1 DESCRIPTION

This tutorial includes snippets of code and text from various Bioperl
documents including module documentation, example scripts and "t" test
scripts. You may distribute this tutorial under the same terms as perl
itself.

This document is written in Perl POD (plain old documentation) format.
You can run this file through your favorite pod translator (pod2html,
pod2man, pod2text, etc.) if you would like a more convenient formatting.

  Table of Contents

  I. Introduction
  I.1 Overview
  I.2 Quick getting started scripts
  I.3 Software requirements
    I.3.1 For minimal bioperl installation
    I.3.2 For complete installation
  I.4 Installation procedures
  I.5 Additional comments for non-unix users
  I.6 Places to look for additional documentation

  II. Brief overview to bioperl's objects
  II.1 Sequence objects (Seq, PrimarySeq, LocatableSeq, RelSegment,
       LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI)
  II.2 Location objects (Simple, Split, Fuzzy)
  II.3 Interface objects and implementation objects

  III. Using bioperl
  III.1 Accessing sequence data from local and remote databases
    III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)
    III.1.2 Indexing and accessing local databases (Bio::Index::*,
            bp_index.pl, bp_fetch.pl)
  III.2 Transforming formats of database/ file records
    III.2.1 Transforming sequence files (SeqIO)
    III.2.2 Transforming alignment files (AlignIO)
  III.3 Manipulating sequences
    III.3.1 Manipulating sequence data with Seq methods (Seq)
    III.3.2 Obtaining basic sequence statistics (SeqStats,SeqWord)
    III.3.3 Identifying restriction enzyme sites (Bio::Restriction)
    III.3.4 Identifying amino acid cleavage sites (Sigcleave)
    III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern
    III.3.6 Converting coordinate systems (Coordinate::Pair, RelSegment)
  III.4 Searching for similar sequences
    III.4.1 Running BLAST remotely (using RemoteBlast.pm)
    III.4.2 Parsing BLAST and FASTA reports with Search and SearchIO
    III.4.3 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq
    III.4.4 Parsing HMM reports (HMMER::Results, SearchIO)
    III.4.5 Running BLAST locally (StandAloneBlast)
  III.5 Manipulating sequence alignments (SimpleAlign)
  III.6 Searching for genes and other structures on genomic DNA
        (Genscan, Sim4, ESTScan, MZEF, Grail, Genemark, EPCR)
  III.7 Developing machine readable sequence annotations
    III.7.1 Representing sequence annotations (SeqFeature,RichSeq,Location)
    III.7.2 Representing sequence annotations (Annotation::Collection)
    III.7.3 Representing large sequences (LargeSeq)
    III.7.4 Representing changing sequences (LiveSeq)
    III.7.5 Representing related sequences - mutations,
            polymorphisms (Allele, SeqDiff)
    III.7.6 Incorporating quality data in sequence annotation (SeqWithQuality)
    III.7.7 Sequence XML representations - generation and parsing (SeqIO::game)
    III.7.8 Representing Sequence Features using GFF (Bio::Tools::GFF)
  III.8 Manipulating clusters of sequences (Cluster, ClusterIO)
  III.9 Representing non-sequence data in Bioperl: structures, trees,
        maps, graphics and bibliographic text
    III.9.1 Using 3D structure objects and reading PDB files
            (StructureI, Structure::IO)
    III.9.2 Tree objects and phylogenetic trees (Tree::Tree, TreeIO, PAML.pm)
    III.9.3 Map objects for manipulating genetic maps (Map::MapI, MapIO)
    III.9.4 Bibliographic objects for querying bibliographic databases (Biblio)
    III.9.5 Graphics objects for representing sequence objects as images (Graphics)
  III.10 Bioperl alphabets
    III.10.1 Extended DNA / RNA alphabet
    III.10.2 Amino Acid alphabet

  IV. Auxiliary Bioperl Libraries (Bioperl-run, Bioperl-db, etc.)
  IV.1 Using the Bioperl Auxiliary Libraries
  IV.2 Running programs (Bioperl-run and Bioperl-ext)
    IV.2.1 Sequence manipulation using the Bioperl EMBOSS and PISE interfaces
    IV.2.2 Aligning 2 sequences with Blast using bl2seq and AlignIO
    IV.2.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
    IV.2.4 Aligning 2 sequences with Smith-Waterman (pSW)
  IV.3 Bioperl-db and BioSQL
  IV.4 Other Bioperl auxiliary libraries

  V. Appendices
  V.1 Finding out which methods are used by which Bioperl Objects
  V.2 Tutorial Demo Scripts

=head1 I. Introduction

=head2 I.1 Overview

Bioperl is a collection of perl modules that facilitate the development
of perl scripts for bioinformatics applications. As such, it does not
include ready-to-use programs in the sense that many commercial packages
and free web-based interfaces do (e.g. Entrez, SRS). On the other hand,
bioperl does provide reusable perl modules that facilitate writing perl
scripts for sequence manipulation, for accessing databases using a range
of data formats, and for running and parsing the results of various
molecular biology programs including Blast, clustalw, TCoffee, genscan,
ESTscan and HMMER. Consequently, bioperl enables developing scripts that
can analyze large quantities of sequence data in ways that are typically
difficult or impossible with web-based systems.

In order to take advantage of bioperl, the user needs a basic
understanding of the perl programming language, including how to use
perl references, modules, objects and methods. If these concepts are
unfamiliar, the user is referred to any of the various introductory or
intermediate books on perl. We've liked S. Holzner's Perl Core Language,
Coriolis Technology Press, for example. This tutorial is not intended to
teach the fundamentals of perl to those with little or no experience in
the perl language. On the other hand, advanced knowledge of perl - such
as how to write an object-oriented perl module - is not required for
successfully using bioperl.

Bioperl is open source software that is still under active development.
The advantages of open source software are well known. They include the
ability to freely examine and modify source code and exemption from
software licensing fees. However, since open source software is
typically developed by a large number of volunteer programmers, the
resulting code is often not as clearly organized and its user interface
not as standardized as in a mature commercial product. In addition, in
any project under active development, documentation may not keep up with
the development of new features. Consequently the learning curve for
actively developed, open source software is sometimes steep.

This tutorial is intended to ease the learning curve for new users of
bioperl. To that end the tutorial includes:

=over 4

=item *

Descriptions of what bioinformatics tasks can be handled with bioperl.

=item *

Directions on where to find the methods to accomplish these tasks within
the bioperl package.

=item *

Recommendations on where to go for additional information.

=item *

A runnable script, bptutorial.pl, which demonstrates many of the
capabilities of Bioperl. Runnable example code can also be found in the
scripts/ and examples/ directories. Summary descriptions of all of these
scripts can be found in the file bioscripts.pod (or
http://bioperl.org/Core/Latest/bioscripts.html). In addition, the POD
documentation for many Bioperl modules should contain runnable code in
the SYNOPSIS section, which is meant to illustrate the use of a module
and its methods. You will also find some interesting bits of code in the
FAQ (http://bioperl.org/Core/Latest/faq.html).

=back

Running the bptutorial.pl script while going through this tutorial - or
better yet, stepping through it with an interactive debugger - is a good
way of learning bioperl. The tutorial script is also a good place from
which to cut-and-paste code for your scripts (rather than using the code
snippets in this tutorial). Most of the demos in the tutorial script
should work on your machine - and if they don't, it would probably be a
good idea to find out why before getting too involved with bioperl! Some
of the demos require optional modules from the bioperl auxiliary
libraries and/or external programs. When the demos are run, those whose
required auxiliary programs are not found should be skipped.

=head2 I.2 Quick getting started scripts

For newcomers and people who want to quickly evaluate whether this
package is worth using in the first place, we have a very simple module
which gives easy access to a small subset of Bioperl's functionality.
The Bio::Perl module provides some simple access functions. For example,
this script will retrieve a swissprot sequence and write it out in fasta
format:

  use Bio::Perl;

  # this script will only work if you have an internet connection on the
  # computer you're using, the databases you can get sequences from
  # are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'

  $seq_object = get_sequence('swiss',"ROA1_HUMAN");

  write_sequence(">roa1.fasta",'fasta',$seq_object);

That second argument, 'fasta', is the sequence format.
You can choose among all the formats supported by SeqIO (L<"III.2.1">).

Another example is the ability to blast a sequence using the facilities
at NCBI. Please be careful not to abuse the resources that NCBI provides
and use this only for individual searches. If you want to do a large
number of BLAST searches, please download the blast package and install
it locally.

  use Bio::Perl;

  $seq = get_sequence('swiss',"ROA1_HUMAN");

  # uses the default database - nr in this case
  $blast_result = blast_sequence($seq);

  write_blast(">roa1.blast",$blast_result);

Bio::Perl has a number of other easy-to-use functions, including

  get_sequence        - gets a sequence from standard, internet accessible
                        databases
  read_sequence       - reads a sequence from a file
  read_all_sequences  - reads all sequences from a file
  new_sequence        - makes a bioperl sequence just from a string
  write_sequence      - writes a single sequence or an array of sequences
                        to a file
  translate           - provides a translation of a sequence
  translate_as_string - provides a translation of a sequence, returning back
                        just the sequence as a string
  blast_sequence      - BLASTs a sequence against standard databases at
                        NCBI
  write_blast         - writes a blast report out to a file

Using the Bio::Perl module, it is possible to manipulate sequence data
in Bioperl without explicitly creating the Seq or SeqIO objects
described later in this tutorial. However, only limited data
manipulation is supported in this mode.

Look at the documentation in L<Bio::Perl> by running 'perldoc Bio::Perl'
to learn more about these functions. In all these cases, Bio::Perl
accesses a subset of the underlying Bioperl functions (for example,
translation in Bioperl can handle many different translation tables and
provides different options for stop codon processing). Most users will
migrate to using the underlying bioperl objects as their sophistication
level increases, but Bio::Perl provides an easy on-ramp for newcomers
and lazy experts alike.
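
The following is a minimal sketch, not taken from the bioperl distribution, of how a few of the other functions listed above might be combined without any network access. It assumes that Bio::Perl is installed and exports new_sequence, translate_as_string, write_sequence and read_sequence as described above. The sequence string, the 'demo_seq' identifier and the temporary file name are invented for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Bio::Perl;                 # exports new_sequence, translate_as_string, etc.
use File::Temp qw(tempdir);    # core perl module, used only for a scratch directory

# A short, made-up coding sequence (ATG GCC AAA TTT GGG TAA); not a real gene.
my $dna = new_sequence("ATGGCCAAATTTGGGTAA", "demo_seq");

# Translate with Bio::Perl's default settings and get a plain string back.
my $protein = translate_as_string($dna);
print "Protein: $protein\n";

# Write the sequence out as fasta, then read it back in,
# without creating any SeqIO objects explicitly.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo_seq.fasta";
write_sequence(">$file", 'fasta', $dna);

my $copy = read_sequence($file, 'fasta');
print "Read back: ", $copy->display_id, " ", $copy->seq, "\n";
```

Because nothing here contacts a remote database, a script along these lines can be a convenient first check that a bioperl installation is working before trying get_sequence or blast_sequence.
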

Also see examples/bioperl.pl for more examples of usage of this module.

=head2 I.3 Software requirements

What's required to run bioperl.

=head2 I.3.1 Minimal bioperl installation (Bioperl "core" installation)

For a minimal installation of bioperl, you will need to have perl itself
installed as well as the bioperl "core modules". Bioperl has been tested
primarily using perl 5.005, 5.6, and 5.8. The minimal bioperl
installation should still work under perl 5.004. However, as increasing
numbers of bioperl objects are using modules from CPAN (see below),
problems have been observed for bioperl running under perl 5.004. So if
you are having trouble running bioperl under perl 5.004, you should
probably upgrade your version of perl.

In addition to a current version of perl, the new user of bioperl is
encouraged to have access to, and familiarity with, an interactive perl
debugger. Bioperl is a large collection of complex interacting software
objects. Stepping through a script with an interactive debugger is a
very helpful way of seeing what is happening in such a complex software
system - especially when the software is not behaving in the way that
you expect. The free graphical debugger ptkdb is highly recommended -
it's available as Devel::ptkdb from CPAN. The standard perl distribution
also contains a powerful interactive debugger with a command-line
interface (use it like "perl -d