1) Use Seqboot to create the bootstrap datasets (100 or 1000). (Turning off the display of run indications will speed up the analysis.)
2) Rename the 'outfile' to 'infile'. [In UNIX: % rm infile, then % mv outfile infile]
3) Run Dnadist (change to sequential and multiple datasets)
4) Rename the 'outfile' to 'infile'.
5) Run Neighbor (change to multiple datasets)
6) Rename the 'treefile' to 'infile'.
7) Run Consense (commas in the names of taxa will cause a problem at this point!)
8) Find the bootstrap tree in 'outfile'.
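The renaming in steps 2, 5 and 7 is easy to get wrong by hand, so here is a minimal Python sketch of the bookkeeping only. It assumes the file names used in the steps above ('infile', 'outfile', 'treefile'). The `run_program` argument is a hypothetical callable that you supply yourself to start each PHYLIP program and answer its menu; the menu keystrokes are left out on purpose because they are not given here.

```python
import os
from pathlib import Path

# (program, file to rename to 'infile' afterwards), following steps 1-8 above.
STAGES = [
    ("seqboot", "outfile"),
    ("dnadist", "outfile"),
    ("neighbor", "treefile"),
    ("consense", None),  # nothing to rename: the bootstrap tree stays in 'outfile'
]


def promote(workdir, produced, target="infile"):
    """Equivalent of 'rm infile; mv outfile infile' inside workdir."""
    src = Path(workdir) / produced
    if not src.is_file():
        raise FileNotFoundError(f"{produced!r} was not written in {workdir}")
    # os.replace overwrites an existing target, so no separate 'rm' is needed.
    os.replace(src, Path(workdir) / target)


def run_bootstrap(workdir, run_program):
    """Run every stage in order, renaming the output between stages.

    run_program(program_name, workdir) is a caller-supplied (hypothetical)
    function that actually runs the program and answers its menu.
    """
    for program, produced in STAGES:
        run_program(program, workdir)
        if produced is not None:
            promote(workdir, produced)
    return Path(workdir) / "outfile"
```

The function returns the path of the final 'outfile', which is where step 8 says the bootstrap tree is found.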
Problem: sequences out of alignment, or base ratios wrong. A possible cause is that you have a sequential-format file rather than an interleaved file; the solution is to choose the option "i" in PHYLIP.
nice dnaml (follow menu to start program)
^Z (this will stop program and return you to command line)
jobs (this will list any jobs running and give their number)
bg %1 (this will start job # 1 running in the background)
ps -u your_user_name -l (this will give a long list of all your jobs)
[I strongly suggest you use the sequential format! -DC.]
The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main document file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names.
Here are hypothetical examples of interleaved format and sequential format (same sequences).

Interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

Sequential format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

In interleaved format the present versions of the programs may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage-return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that the programs complain that the sequences are not properly aligned, and you can find no other cause for this complaint.
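To make the difference between the two layouts concrete, here is a small Python sketch that reads either one into a name-to-sequence dictionary. It is not how the PHYLIP programs themselves read files; it only follows the rules stated in the text (a header with the species and site counts, a 10-character name field, blanks and digits ignored) and assumes that no options lines follow the header. All function names are made up for this illustration.

```python
INTERLEAVED = """\
  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""

SEQUENTIAL = """\
  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
"""


def _bases(chunk):
    """Keep sequence symbols only: blanks and digits are ignored."""
    return "".join(c for c in chunk if not (c.isspace() or c.isdigit())).upper()


def _header_and_rows(text):
    # Whitespace-only lines are dropped here, so stray blanks on the
    # separator lines of an interleaved file do no harm in this sketch.
    rows = [ln for ln in text.splitlines() if ln.strip()]
    nspecies, nsites = (int(tok) for tok in rows[0].split()[:2])
    return nspecies, nsites, rows[1:]


def _checked(seqs, nspecies, nsites):
    wrong = [name for name, seq in seqs.items() if len(seq) != nsites]
    if len(seqs) != nspecies or wrong:
        raise ValueError(f"sequences not properly aligned: {wrong}")
    return seqs


def parse_sequential(text):
    """All of one sequence (possibly on several lines) before the next name."""
    nspecies, nsites, rows = _header_and_rows(text)
    seqs, i = {}, 0
    for _ in range(nspecies):
        if i >= len(rows):
            raise ValueError("file ended before all species were read")
        name, seq = rows[i][:10].strip(), _bases(rows[i][10:])
        i += 1
        while len(seq) < nsites and i < len(rows):
            seq += _bases(rows[i])
            i += 1
        seqs[name] = seq
    return _checked(seqs, nspecies, nsites)


def parse_interleaved(text):
    """Named first block, then unnamed blocks in the same species order."""
    nspecies, nsites, rows = _header_and_rows(text)
    names = [row[:10].strip() for row in rows[:nspecies]]
    seqs = [_bases(row[10:]) for row in rows[:nspecies]]
    for k, row in enumerate(rows[nspecies:]):
        seqs[k % nspecies] += _bases(row)
    return _checked(dict(zip(names, seqs)), nspecies, nsites)


if __name__ == "__main__":
    print(parse_interleaved(INTERLEAVED) == parse_sequential(SEQUENTIAL))
```

Reading a file with the wrong parser usually produces sequences of the wrong length, which is the same "not properly aligned" symptom described above.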
INPUT FOR THE DNA SEQUENCE PROGRAMS
The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing.
These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.
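As an illustration of these rules, the following Python sketch cleans up a raw sequence line the way the text describes: blanks and digits are skipped, letters are converted to upper case, and anything outside the accepted symbol list (including the period) is rejected. The function name `normalize` and the choice to raise an error are assumptions made for this example, not a description of PHYLIP's own code.

```python
# Symbols listed in the text as accepted by the DNA sequence programs.
ALLOWED = set("ABCDGHKMNORSTUVWXY?-")


def normalize(raw):
    """Return the sequence symbols in raw, upper-cased, without blanks or digits."""
    out = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # ignored, so numbered GENBANK/EMBL lines can be pasted in
        symbol = ch.upper()
        if symbol not in ALLOWED:
            raise ValueError(f"character {ch!r} is not an accepted sequence symbol")
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    # A line copied from a numbered, lower-case sequence entry.
    print(normalize("        1 aagctngggc atttcagggt"))
```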
Symbol   Meaning                  Symbol   Meaning
------   -------                  ------   -------
A        Adenine                  Y        pYrimidine (C or T)
G        Guanine                  R        puRine (A or G)
C        Cytosine                 W        "Weak" (A or T)
T        Thymine                  S        "Strong" (C or G)
U        Uracil                   K        "Keto" (T or G)
M        "aMino" (C or A)         V        not T (A or C or G)
B        not A (C or G or T)      X,N,?    unknown (A or C or G or T)
D        not C (A or G or T)      O        deletion (letter)
H        not G (A or C or T)      -        deletion
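The symbol table above can be written down as a lookup table. The Python sketch below does that and adds two small helpers whose names are invented for this example. Two choices in it are assumptions rather than statements from the text: U is mapped to the same base as T, and the deletion symbols are given an empty set of bases.

```python
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "T",  # assumption: uracil is handled like thymine
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "", "-": "",  # assumption: a deletion stands for no base at all
}


def expand(symbol):
    """Return the set of bases a symbol may stand for (case-insensitive)."""
    return frozenset(IUPAC[symbol.upper()])


def is_ambiguous(symbol):
    """True when the symbol stands for more than one possible base."""
    return len(expand(symbol)) > 1


if __name__ == "__main__":
    for s in "AYN-":
        print(s, sorted(expand(s)), is_ambiguous(s))
```

A symbol that is not in the table, such as a period, raises a `KeyError` in this sketch.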
The programs allow options chosen from their menus. Many of these are as described in the main documentation file, particularly the options J, O, U, T, W, and Y (although T has a different meaning in the programs DNAML and DNADIST than in the others).
The U option indicates that user-defined trees are provided at the end of the input file. This happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and DNAMLK, the trees must be strictly bifurcating, containing only two-way splits, e.g. ((A,B),(C,(D,E)));. For DNAML and RESTML the tree must have a trifurcation at its base, e.g. ((A,B),C,(D,E));. The root of the tree may in those cases be placed arbitrarily, since the trees needed are actually unrooted, though they look different when printed out. The program RETREE should enable you to reroot the trees without having to hand edit or retype them. For DNAMOVE the U option is not available (although there is an equivalent feature which uses rooted user trees).
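Before pasting a user tree into an input file it can help to check which of the two shapes it has. The Python sketch below parses a parenthesized tree like the two examples above and counts the branches at each fork. It is a simplified reader written for this illustration (it strips all whitespace and ignores branch lengths and internal labels), and the function names are invented here.

```python
def parse_tree(text):
    """Turn '((A,B),C,(D,E));' into nested lists: [['A', 'B'], 'C', ['D', 'E']]."""
    s = "".join(text.split()).rstrip(";")  # simplification: drop all whitespace
    pos = 0

    def label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1
            children = [node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            label()  # skip any internal label or branch length
            return children
        return label().split(":")[0]  # tip name without branch length

    tree = node()
    if pos != len(s):
        raise ValueError("unexpected characters after the tree")
    return tree


def child_counts(tree):
    """Yield the number of branches at every fork, starting with the base."""
    if isinstance(tree, list):
        yield len(tree)
        for child in tree:
            yield from child_counts(child)


def is_strictly_bifurcating(tree):
    """Shape asked for by PROTPARS, DNAPARS, DNACOMP and DNAMLK."""
    return all(n == 2 for n in child_counts(tree))


def has_basal_trifurcation(tree):
    """Shape asked for by DNAML and RESTML: three-way split at the base only."""
    counts = list(child_counts(tree))
    return bool(counts) and counts[0] == 3 and all(n == 2 for n in counts[1:])


if __name__ == "__main__":
    for newick in ("((A,B),(C,(D,E)));", "((A,B),C,(D,E));"):
        t = parse_tree(newick)
        print(newick, is_strictly_bifurcating(t), has_basal_trifurcation(t))
```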
This page is maintained by Dave Carmean with an eye towards speed and clarity, and last modified 8 April 1997. Comments or suggestions are welcomed!