Frequently Asked Questions

Take the tree and use Retree to make sure it is Unrooted (just read it into Retree and then save it, specifying Unrooted).
Use the unrooted tree as a User Tree (option U) in one of our programs (such as FITCH or DNAML). If you use FITCH, you also first need to use one of the distance programs such as DNADIST to compute a set of distances to serve as its input.
Specify that the branch lengths of the tree are not to be used but should be re-estimated. This is actually the default.

"I looked at the tree printed in the output file outfile and it looked weird. Do I always need to look at it in Drawgram?"

It's possible you are using the wrong font for looking at the tree in the output file. The tree is drawn with dashes and exclamation points. If a proportional font such as Times Roman or Helvetica is used, the tree lines may not connect. Try selecting the whole tree and setting the font to a fixed-width one such as Courier. You may be astounded how much clearer the tree has become.

"DrawTree (or DrawGram) doesn't work: it can't find the font file!"

Six font files, called font1 through font6, are distributed with the executables (and with the source code too). The program looks for a copy of one of them called fontfile. If you haven't made such a copy called fontfile it then asks you for the name of the font file. If they are in the current folder, just type one of font1 through font6. The reason for having the program look for fontfile is so that you can copy your favorite font file, call the copy fontfile, and then it will be found automatically without you having to type the name of the font file each time.

"Can DrawGram draw a scale beside the tree? Print the branch lengths as numbers?"

It can't do either of these. Doing so would make the program more complex, and it is not obvious how to fit the branch length numbers into a tree that has many very short internal branches. If you want these scales or numbers, choose an output plot file format (such as PostScript, PICT or PCX) that can be read by a drawing program such as Adobe Illustrator, Freehand, Canvas, CorelDraw, or MacDraw. Then you can add the scales and branch length numbers yourself by hand. Note the menu option in DrawTree and DrawGram that specifies the tree size to be a given number of centimeters per unit branch length.

"How can I get DrawGram or DrawTree to print the bootstrap values next to the branches?"

When you do bootstrapping and use Consense, it prints the bootstrap values in its output file (both in a table of sets, and on the diagram of the tree which it makes). These are also in the output tree file of Consense. There they are in place of branch lengths. So to get them to be on the output of DrawGram or DrawTree, you must write the tree in the format of a drawing program and use it to put the values in by hand, as mentioned in the answer to the previous question.

"I have an HP laser printer and can't get DrawGram to print on it"

DRAWGRAM and DRAWTREE produce a plot file (called plotfile): they do not send it to the printer. It is up to you to get the plot file to the printer. If you are running Windows this can probably be done with the Command tool and the command COPY/B PLOTFILE PRN:, unless your printer is a networked printer. The /B is important. If it is omitted the copy command will strip off the highest bit of each byte, which can cause the printing to fail or produce garbage.

"DNAML won't read the treefile that is produced by DNAPARS!"

That's because the DNAPARS tree file is a rooted tree, and DNAML wants an unrooted tree. Try using Retree to change the file to be an unrooted tree file. Our most recent versions of the programs usually automatically convert a rooted tree into an unrooted one as needed. But the programs such as DNAMLK or DOLLOP that need a rooted tree won't be able to use an unrooted tree.

"In bootstrapping, SEQBOOT makes too large a file"

If there are 1000 bootstrap replicates, it will make a file 1000 times as long as your original data set. But for many methods there is another way that uses much less file space. You can use SEQBOOT to make a file of multiple sets of weights, and use those together with the original data set to do bootstrapping.
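To see why weights save space, here is a minimal Python sketch of the idea. It is not SEQBOOT's code and it does not produce SEQBOOT's file format; the function name `bootstrap_weights` is hypothetical. Drawing sites with replacement is equivalent to keeping the original data set and recording, for each site, how many times it was drawn, so each replicate needs only one number per site rather than a full copy of the data.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement and return, for each site,
    how many times it was drawn (its weight in this replicate)."""
    draws = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [draws[i] for i in range(n_sites)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_weights(10, rng) for _ in range(3)]
for weights in replicates:
    # Each replicate's weights always add up to the number of sites.
    print(weights, "sum =", sum(weights))
```

A site with weight 0 is left out of that replicate, and a site with weight 2 counts twice.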

"In bootstrapping, the output file gets too big."

When running a program such as NEIGHBOR or DNAPARS with multiple data sets (or multiple weights) for purposes of bootstrapping, the output file is usually not needed, as it is the output tree file that is used next. You can use the menu of the program to turn off the writing of trees into the output file. The trees will still be written into the output tree file.

"Why don't your programs correctly read the sequence alignment files produced by ClustalW?"

They do read them correctly if you make the right kind. Files from ClustalV or ClustalW whose names end in ".aln" are not in PHYLIP format, but in Clustal's own format which will not work in PHYLIP. You need to find the option to output PHYLIP format files, which ClustalW and ClustalV usually assign the extension .phy.
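If you are not sure which kind of file you have, looking at its first line usually settles it. The Python sketch below is only a rough check, under two assumptions: that a Clustal-format file starts with a header line beginning with the word CLUSTAL, and that a PHYLIP-format file starts with two integers (the number of species and the number of sites). The function name `guess_alignment_format` is hypothetical.

```python
def guess_alignment_format(first_line):
    """Guess the format of an alignment file from its first line."""
    stripped = first_line.strip()
    if stripped.upper().startswith("CLUSTAL"):
        return "clustal"
    fields = stripped.split()
    # A PHYLIP file is assumed to begin with two whole numbers.
    if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
        return "phylip"
    return "unknown"


print(guess_alignment_format("CLUSTAL W (1.83) multiple sequence alignment"))
print(guess_alignment_format("   5   42"))
```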

"Why doesn't NEIGHBOR read my DNA sequences correctly?"

Because it wants to have as input a distance matrix, not sequences. You have to use DNADIST to make the distance matrix first.

How to make it do various things

"How do I bootstrap?"

The general method of bootstrapping involves running SEQBOOT to make multiple bootstrapped data sets out of your one data set, then running one of the tree-making programs with the Multiple data sets option to analyze them all, then running CONSENSE to make a majority rule consensus tree from the resulting tree file. Read the documentation of SEQBOOT to get further information. Before, only parsimony methods could be bootstrapped. With this new system almost any of the tree-making methods in the package can be bootstrapped. It is somewhat more tedious but you will find it much more rewarding.
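The last step, the majority rule consensus, can be illustrated with a small Python sketch. It is a simplification and not CONSENSE's actual code: each tree is represented by hand as a collection of groups of species (here, frozensets of one-letter names) instead of being read from a tree file, and the function name `majority_rule_groups` is hypothetical. The sketch counts how often each group occurs among the trees and keeps the groups found in more than half of them.

```python
from collections import Counter


def majority_rule_groups(trees):
    """Return {group: fraction} for groups found in more than half the trees.

    Each tree is a collection of frozensets, one per group of species
    that the tree contains.
    """
    counts = Counter()
    for groups in trees:
        counts.update(set(groups))  # count each group once per tree
    n = len(trees)
    return {g: c / n for g, c in counts.items() if c / n > 0.5}


trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AC"), frozenset("ABC")},
]
support = majority_rule_groups(trees)
for group, fraction in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(group)), round(fraction, 2))
```

The fractions play the role of the bootstrap values: a group that appears in two of three trees gets a support of about 0.67.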

"How do I specify a multi-species outgroup with your parsimony programs?"

It's not a feature but is not too hard to do in many of the programs. In parsimony programs like MIX, for which the W (Weights) and A (Ancestral states) options are available, and weights can be larger than 1, all you need to do is:

(a)	In MIX, make up an extra character with states 0 for all the outgroups and 1 for all the ingroups. If using DNAPARS the ingroup can have (say) `G` and the outgroup `A`.
(b)	Assign this character an enormous weight (such as `Z` for 35) using the W option, all other characters getting weight 1, or whatever weight they had before.
(c)	If it is available, use the A (Ancestral states) option to designate that for that new character the state found in the outgroup is the ancestral state.
(d)	In MIX do not use the O (Outgroup) option.
(e)	After the tree is found, the designated ingroup should have been held together by the fake character. The tree will be rooted somewhere in the outgroup (the program may or may not have a preference for one place in the outgroup over another). Make sure that you subtract from the total number of steps on the tree all steps in the new character.
In programs like DNAPARS, you cannot use this method as weights of sites cannot be greater than 1. But you do an analogous trick, by adding a largish number of extra sites to the data, with one nucleotide state ("A") for the ingroup and another ("G") for the outgroup. You will then have to use RETREE to manually reroot the tree in the desired place.
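Steps (a) and (b) for 0/1 data can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are held in a dictionary of species names and 0/1 strings, the function name `add_outgroup_character` is hypothetical, and writing the result into the input and weights files that the programs read is not shown.

```python
def add_outgroup_character(data, outgroup, weight_symbol="Z"):
    """Append a fake 0/1 character and build a matching weights string.

    data: dict mapping species name -> string of 0/1 character states.
    outgroup: set of species names that form the outgroup.
    Returns (new_data, weights), with one weight symbol per character.
    """
    n_chars = len(next(iter(data.values())))
    # Outgroup species get state 0, ingroup species get state 1.
    new_data = {
        name: states + ("0" if name in outgroup else "1")
        for name, states in data.items()
    }
    # Weight 1 for the real characters, a very large weight for the fake one.
    weights = "1" * n_chars + weight_symbol
    return new_data, weights


data = {"Alpha": "110", "Beta": "100", "Gamma": "011", "Delta": "001"}
new_data, weights = add_outgroup_character(data, {"Gamma", "Delta"})
for name, states in new_data.items():
    print(name, states)
print("weights:", weights)
```

Remember that the steps contributed by the fake last character still have to be subtracted from the total, as in step (e).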

"How do I force certain groups to remain monophyletic in your parsimony programs?"

By the same method as in the previous question, using multiple fake characters, any number of groups of species can be forced to be monophyletic. In MOVE, DOLMOVE, and DNAMOVE you can specify whatever outgroups you want without going to this trouble.

"How can I reroot one of the trees written out by PHYLIP?"

Use the program RETREE. But keep in mind whether the tree inferred by the original program was already rooted, or whether you are free to reroot it without changing its meaning.

"What do I do about deletions and insertions in my sequences?"

The molecular sequence programs will accept sequences that have gaps (the "-" character). They do various things with them, mostly not optimal. DNAPARS counts "gap" as if it were a fifth nucleotide state (in addition to A, C, G, and T). Each site counts one change when a gap arises or disappears. The disadvantage of this treatment is that a long gap will be overweighted, with one event per gapped site. So a gap of 10 nucleotides will count as being as much evidence as 10 single site nucleotide substitutions. If there are not overlapping gaps, one way to correct this is to recode the first site in the gap as "-" but make all the others be "?" so the gap only counts as one event. Other programs such as DNAML and DNADIST count gaps as equivalent to unknown nucleotides (or unknown amino acids) on the grounds that we don't know what would be there if something were there. This completely leaves out the information from the presence or absence of the gap itself, but does not bias the gapped sequence to be close to or far from other gapped or ungapped sequences. So it is not necessary to remove gapped regions from your sequences, unless the presence of gaps indicates that the region is badly aligned.
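The recoding suggested for DNAPARS (first site of a gap stays "-", the rest become "?") is easy to automate. Here is a minimal Python sketch; the function name `recode_gaps` is hypothetical, and it handles one sequence at a time, so it is appropriate only in the case described above where gaps in different sequences do not overlap.

```python
import re


def recode_gaps(sequence):
    """Keep the first '-' of every run of gaps and turn the rest into '?'."""
    return re.sub(r"-+", lambda m: "-" + "?" * (len(m.group()) - 1), sequence)


print(recode_gaps("ACG-----TTA-C--G"))  # prints ACG-????TTA-C-?G
```

The sequence keeps its length, so the alignment is unchanged; only the number of gap events counted per deletion or insertion drops to one.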

"How can I produce distances for my data set which has 0's and 1's?"

You can't do it in a simple and general way, for a straightforward reason. Distance methods must correct the distances for superimposed changes. Unless we know specifically how to do this for your particular characters, we cannot accomplish the correction. There are many formulas we could use, but we can't choose among them without much more information. There are issues of superimposed changes, as well as heterogeneity of rates of change in different characters. Thus we have not provided a distance program for 0/1 data. It is up to you to figure out what is an appropriate stochastic model for your data and to find the right distance formulas.

"I have RFLP fragment data: which programs should I use?"

This is a more difficult question than you may imagine. Here is a quick tour of the issues:

You can code fragment presence/absence as 0 and 1 and use a parsimony program. It is not obvious in advance whether 0 or 1 is ancestral, though it is likely that change in one direction is more probable than change in the other for each fragment. One can use either Wagner parsimony (programs MIX, PENNY or MOVE) or use Dollo parsimony (DOLLOP, DOLPENNY or DOLMOVE) with the ancestral states all set as unknown ("?"). The Wagner parsimony model allows change in both directions. Dollo parsimony allows more change in one direction than the other, and if the ancestral state is unknown it lets the data determine which way allows more change.
You can use a distance matrix method using the RFLP distance of Nei and Li (1979). Their restriction fragment distance is available in our program RestDist.
You should be very hesitant to bootstrap RFLP's. The individual fragments do not evolve independently: a single nucleotide substitution can eliminate one fragment and create two (or vice versa).
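The 0/1 coding mentioned in the first point above can be sketched in Python like this. It is an illustration only: the function name `code_fragments` is hypothetical, the fragment sizes are made up, and it assumes that two fragments of the same size in different species are the same fragment, which you would have to judge for your own gels.

```python
def code_fragments(fragments_by_species):
    """Turn {species: set of fragment sizes} into (sizes, {species: 0/1 string})."""
    # One character per distinct fragment size seen in any species.
    all_sizes = sorted(set().union(*fragments_by_species.values()))
    coded = {
        name: "".join("1" if size in present else "0" for size in all_sizes)
        for name, present in fragments_by_species.items()
    }
    return all_sizes, coded


fragments = {
    "Alpha": {500, 1200},
    "Beta": {500, 800},
    "Gamma": {800, 1200, 2000},
}
sizes, coded = code_fragments(fragments)
print(sizes)
for name, states in coded.items():
    print(name, states)
```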

For restriction sites (rather than fragments) life is a bit easier: they evolve nearly independently, so bootstrapping is possible and RESTML can be used, as well as restriction site distances computed in RESTDIST. Also, directionality of change is less ambiguous when parsimony is used. A more complete tour of the issues for restriction sites and restriction fragments is given in chapter 15 of my book (Felsenstein, 2004).

"Why don't your parsimony programs print out branch lengths?"

Well, DNAPARS and PARS can. The others have not yet been upgraded to the same level. The longer answer is that it is because there are problems defining the branch lengths. If you look closely at the reconstructions of the states of the hypothetical ancestral nodes for almost any data set and almost any parsimony method you will find some ambiguous states on those nodes. There is then usually an ambiguity as to which branch the change is actually on. Other parsimony programs resolve this in one or another arbitrary fashion, sometimes with the user specifying how (for example, methods that push the changes up the tree as far as possible or down it as far as possible). Our older programs leave it to the user to do this. In DNAPARS and PARS we use an algorithm discovered by Hochbaum and Pathria (1997) (and independently by Wayne Maddison) to compute branch lengths that average over all possible placements of the changes. But these branch lengths, as nice as they are, do not correct for multiple superimposed changes. Few programs available from others currently correct the branch lengths for multiple changes of state that may have overlain each other. One possible way to get branch lengths with nucleotide sequence data is to take the tree topology that you got, use RETREE to convert it to be unrooted, prepare a distance matrix from your data using DNADIST, and then use FITCH with that tree as User Tree and see what branch lengths it estimates.

"Why can't your programs handle unordered multistate characters?"

In this 3.6 release there is a program PARS which does parsimony for unordered multistate characters with up to 8 states, plus ?. The other discrete characters parsimony programs can only handle two states, 0 and 1. This is mostly because I have not yet had time to modify them to do so - the modifications would have to be extensive. Ultimately I hope to get these done. If you have four or fewer states and need a feature that is not in PARS, you could recode your states to look like nucleotides and use the parsimony programs in the molecular sequence section of PHYLIP, or you could use one of the excellent parsimony programs produced by others.

Background information needed:

"What file format do I use for the sequences?" "How do I use the programs? I can't find any documentation!"

These are discussed in the documentation files. Do you have them? You probably do. They are in a separate archive from the executables (they are in the Documentation and Sources archives, which you should definitely fetch). Input file formats are discussed in main.html, in sequence.html, distance.html, contchar.html, discrete.html, and the documentation files for the individual programs.

"Where can I find out how to infer phylogenies?"

There are now a few books. For molecular data you could use one of these: At the upper-undergraduate level:

Graur, D. and W.-H. Li. 2000. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, Massachusetts. (or the earlier edition by Li and Graur).
Page, R. D. P. and E. C. Holmes. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell, Oxford.

and as graduate-level texts:

Nei, M. and S. Kumar. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford.
Li, W.-H. 1999. Molecular Evolution. Sinauer Associates, Sunderland, Massachusetts.

For more mathematically-oriented readers, there is the book

Semple, C., and M. Steel. 2003. Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications, volume 24. Oxford University Press, Oxford.

Best of all is of course my own book on phylogenies, which covers the subject for many data types, at a graduate course level:

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.

There are also some recent books that take a more practical hands-on approach, and give some detailed information on how to use programs, including some PHYLIP programs. These include:

Hall, B. G. 2004. Phylogenetic Trees Made Easy, 2nd edition. Sinauer Associates, Sunderland, Massachusetts.
Salemi, M., and A.-M. Vandamme (eds.) 2003. The Phylogenetic Handbook. A Practical Approach to DNA and Protein Phylogeny. Cambridge University Press, Cambridge.

A useful article introducing the inference of phylogenies at a more elementary level is:

Baldauf, S. L. 2003. Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19: 345-351.

There is an excellent guide to using PHYLIP 3.6 for molecular analyses available. It is by Jarno Tuimala:

Tuimala, J. 2004. A Primer to Phylogenetic Analysis using Phylip Package. 2nd edition. Center for Scientific Computing, Espoo, Finland.

and it is available as a PDF here.

In addition, one of these three older review articles may help:

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. pp. 407-514 in Molecular Systematics, 2nd ed., ed. D. M. Hillis, C. Moritz, and B. K. Mable. Sinauer Associates, Sunderland, Massachusetts.
Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.
Felsenstein, J. 1988. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics 19: 445-471.

Questions about distribution and citation:

"If I copied PHYLIP from a friend without you knowing, should I try to keep you from finding out?"

No. It is to your advantage and mine for you to let me know. If you did not get PHYLIP "officially" from me or from someone authorized by me, but copied a friend's version, you are not in my database of users. You may also have an old version which has since been substantially improved. I don't mind you "bootlegging" PHYLIP (it's free anyway), but you should realize that you may have copied an outdated version. If you are reading this Web page, you can get the latest version just as quickly over the Internet. It will help both of us if you get onto my mailing list. If you are on it, then I will give your name to other nearby users when they ask for the names of nearby users, and they are urged to contact you and update your copy. (I benefit by getting a better feel for how many distributions there have been, and having a better mailing list to use to give other users local people to contact.) Use the registration form which can be accessed through our web site's registration page.

"How do I make a citation to the PHYLIP package in the paper I am writing?"

One way is like this:

Felsenstein, J. 2005. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.

or if the editor for whom you are writing insists that the citation must be to a printed publication, you could cite a notice for version 3.2 published in Cladistics:

Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.

For a while a printed version of the PHYLIP documentation was available and one could cite that. This is no longer true. Other than that, this is difficult, because I have never written a paper announcing PHYLIP! My 1985b paper in Evolution on the bootstrap method contains a one-paragraph Appendix describing the availability of this package, and that can also be cited as a reference for the package, although the package has been distributed since 1980 while the bootstrap paper dates from 1985. A paper on PHYLIP is needed mostly to give people something to cite, as word-of-mouth, references in other people's papers, and electronic newsgroup postings have spread the word about PHYLIP's existence quite effectively.

"Can I make copies of PHYLIP available to the students in my class?"