MEME - Input formats

The preferred sequence format for MEME is Pearson/Fasta format. For example,

>ICYA_MANSE

>LACB_BOVIN

Each sequence starts with a header line followed by sequence lines. A header line has the character ">" in position one, followed by a unique name without any spaces, followed by (optional) descriptive text. After the header line come the actual sequence lines. Spaces and blank lines are ignored. Sequences may be in uppercase, lowercase, or a mix of both.

MEME uses the first word in the header line of each sequence, truncated to 24 characters if necessary, as the name of the sequence. (The first word in the header line is everything following the ">" up to the first blank.) This name must be unique: sequences with duplicate names will be ignored.
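The naming rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not MEME's own reader: the function name read_fasta and the sample data are hypothetical, and the sketch assumes that the first sequence with a given name is kept and later duplicates are skipped.

```python
def read_fasta(text, max_name_len=24):
    """Parse Pearson/Fasta text into a dict mapping name -> sequence.

    Hypothetical helper written for illustration; not part of MEME.
    """
    sequences = {}   # dicts keep insertion order, so file order is preserved
    current = None   # name of the sequence being read; None while skipping
    for line in text.splitlines():
        if line.startswith(">"):
            # The name is the first word after ">", truncated to 24 characters.
            words = line[1:].split()
            name = words[0][:max_name_len] if words else ""
            if name in sequences:
                current = None  # assumption: later duplicates are skipped
            else:
                sequences[name] = []
                current = name
        elif current is not None:
            # Spaces and blank lines are ignored; case is kept as written.
            sequences[current].extend(line.split())
    return {name: "".join(parts) for name, parts in sequences.items()}


# Made-up sample data: a long name and a duplicate name.
fasta_sample = """>seq1 first example
ACGT acgt

TTGA
>name_longer_than_24_characters some description
GGCC
>seq1 duplicate name
AAAA
"""

records = read_fasta(fasta_sample)
for name, seq in records.items():
    print(name, len(seq))
```

Checking names up front in this way makes it easy to spot sequences that would be dropped because their names collide after truncation.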

Sequence weights may be specified in the dataset file by special header lines where the unique name is "WEIGHTS" (all caps) and the descriptive text is a list of sequence weights. Sequence weights are numbers in the range 0 < w <= 1. Weights are assigned in order to the sequences in the file. If there are more sequences than weights, the remaining sequences are given weight one. Weights may be specified by more than one "WEIGHTS" entry, which may appear anywhere in the file, but you must not put weights on lines that do not start with ">WEIGHTS". When weights are used, sequences contribute to motifs in proportion to their weights. Weights are useful, for example, in a file of three sequences where the first two sequences are very similar and it is desired to down-weight them.
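As a rough illustration of the weighting rules, the Python sketch below collects the weights from a dataset and pairs them with the sequences in file order. It is not MEME's implementation: the function name read_weights, the sample file, and the weight values are hypothetical, and the sketch assumes that surplus weights (more weights than sequences) are simply dropped. It does not check for duplicate sequence names.

```python
def read_weights(text):
    """Return one weight per sequence in file order.

    Hypothetical helper written for illustration; not part of MEME.
    """
    weights = []
    n_sequences = 0
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines never carry weights
        fields = line[1:].split()
        if fields and fields[0] == "WEIGHTS":
            for token in fields[1:]:
                w = float(token)
                if not 0.0 < w <= 1.0:
                    raise ValueError(f"weight out of range (0, 1]: {token}")
                weights.append(w)
        else:
            n_sequences += 1
    # Sequences without an explicit weight get weight one.
    weights.extend([1.0] * (n_sequences - len(weights)))
    # Assumption: weights beyond the number of sequences are dropped.
    return weights[:n_sequences]


# Made-up file of three sequences; the first two are very similar,
# so each is down-weighted (the value 0.5 is only an example).
weights_sample = """>WEIGHTS 0.5 0.5
>seq1
ACGTACGTAC
>seq2
ACGTACGTAA
>seq3
TTGACCATGG
"""

print(read_weights(weights_sample))  # [0.5, 0.5, 1.0]
```

Because the third sequence has no explicit weight, it falls back to the default of one, so the two similar sequences together contribute about as much as the third.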

The web version of MEME also accepts protein and DNA sequences in any of the following formats by converting them to Pearson/Fasta format. When using these formats, it is not possible to specify sequence weights.

Sequence formats that allow one or more sequences:

IG/Stanford, used by Intelligenetics and others
GenBank/GB, GenBank flatfile format
NBRF format
EMBL, EMBL flatfile format
DNAStrider, format of a common Mac program
Fitch format, limited use
Pearson/Fasta, a common format used by Fasta programs and others
Zuker format, limited use
Olsen, format printed by Olsen VMS sequence editor
Phylip3.2, sequential format for Phylip programs
Phylip, interleaved format for Phylip programs (v3.3, v3.4)
MSF, multiple sequence format used by GCG software
PAUP's multiple sequence (NEXUS) format
PIR/CODATA format used by PIR
ASN.1 format used by NCBI

Sequence formats that allow only one sequence (these formats cannot be used to input multiple sequences):

GCG, single sequence format of GCG software (use MSF format instead)
Plain/Raw, sequence data only (no name, document, numbering)