TRANSFAC Professional Documentation

SITE

The SITE table gives information on individual (regulatory) protein binding sites. It contains three different kinds of entries (see Statistics). First, there are sites within eukaryotic genes from species ranging from yeast to human. Second, there are artificial sequences that resulted from mutagenesis studies, from in vitro selection procedures starting from random oligonucleotide mixtures, or from specific theoretical considerations. Finally, SITE includes consensus binding sequences given in the IUPAC code, many of them taken from the compilation of Faisst and Meyer (Nucleic Acids Res. 20:3-26, 1992). The symbols used in addition to A, C, G, or T for these consensus sequences are:

W = A or T	S = C or G
R = A or G	Y = C or T
K = G or T	M = A or C
B = C, G or T	D = A, G or T
H = A, C or T	V = A, C or G
N = A, C, G or T
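
The code table above can be applied mechanically when comparing a consensus string with a concrete sequence. The following is a minimal Python sketch, not part of TRANSFAC itself: the function names `consensus_to_regex` and `matches` are hypothetical, and the consensus string in the usage note is purely illustrative.

```python
import re

# IUPAC nucleotide codes exactly as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Translate an IUPAC consensus string into a compiled regular expression.

    Hypothetical helper: each degenerate symbol becomes a character class.
    Raises KeyError for symbols that are not in the table.
    """
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    # Case is ignored because site sequences may mix capitals and lowercase.
    return re.compile("".join(parts), re.IGNORECASE)


def matches(consensus: str, sequence: str) -> bool:
    """Return True if the whole sequence is compatible with the consensus."""
    return consensus_to_regex(consensus).fullmatch(sequence) is not None
```

For example, `matches("TGASTCA", "TGACTCA")` is true because S stands for C or G, whereas `matches("TGASTCA", "TGATTCA")` is false.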

A number of consensus sequences have been generated by the TRANSFAC® team, generally derived from the profiles stored in the MATRIX table. Here, degenerate codes are assigned according to the following rules (adapted from Cavener, Nucleic Acids Res. 15:1353-1361, 1987):

Rule 1: A single nucleotide is shown if its frequency is at least 50% and at least twice as high as the second most frequent nucleotide.
Rule 2: A double-degenerate code indicates that the corresponding two nucleotides occur in at least 75% of the underlying sequences and rule 1 does not apply.
Rule 3: Usage of triple-degenerate codes is restricted to those positions where one of the nucleotides did not show up at all in the sequence set and none of the aforementioned rules applies.
Rule 4: All other frequency distributions are represented by the letter "N".
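
The four rules can be sketched as a small Python function that maps the nucleotide counts of one matrix position to a consensus symbol. This is an illustration under stated assumptions, not the TRANSFAC implementation: the names `consensus_symbol` and `consensus_from_matrix` are hypothetical, rule 2 is read as "the two most frequent nucleotides together reach 75%", and ties between equally frequent nucleotides are broken in A, C, G, T order, which the rules do not specify.

```python
# Degenerate codes keyed by the set of nucleotides they stand for.
CODE_FOR = {
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}


def consensus_symbol(counts: dict) -> str:
    """Return the consensus symbol for one position, given counts per base."""
    total = sum(counts.get(b, 0) for b in "ACGT")
    if total == 0:
        return "N"  # assumption: an empty column carries no information
    # Most frequent base first; ties keep A, C, G, T order (stable sort).
    ranked = sorted("ACGT", key=lambda b: counts.get(b, 0), reverse=True)
    first = counts.get(ranked[0], 0)
    second = counts.get(ranked[1], 0)
    # Rule 1: at least 50% and at least twice the runner-up.
    if 2 * first >= total and first >= 2 * second:
        return ranked[0]
    # Rule 2 (assumed reading): the top two bases together reach 75%.
    if 4 * (first + second) >= 3 * total:
        return CODE_FOR[frozenset(ranked[:2])]
    # Rule 3: exactly one base never occurs at this position.
    absent = [b for b in "ACGT" if counts.get(b, 0) == 0]
    if len(absent) == 1:
        return CODE_FOR[frozenset("ACGT") - frozenset(absent)]
    # Rule 4: everything else.
    return "N"


def consensus_from_matrix(rows: list) -> str:
    """Apply consensus_symbol to every position of a count matrix."""
    return "".join(consensus_symbol(row) for row in rows)
```

Integer comparisons are used instead of floating-point fractions so that borderline cases such as exactly 50% or exactly 75% are decided exactly.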

In addition to transcription factor binding sites, as of release 10.4 we have also started to include mRNA sequence parts that are targets for miRNA interaction.

Fields

Some fields may be empty in individual entries; empty fields are not displayed.

	Site summary	general information about the SITE entry: number of linked transcription factors, of factor sources, of linked external database entries, and of references.
AC	Accession number	"R" + 5-digit number
AS	Accession numbers, secondary	when two or more entries are merged, the additional accession numbers, separated by commas, are stored in this field
ID	Identifier	{species acronym}${gene acronym}_{consecutive site number}
DT	Created; Updated	Created: date of entry creation; entry author. Updated: date of last entry update; updater.
TY	Sequence type	D (DNA) or R (RNA)
DE	Description	short gene term (explicit gene name); GENE accession no.
OS	Species	biological species
OC	Taxonomic classification	systematic biological classification of the species
RE	Gene region	functional region of the gene (e. g. promoter, enhancer, intron etc.)
SQ	Sequence	Site sequence(s)
EL	Element	specific denomination of this site (if available), such as CRE (cAMP-response element)
S1, SF, ST	Reference point, Start position, End position	the position numbers normally refer to the transcription start site (+1); where this is not the case, the reference point is stated explicitly
BF	Binding factors	factors shown to bind to this site (linked accession number; name; "Quality" of the factor-site interaction on a six level scale; biological species of the factor.)
MX	Matrices	nucleotide distribution matrices derived from alignment of this and other binding sites of a specific factor (linked accession number; identifier.)
SO	Cellular factor source	expression system (tissue, cell line, ...) the factor or binding activity was derived from (linked accession number; short description)
MM	Method(s)	methods applied for the identification of this site
CC	Comments	any additional comments, e. g. on the functionality of the site or on sequence conflicts with corresponding EMBL entries
DR	External database links	name of database (e. g. TRANSPRO, PathoDB, Flybase, EPD): database accession number; identifier (where available). EMBL: accession number; identifier (first site position:last site position). RSNP: accession number; EMBL: accession number; pos: SNP position in EMBL sequence; var: variation introduced by SNP.
RN	Reference number	[consecutive entry reference number]; reference accession number.
RX		PUBMED; link to PubMed entry.
RA	Reference authors	authors (NOTE: accents are omitted, German umlauts are transcribed as follows: ä -> ae, ö -> oe, ü -> ue; German "s-z" (ß) -> ss)
RT	Reference title	reference title (NOTE: Greek letters are expanded to alpha, beta, gamma etc.)
RL	Reference source	journal volume:pages (year)

Criteria

The first criterion for a site to be included in TRANSFAC® is protein binding, the second is function. Each site is assigned an unambiguous accession number and an identifier. The identifier is composed of an abbreviation for the species (e. g., HS for human), a code for the gene description, and a consecutive number for each entry referring to a particular gene. Thus, HS$BAC_02 refers to the second entry for the human gene beta-actin.
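
The identifier and accession number formats described here and in the field list can be checked with two regular expressions. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the species acronym contains no "$" and that the consecutive number follows the last underscore of the identifier.

```python
import re

# {species acronym}${gene acronym}_{consecutive site number}
SITE_ID = re.compile(r"^(?P<species>[^$]+)\$(?P<gene>.+)_(?P<number>\d+)$")

# "R" followed by a 5-digit number
SITE_AC = re.compile(r"^R\d{5}$")


def parse_site_id(identifier: str) -> tuple:
    """Split a SITE identifier into (species acronym, gene acronym, number)."""
    match = SITE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a SITE identifier: {identifier!r}")
    return match["species"], match["gene"], int(match["number"])


def is_site_accession(accession: str) -> bool:
    """Return True if the string has the form of a SITE accession number."""
    return SITE_AC.match(accession) is not None
```

With the example from the text, `parse_site_id("HS$BAC_02")` yields the species acronym `HS`, the gene acronym `BAC`, and the site number 2.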

The description of a gene is the name of the gene itself or of its product, depending on what the more popular term may be.

The positions have preferably been taken from DNase I footprinting studies, if available. The next preference is for chemical modifications, the last for gel retardation assays. In case of different positional information for both DNA strands, the more upstream position has been taken for the 5' border, the more downstream position for the 3' border of the site. If not stated otherwise in the field "Reference point (+1)", the position numbers generally refer to the transcription start site. Occasionally (or normally for yeast genes due to their generally more heterogeneous cap site), they may refer to the translation start codon stated as "Reference point (+1) | ATG". Other reference systems such as defined restriction sites may be indicated as well. If "First position of element" and "Last position of element" are missing, no positions were given by the references cited. If "First position of element" has a negative or positive value, but "Last position of element" is missing, no precise boundaries of the site were given, but it was located "around position" instead.
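
The conventions for positions can be summarized in a short Python sketch. It is an illustration, not TRANSFAC code: the function names are hypothetical, positions are assumed to be integers on a common coordinate scale in which smaller numbers lie further upstream, and `None` stands for a missing field.

```python
def merged_borders(strand_a: tuple, strand_b: tuple) -> tuple:
    """Combine (start, end) positions reported separately for both strands.

    The more upstream start becomes the 5' border and the more
    downstream end becomes the 3' border.
    """
    return min(strand_a[0], strand_b[0]), max(strand_a[1], strand_b[1])


def describe_position(first, last, reference_point=None) -> str:
    """Describe the positional information of an entry in words."""
    # Without an explicit reference point, the transcription start site applies.
    ref = reference_point or "transcription start site"
    if first is None:
        if last is None:
            return "no positions given in the cited references"
        # This combination is not described in the documentation.
        raise ValueError("last position without first position")
    if last is None:
        return f"around position {first:+d} relative to {ref}"
    return f"{first:+d} to {last:+d} relative to {ref}"
```

For instance, `describe_position(-120, None)` reports a site located around position -120, and `describe_position(-65, -58, "ATG")` reports a range counted from the translation start codon.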

The sequences depicted have been taken from the literature. Some conflicting data with sequences within the EMBL data library are mentioned in the comment field. In case of diverging site borders on both strands, only the overlapping sequence is given. When the authors emphasized a certain sequence motif within a sequence, it is written in capitals while the rest of the sequence is shown in lowercase letters.
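
Because emphasized motifs are written in capitals within an otherwise lowercase sequence, they can be recovered with a simple pattern match. The following minimal Python sketch uses a hypothetical function name and a made-up sequence in the usage note; it assumes that a sequence may contain several capitalized runs and reports each with its zero-based offset.

```python
import re


def emphasized_motifs(sequence: str) -> list:
    """Return (offset, motif) pairs for every run of capital letters."""
    return [(m.start(), m.group()) for m in re.finditer(r"[A-Z]+", sequence)]
```

For the made-up sequence `ggtcTGACGTCAtcc`, the function returns the single motif `TGACGTCA` at offset 4. A sequence written entirely in capitals is returned as one motif at offset 0, so the result cannot distinguish "no emphasis" from "everything emphasized".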

Cross-references to the EMBL data library also give the positions of the TRANSFAC® site within the EMBL sequence; negative numbers point to the complementary strand.

The factor that binds to this sequence element is given with its TRANSFAC® accession number from the FACTOR table, (one of) its name(s) (see the FACTOR table for possible synonyms), and a "quality" value ranging from 1 to 6 that reflects the experimental reliability of a certain protein-DNA interaction. These values have the following meaning:

1. functionally confirmed factor binding site
2. binding of pure protein (purified or recombinant)
3. immunologically characterized binding activity of a cellular extract
4. binding activity characterized via a known binding sequence
5. binding of uncharacterized extract protein to a bona fide element
6. no quality assigned
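
When processing entries programmatically, the quality scale can be kept as a lookup table. The sketch below assumes that the six meanings listed above correspond, in order, to the values 1 to 6; the names `QUALITY_LEVELS` and `describe_quality` are hypothetical.

```python
# Quality values 1 to 6, assumed to follow the order of the list above.
QUALITY_LEVELS = {
    1: "functionally confirmed factor binding site",
    2: "binding of pure protein (purified or recombinant)",
    3: "immunologically characterized binding activity of a cellular extract",
    4: "binding activity characterized via a known binding sequence",
    5: "binding of uncharacterized extract protein to a bona fide element",
    6: "no quality assigned",
}


def describe_quality(value: int) -> str:
    """Return the meaning of a quality value, or raise ValueError."""
    if value not in QUALITY_LEVELS:
        raise ValueError(f"quality must be between 1 and 6, got {value!r}")
    return QUALITY_LEVELS[value]
```

Note that the value 6 marks the absence of an assessment rather than a sixth grade of evidence, so it may need separate handling when entries are filtered by quality.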

The cellular protein source leading to the identification of a particular site is included in the SITE table as well.