| Analysis |
Description |
| Promoter Extraction |
Promoter regions are extracted from the genomic sequence spanning
bases -1500 to +200 relative to the transcription start site.
Genomic coordinates are taken from the refGene.txt file from the
UCSC database.
Because of RefSeq redundancies, many genes
are represented multiple times to account for multiple transcripts.
In addition, some RefSeq genes map to more than one genomic
position, so there are positional redundancies as well. As a result,
each RefSeq accession is amended with a count that identifies its
unique position.
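
As a rough illustration of the windowing step, the following Python sketch computes the -1500 to +200 interval from a transcript's start, end and strand. It assumes UCSC-style 0-based, half-open coordinates, and it assumes that the window covers 1500 bases upstream of the transcription start site plus 200 bases beginning at the site itself. The function names and the `_1`, `_2` suffix scheme for positional redundancies are hypothetical and are not taken from the original pipeline.

```python
from collections import defaultdict


def promoter_window(tx_start, tx_end, strand, upstream=1500, downstream=200):
    """Return a 0-based, half-open (start, end) promoter interval.

    Assumes tx_start/tx_end are 0-based, half-open transcript coordinates.
    """
    if strand == "+":
        tss = tx_start                      # first transcribed base
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        tss = tx_end - 1                    # first transcribed base on the minus strand
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end               # clip at the chromosome start


def label_positions(records):
    """Append a per-accession position counter (hypothetical naming scheme).

    records: iterable of (accession, chrom, strand, tx_start, tx_end).
    """
    seen = defaultdict(list)                # accession -> distinct positions, in order seen
    labeled = []
    for acc, chrom, strand, tx_start, tx_end in records:
        pos = (chrom, strand, tx_start, tx_end)
        if pos not in seen[acc]:
            seen[acc].append(pos)
        labeled.append((f"{acc}_{seen[acc].index(pos) + 1}", pos))
    return labeled
```

On the minus strand the returned interval still refers to the forward genomic sequence, so the extracted bases would need to be reverse-complemented before any downstream scanning.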
|
| Composition |
The nucleotide composition of the promoter region is calculated.
Counts and percentages for each base (a, g, c, t and n) are determined
by counting over the entire extracted promoter region (-1500 to +200).
Composition analysis is performed for cross-promoter and cross-genome
comparisons.
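
The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `composition` and the returned dictionary layout are assumptions, not the format used by the original analysis.

```python
from collections import Counter


def composition(seq):
    """Return {base: (count, percent)} for a, g, c, t and n."""
    seq = seq.lower()
    counts = Counter(seq)
    total = len(seq)
    result = {}
    for base in "agctn":
        n = counts.get(base, 0)
        # Percentages are taken over the full sequence length.
        pct = 100.0 * n / total if total else 0.0
        result[base] = (n, pct)
    return result
```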
|
| STR Identification |
Each putative promoter region is scanned for perfect short tandem
repeats with repeat units between 2 and 16 bases in length,
repeated at least 4 times. A simple regular expression is used.
These sequences may be polymorphic and thus may affect the promoter
function in some way.
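
The exact regular expression is not given here, but a pattern of the following kind would match the stated criteria (unit of 2 to 16 bases, at least 4 perfect copies). It is a Python sketch; the name `find_strs` and the output tuple layout are assumptions.

```python
import re

# A unit of 2-16 bases (shortest unit preferred), followed by at least
# 3 further perfect copies of that unit, i.e. 4 or more copies in total.
STR_PATTERN = re.compile(r"([acgt]{2,16}?)\1{3,}", re.IGNORECASE)


def find_strs(seq):
    """Return (start, unit, copies) for each non-overlapping perfect STR."""
    hits = []
    for m in STR_PATTERN.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits
```

Note that with this pattern a run of 8 or more identical bases is also reported, as a dinucleotide unit such as `aa`; whether the original expression excludes such runs is not stated.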
|
| Stem Identification |
Each putative promoter region is scanned for palindromes using the
EMBOSS palindrome program and default parameters. These regions may
play a role in some interaction event within the promoter region,
or may be a conserved signal within the promoter region.
|
| Repeat Identification |
Each putative promoter region is also scanned for pairs of repeated
sequences. Identified repeat sequences must be perfect repeats of
at least 8 bases, separated from one another by no more than 200 bases.
These regions are extracted because they may also serve as recognition
motifs for unknown binding factors within the promoter regions.
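
One way to find such pairs is to index every 8-base word and pair up occurrences that lie close enough together. The Python sketch below does this under two assumptions that are not confirmed by the description above: the separation is measured from the end of the first copy to the start of the second, and overlapping copies are ignored. It reports 8-base seeds only, so a longer repeat shows up as several adjacent seed pairs rather than one merged hit. The function name is hypothetical.

```python
from collections import defaultdict


def find_repeat_pairs(seq, min_len=8, max_gap=200):
    """Return sorted (first_start, second_start, word) seed pairs."""
    seq = seq.lower()
    positions = defaultdict(list)            # word -> start positions, ascending
    for i in range(len(seq) - min_len + 1):
        word = seq[i:i + min_len]
        if set(word) <= set("acgt"):         # skip words containing n
            positions[word].append(i)

    pairs = []
    for word, starts in positions.items():
        for idx, i in enumerate(starts):
            for j in starts[idx + 1:]:
                gap = j - (i + min_len)      # bases between the two copies
                if gap < 0:
                    continue                 # overlapping copies are skipped
                if gap > max_gap:
                    break                    # later starts are even further away
                pairs.append((i, j, word))
    return sorted(pairs)
```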
|
|
TF Site Information Source
|
The transcription factor information was obtained from
the NCBI FTP site (under the repository directory)
as a series of files containing the site names, their consensus
binding sites, their publications and other information. The files
were placed there by the IFTI (http://www.ifti.org/) and are
periodically updated.
In particular, the files tfsites-dynamic,
tfsites-prolog and tfsites-sigscan were parsed into files for
MySQL and into files for scanning with a regular expression based
search method implemented in Perl. This parsing process identified
site names reported to have more than one binding site and amended
their names to make them unique. In addition, where several sites
shared the same recognition consensus sequence, the redundant sites
were removed from the search for efficiency. Both of these redundancy
types were loaded into tables in our local MySQL database and can be
queried using tools from:
Utils
It is anticipated that a search method that relies upon profile
searching rather than consensus site matching would improve both
sensitivity and selectivity in this process, and these methods
are currently being implemented.
|
|
TF Site Identification
|
Each putative promoter region is scanned for all known transcription
factor binding sites by matching each site's consensus sequence with a
simple Perl regular expression. In this method, ambiguous IUPAC
nucleotide codes are simply expanded to the full set of bases they
represent, and the search is performed using the constructed string.
Additional methods that utilize profiles for scanning will soon be
incorporated.
For each promoter, each putative site is listed at all positions where
a consensus match is found. Although very noisy, this analysis can
identify groups of factor sites which co-occur in subsets of a list
of input genes. The method benefits if genes from multiple species
can also be compared. The scoring method used for CIM outputs assigns a
value of 1 to those promoters that share a site. Please note that
considerable enhancements could result from incorporating a similar
position and similar neighbors analysis into this scoring algorithm,
and some of these methods are currently being tested as well. This will
likely result in a slightly different format for the data in the database,
since each position for each site will need to be considered rather than
simple presence or absence.
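
The consensus expansion and the presence or absence scoring described above might look like the following. This is a minimal sketch written in Python rather than the Perl used by the actual implementation, which is not shown here. The mapping is the standard IUPAC nucleotide code table. The function names, the forward-strand-only scan and the use of a lookahead to report overlapping matches are all assumptions.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Expand a consensus such as AYYT into A[CT][CT]T."""
    return "".join(IUPAC[c] for c in consensus.upper())


def scan(seq, consensus):
    """Return every start position where the consensus matches (forward strand)."""
    # The lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(%s))" % consensus_to_regex(consensus), re.IGNORECASE)
    return [m.start() for m in pattern.finditer(seq)]


def presence_matrix(promoters, sites):
    """Score 1 if a site matches anywhere in a promoter, else 0."""
    return {
        name: {site: int(bool(scan(seq, cons))) for site, cons in sites.items()}
        for name, seq in promoters.items()
    }
```

A position-aware score, as proposed above, would keep the list returned by `scan` instead of collapsing it to 0 or 1.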
|
|
TFSite Probabilities
|
During the parsing of the TFSite information, an estimated probability
of encountering the site by random chance in a particular sequence is also
calculated. The probability of each single nucleotide was estimated
from the composition of all extracted human putative promoter regions.
The frequencies turned out to be very close to 0.25 for each nucleotide,
so 0.25 was used as the estimate for each nucleotide. The probability of a
site is simply the product of the individual probabilities at each position
of its consensus sequence.
For example, for a site AYYT, where Y matches either C or T, the probability
would be 0.25 x 0.5 x 0.5 x 0.25 = 0.015625.
Users can filter the sites to remove those expected to be observed
frequently using the probability filter option in many of the pages.
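
The product rule described above can be sketched as follows in Python. The table gives the number of bases each standard IUPAC code represents. The function name and the `base_prob` parameter are illustrative assumptions.

```python
# Number of bases represented by each standard IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def site_probability(consensus, base_prob=0.25):
    """Estimated chance of matching the consensus at one given position."""
    p = 1.0
    for code in consensus.upper():
        # Each position contributes (bases allowed) x (per-base probability).
        p *= IUPAC_DEGENERACY[code] * base_prob
    return p
```

For AYYT this gives 0.25 x 0.5 x 0.5 x 0.25 = 0.015625, matching the worked example. A fully ambiguous position (N) contributes a factor of 1.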
|
|
SNP Identification
|
All known, genomically mapped SNPs that fall within
each putative promoter region are identified. These may indicate
positions of potential polymorphism that could impact promoter
function in an allele-specific manner.
|
|
Rmsk Identification
|
RepeatMasker elements are identified for each putative promoter region.
These elements may contain putative regulatory sequences, or other
chromatin-affecting functionality. In addition, they are identified
for PCR purposes.
|
|
Genscan Element Identification
|
The program Genscan is used to identify putative promoter and gene
information within each putative promoter region. This analysis
reveals additional gene regions within a promoter element.
|