ABCC GRID Promoter Analysis Descriptions
(02/09/2010 09:47:19)

Analysis Description
Promoter Extraction Promoter regions are extracted from the genomic sequence spanning bases -1500 to +200 relative to the transcription start site. Genomic coordinates were taken from the refGene.txt file of the UCSC database. Because of RefSeq redundancies, many genes are represented multiple times to account for multiple transcripts. In addition, some RefSeq genes map to more than one genomic position, so there are positional redundancies as well. As a result, each RefSeq accession is amended with a count that distinguishes its unique positions.
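The window calculation described above can be sketched in Python as follows. This is a minimal illustration, not the GRID implementation: it assumes UCSC-style 0-based, half-open coordinates, treats txStart (plus strand) or txEnd (minus strand) as the transcription start site, and uses a hypothetical `accession_N` label for the positional count. The function names are illustrative only.

```python
from collections import defaultdict


def promoter_window(tx_start, tx_end, strand, upstream=1500, downstream=200):
    """Return a 0-based, half-open (start, end) promoter window.

    Assumes the TSS is tx_start on the '+' strand and tx_end on the '-' strand.
    """
    if strand == "+":
        start, end = tx_start - upstream, tx_start + downstream
    elif strand == "-":
        start, end = tx_end - downstream, tx_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end  # clamp at the start of the chromosome


def label_positions(records):
    """Attach a per-accession position counter (hypothetical label format).

    records: iterable of (accession, chrom, strand, tx_start, tx_end).
    """
    seen = defaultdict(list)
    labelled = []
    for acc, chrom, strand, tx_start, tx_end in records:
        pos = (chrom, strand, tx_start, tx_end)
        if pos not in seen[acc]:
            seen[acc].append(pos)
        idx = seen[acc].index(pos) + 1
        window = promoter_window(tx_start, tx_end, strand)
        labelled.append((f"{acc}_{idx}", chrom, strand) + window)
    return labelled
```

On the minus strand the window is mirrored, so the 1500 upstream bases lie at higher genomic coordinates than the TSS.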
Composition The nucleotide composition of the promoter region is calculated. Counts and percentages for each base (a, g, c, t and n) are determined over the entire extracted promoter region (-1500 to +200). Composition analysis supports cross-promoter and cross-genome comparisons.
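A composition count of this kind might look like the sketch below. The function name and the returned structure are assumptions made for illustration; only the five symbols named above are tallied.

```python
from collections import Counter


def composition(seq):
    """Return {base: (count, percent)} for a, g, c, t and n."""
    seq = seq.lower()
    counts = Counter(seq)
    total = len(seq)
    result = {}
    for base in "agctn":
        n = counts.get(base, 0)
        percent = 100.0 * n / total if total else 0.0
        result[base] = (n, percent)
    return result
```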
STR Identification Each putative promoter region is scanned for perfect short tandem repeats with repeat units of 2 to 16 bases repeated at least 4 times. A simple regular expression is used. These sequences may be polymorphic and thus may affect promoter function.
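The original expression is not shown here, so the following Python sketch is only one plausible form of such a regular expression. It assumes a back-reference pattern with a lazily matched unit of 2 to 16 bases followed by at least 3 further copies; the function name and return format are illustrative.

```python
import re

# Unit of 2-16 bases (shortest unit preferred), then 3 or more further copies.
STR_PATTERN = re.compile(r"([ACGT]{2,16}?)\1{3,}")


def find_strs(seq):
    """Return (start, unit, copies) for each perfect tandem repeat found."""
    hits = []
    for m in STR_PATTERN.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits
```

Note that with this pattern a long homopolymer run is reported as a repeat of a two-base unit, because the minimum unit length is 2.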
Stem Identification Each putative promoter region is scanned for palindromes using the EMBOSS palindrome program and default parameters. These regions may play a role in some interaction event within the promoter region, or may be a conserved signal within the promoter region.
Repeat Identification Each putative promoter region is also scanned for pairs of repeated sequences. Identified repeat sequences must be perfect repeats of at least 8 bases separated from one another by no more than 200 bases. These regions are extracted because they may also serve as recognition motifs for unknown binding factors within the promoter regions.
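The search method is not described, so the sketch below is an assumed approach rather than the one used by GRID: it indexes every 8-mer, pairs identical seeds, extends each pair to the right, and keeps non-overlapping pairs whose gap is at most 200 bases. All names are hypothetical.

```python
from collections import defaultdict


def find_repeat_pairs(seq, min_len=8, max_gap=200):
    """Return sorted (first_start, second_start, length) for perfect repeat pairs."""
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        positions[seq[i:i + min_len]].append(i)

    pairs = set()
    for starts in positions.values():
        for idx, a in enumerate(starts):
            for b in starts[idx + 1:]:
                # Skip seeds that extend to the left; the earlier seed covers them.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                length = min_len
                # Extend to the right without letting the first copy run into the second.
                while (b + length < len(seq) and a + length < b
                       and seq[a + length] == seq[b + length]):
                    length += 1
                gap = b - (a + length)
                if 0 <= gap <= max_gap:
                    pairs.add((a, b, length))
    return sorted(pairs)
```

The pairwise comparison is quadratic in the number of occurrences of a seed, which is acceptable for a 1700-base window but would need rethinking for long sequences.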
TF Site Information Source The transcription factor information was obtained from the NCBI ftp site (under the repository directory) as a series of files containing the site names, their consensus binding sites, their publications and other information. The files were placed there by the IFTI (http://www.ifti.org/) and are periodically updated. In particular, the files tfsites-dynamic, tfsites-prolog and tfsites-sigscan were parsed into files for loading into mysql and into files for scanning with a regular-expression-based search method implemented in Perl. The parsing process identified site names reported to have more than one binding site and amended those names to make them unique. In addition, redundant sites sharing the same recognition consensus sequence were removed from the search for efficiency. Both redundancy types were loaded into tables in our local mysql database and can be queried using tools from: Utils. It is anticipated that a search method based on profile searching rather than consensus site matching would improve both sensitivity and selectivity, and such methods are currently being implemented.
TF Site Identification Each putative promoter region is scanned for all known transcription factor binding sites by matching each site's consensus sequence with a simple Perl regular expression. In this method, degenerate IUPAC nucleotide codes are expanded to their full set of possible bases and the search is performed using the constructed string. Additional methods that use profiles for scanning will soon be incorporated. For each promoter, each putative site is listed at all positions where a consensus match is found. Although very noisy, this analysis can identify groups of factor sites that co-occur in subsets of a list of input genes. The method benefits if genes from multiple species can also be compared. The scoring method used for CIM outputs assigns a value of 1 to those promoters that share a site. Please note that considerable enhancements could result from incorporating a similar-position and similar-neighbors analysis into this scoring algorithm, and some of these methods are currently being tested as well. This will likely result in a slightly different format for the data in the database, since each position for each site will need to be considered rather than simple presence or absence.
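The IUPAC expansion and presence/absence scoring can be illustrated with the Python sketch below (the original is implemented in Perl and is not shown). It assumes forward-strand scanning only, reports overlapping matches by using a lookahead, and uses hypothetical function names and data layouts.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Expand a consensus string such as AYYT into A[CT][CT]T."""
    return "".join(IUPAC[base] for base in consensus.upper())


def scan(seq, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]


def presence_matrix(promoters, sites):
    """Score 1 if a site occurs anywhere in a promoter, else 0.

    promoters: {promoter_name: sequence}; sites: {site_name: consensus}.
    """
    return {
        name: {site: int(bool(scan(seq, cons))) for site, cons in sites.items()}
        for name, seq in promoters.items()
    }
```

Scanning the reverse strand as well would require matching the reverse complement of each consensus; whether the original does this is not stated.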
TFSite Probabilities During the parsing of the TFSite information, an estimated probability of encountering each site by random chance in a sequence is also calculated. The probability of each single nucleotide was estimated from the composition of all extracted human putative promoter regions. The frequencies turned out to be very close to 0.25 for each nucleotide, so 0.25 was used as the estimate for each nucleotide. The site probability is simply the product of the individual position probabilities across the consensus sequence. For example, for a site AYYT, the probability would be 0.25 x 0.5 x 0.5 x 0.25 = 0.015625, since Y matches either C or T. Users can filter out sites expected to be observed frequently using the probability filter option available on many of the pages.
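The product rule above can be written as a short Python sketch. It assumes the uniform 0.25 base probability described in the text and the standard IUPAC degeneracy of each code; the function name is illustrative.

```python
# Number of bases each IUPAC code can stand for.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def site_probability(consensus, base_prob=0.25):
    """Probability of matching the consensus at one fixed position by chance."""
    p = 1.0
    for symbol in consensus.upper():
        p *= DEGENERACY[symbol] * base_prob
    return p
```

For AYYT this gives 0.25 x 0.5 x 0.5 x 0.25 = 0.015625. The value is a per-position probability; the expected number of chance hits in a promoter also depends on the sequence length.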
SNP Identification All genomically mapped SNPs that fall within each putative promoter region are reported. These may indicate positions of potential polymorphism that could impact promoter function in an allele-specific manner.
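Selecting the SNPs that fall inside a promoter window amounts to a simple interval test, sketched below in Python. The input layout, the 0-based half-open coordinates and the function name are assumptions for illustration, and the SNP identifiers in any usage are placeholders.

```python
def snps_in_window(snps, chrom, start, end):
    """Return (snp_id, offset_from_window_start) for SNPs inside the window.

    snps: iterable of (snp_id, chrom, pos) with 0-based positions.
    The window [start, end) is 0-based and half-open.
    """
    return [
        (snp_id, pos - start)
        for snp_id, snp_chrom, pos in snps
        if snp_chrom == chrom and start <= pos < end
    ]
```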
Rmsk Identification RepeatMasker elements are identified for each putative promoter region. These elements may contain putative regulatory sequences or other chromatin-affecting functionality. In addition, they are identified for PCR purposes.
Genscan Element Identification The program Genscan is used to identify putative promoter and gene information within each putative promoter region. This analysis reveals additional gene regions within a promoter element.