4.1 Statistical analyses
DiProGB provides two types of statistical analyses. First, the user can calculate mean values for any part of the sequence or for the complete sequence. This makes it possible, for example, to compare the mean stacking energy of a given feature (e.g. a gene) with the corresponding mean value of the whole genome. Second, in the position-specific statistics, selected subsequences are aligned relative to a specific sequence position and the corresponding mean values are calculated for each position in the alignment.
4.1.1 Statistics table
All statistical analysis features provided by DiProGB are available in the statistics table. It can be opened by pressing the Statistics table button in the Statistics menu (4). The table can be closed and reopened without losing its content. In the statistics table, each row stands for either a partial or a complete genome. The first three columns contain the sequence name, the total length over all subsequences and the number of subsequences used for calculating the statistics. All other columns contain either average values of dinucleotide properties or nucleotide contents. Clicking once or twice on a column header sorts the table entries in ascending or descending order, respectively. There are several ways to add a row or column to the table. To add the mean value of the dinucleotide property for the sequence graph shown in the main window, the user can press the Calculate->DiPro mean (selected sequence range) button. Other options are to use the To statistics button in the Popup menu (see section 2.1) or to add mean values for selected features via the Features list (see section 3.2) or the Marked regions list (see section 2.6). If it does not already exist, a new row for the selected sequence part or a new column for the active dinucleotide property is created. The calculated mean value is then displayed in the corresponding field. Additionally, for each row the average GC, purine and keto (G,T) content, as well as the content of the 4 nucleotides and the 16 dinucleotides, can be stored. Furthermore, expected values are available for all columns. The expected value for the A, C, G and T content is 0.25 each. Accordingly, the expected value for the GC, Y or keto content is 0.5, and it is 0.0625 for each of the 16 dinucleotides. The expected value of a dinucleotide property is the average of the corresponding 16 dinucleotide values. The dinucleotide property means can be normalized (to lie between zero and one) using the Normalized values button. Normalization makes different mean values easier to compare.
In addition, the statistical analyses called Position-specific statistics (see section 4.1.5) and Autostatistics (see section 4.1.2) can be performed from the Calculate menu. If a Position-specific statistics has been performed for a row, the Position-specific statistics diagram button is enabled when this row is selected (left mouse click). Clicking this button opens a new window showing the diagram of the Position-specific statistics for the selected row. The calculated numerical data can also be saved as a tab-separated text file for further statistical analyses (e.g. with Excel).
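The table values described above can be illustrated with a minimal Python sketch. It is not DiProGB code: the property values in TOY_PROPERTY are made up for illustration, the function names are hypothetical, and the min-max scaling used for normalization is an assumption about how values are mapped to the range between zero and one.

```python
from itertools import product

# All 16 dinucleotides in the order AA, AC, AG, ..., TT.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Hypothetical property values (NOT real measurements): AA=0, AC=1, ..., TT=15.
TOY_PROPERTY = {d: float(i) for i, d in enumerate(DINUCS)}


def observed_mean(seq, prop):
    """Mean property value over all overlapping dinucleotides of seq."""
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return sum(values) / len(values)


def expected_mean(prop):
    """Expected value: the average of the 16 dinucleotide values."""
    return sum(prop.values()) / len(prop)


def normalize(value, prop):
    """Assumed min-max scaling of a mean to the range [0, 1]."""
    lo, hi = min(prop.values()), max(prop.values())
    return (value - lo) / (hi - lo)


def gc_content(seq):
    """Fraction of G and C bases (expected value 0.5)."""
    return sum(base in "GC" for base in seq) / len(seq)


seq = "ACGTACGT"
obs = observed_mean(seq, TOY_PROPERTY)
print(obs, expected_mean(TOY_PROPERTY), normalize(obs, TOY_PROPERTY), gc_content(seq))
```

Comparing the observed mean with the expected mean shows whether a region is enriched in dinucleotides with high or low property values.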
4.1.2 Autostatistics
The Autostatistics option (Calculate->Autostatistics) is essentially a very simple scripting language for performing more customized statistical analyses for selected features, dinucleotide properties or genomes. It offers the following five commands, which can be combined in a script-like manner:
- Change sequence: Opens the next sequence file from the Sequence list (starting with the first one) or opens the specified sequence file.
- Change DiPro: Changes the currently selected dinucleotide property to the next one from the DiPro list (starting with the first one) or opens the specified dinucleotide property file.
- Calculate statistics: Calculates a statistic over the whole sequence (Full sequence) or for selected parts only (Selected parts). Filtering and selection are done by a case-sensitive keyword search within selected features and qualifiers. This search is essentially the same as in the Feature list (2.3). The search can be performed within a selected qualifier, in all features or in all fields. The user can customize the features and qualifiers that define the search space by pressing the Change selection button. To invert the search result, i.e. to obtain all regions not matching the search criteria, the is not button can be selected. For the features matching the search criteria, either one statistic per subsequence (One statistics for - each selected) or one statistic over all subsequences together (One statistics for - all parts selected) can be added. A name prefix can be chosen for the entry generated in the Statistics table. If one statistic for all parts matching the search criteria is selected, a Position-specific statistics can be added as well (see 4.1.5 for more information).
- Save/Delete statistics: Saves the calculated means displayed in the statistics table and then clears the table. The data are saved as a tab-separated text file with the specified path, filename and file extension. If this command is called more than once, the output files are numbered, starting with the specified number.
- Jump to ... : Jumps to a previous line in the script containing either command 1 (Change sequence) or command 2 (Change DiPro). This allows iterating through all selected sequence files or dinucleotide properties.
|Line Nr.||Command Nr.||Description|
|0||1||take next entry from sequence file list|
|1||2||take next entry from the dinucleotide property list|
|2||3||select whole sequence|
|3||3||add mean value for whole sequence|
|4||5||jump to script line Nr. 1 calling the next dinucleotide property|
|5||5||jump to script line Nr. 0 calling the next sequence file|
In this example mean values of all selected dinucleotide properties are calculated for all genomes of the selected sequence files. All mean values are added to the statistics table.
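The effect of the example script can be sketched in Python as two nested loops. This is only an illustration of the described result, not the Autostatistics interpreter itself: the function names, the in-memory dictionaries standing in for the sequence file list and the DiPro list, and the toy property values are all assumptions.

```python
from itertools import product


def mean_property(seq, prop):
    """Mean property value over all overlapping dinucleotides of seq."""
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return sum(values) / len(values)


def autostatistics(sequences, properties):
    """Build a table with one row per sequence and one column per property."""
    table = {}
    for seq_name, seq in sequences.items():            # like command 1: next sequence
        row = table.setdefault(seq_name, {})
        for prop_name, prop in properties.items():     # like command 2: next property
            row[prop_name] = mean_property(seq, prop)  # like command 3: add mean value
    return table


dinucs = ["".join(p) for p in product("ACGT", repeat=2)]
# Hypothetical properties for illustration only (not real DiProGB data).
properties = {
    "toy_index": {d: float(i) for i, d in enumerate(dinucs)},
    "toy_gc_count": {d: float(sum(b in "GC" for b in d)) for d in dinucs},
}
sequences = {"genome_A": "ACGTACGT", "genome_B": "GGGGCCCC"}

table = autostatistics(sequences, properties)
for name, row in table.items():
    print(name, row)
```

The two Jump to commands in the script correspond to the ends of the inner and outer loops.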
4.1.3 Mean values of dinucleotide properties
It is possible to calculate the actual (observed) mean of the dinucleotide property values encoding the sequence as well as the corresponding expected mean (the mean of the 16 dinucleotide property values). These two means are shown as horizontal lines in the graph if the buttons Exp. (expected mean) and Obs. (observed mean) are selected in the Statistics panel (4).
4.1.4 Randomized sequence graph
To compare a given sequence graph with the graph of a randomized sequence, the Random sequence button in the Statistics panel (4) can be used. This opens a new window showing a sequence graph of a randomized sequence. Randomization is done either by assuming an equal distribution of the four nucleotides or by retaining the nucleotide or the dinucleotide distribution of the original sequence. The random sequence graph is calculated and drawn with the same properties (amplitude, shifting window size, length, ...) as the original sequence graph. If these parameters are changed in the original graph, they are automatically changed in the randomized graph as well.
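The three randomization modes can be sketched as follows. The manual does not state how DiProGB implements them, so the approaches below are assumptions: a shuffle keeps the nucleotide composition exactly, while the first-order Markov chain only approximates the dinucleotide distribution of the original sequence. All function names are hypothetical.

```python
import random
from collections import defaultdict


def random_uniform(length, rng):
    """Equal distribution: each base is drawn with probability 0.25."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def shuffle_nucleotides(seq, rng):
    """Retain the nucleotide distribution by shuffling the original bases."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def markov_dinucleotide(seq, rng):
    """Approximate the dinucleotide distribution with a first-order Markov chain."""
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)        # observed followers of each base
    out = [seq[0]]
    while len(out) < len(seq):
        # Fall back to any base of seq if the last base has no observed follower.
        choices = successors.get(out[-1]) or list(seq)
        out.append(rng.choice(choices))
    return "".join(out)


seq = "ACGTACGGTTCA"
rng = random.Random(0)  # fixed seed so that runs are reproducible
print(random_uniform(len(seq), rng))
print(shuffle_nucleotides(seq, rng))
print(markov_dinucleotide(seq, rng))
```

An exact dinucleotide-preserving shuffle is more involved than this chain; the sketch is only meant to show the difference between the three null models.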
4.1.5 Position-specific statistics (PSS)
This is a sequence-logo-like statistic. For a selected set of subsequences that are aligned at a pre-selected position, the mean values for each position are calculated and can be plotted in a diagram.
Like all other results of statistical analyses, the data generated by a Position-specific statistics are stored and displayed in the Statistics table (see section 4.1.1). Examples of PSS applications:
|Open in||Calculate PSS for||Possible application|
|Statistics table (4.1.1)||All features of selected classes (e.g. all genes, all CDS, ...).||PSS for all genes or exon start / end positions.|
|Autostatistics (4.1.2)||Features obtained from a search in one or multiple sequences||PSS for all genes of ribosomal proteins for several organisms|
|Feature list (2.3)||Selected features||PSS for all CDS of a gene group sharing the same name prefix|
|Marked regions (2.6)||Selected subsequences||PSS over all marked genome regions found by motif or repeat search|
When performing a PSS, first the sequences to be aligned must be selected. In a next step either the start or the stop position can be selected as reference for the alignment. In addition the length of flanking regions and a descriptive title can be chosen. Then the statistical analysis is performed and the corresponding entry appears in the Statistics table.
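The alignment and averaging step can be sketched in Python for the default case of nucleotide contents. This is an illustrative sketch under assumptions, not DiProGB's implementation: the function name, the 0-based anchor positions and the toy genome are hypothetical, and only the forward strand is considered.

```python
from collections import Counter


def position_specific_stats(genome, anchors, left_flank, right_flank):
    """Nucleotide content per aligned position around the anchor positions.

    anchors are 0-based reference positions (e.g. feature starts); the result
    has one entry per offset from -left_flank to +right_flank.
    """
    columns = []
    for offset in range(-left_flank, right_flank + 1):
        bases = [genome[a + offset] for a in anchors
                 if 0 <= a + offset < len(genome)]
        counts = Counter(bases)
        total = len(bases)
        columns.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return columns


# Toy genome; every occurrence of ATG is used as an alignment anchor.
genome = "CCATGAACCATGTTGGATGCA"
anchors = [i for i in range(len(genome) - 2) if genome.startswith("ATG", i)]

pss = position_specific_stats(genome, anchors, left_flank=1, right_flank=2)
for offset, column in zip(range(-1, 3), pss):
    print(offset, column)
```

For dinucleotide properties, the same loop would average the property value at each aligned position instead of counting bases.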
When a PSS entry in the Statistics table is selected, the Position-specific statistics diagram button is enabled. This diagram shows the A, T, G and C content (by default) for each aligned position. The list on the right side shows all properties that can be displayed in the diagram; a property is displayed if the checkbox in the first column is checked. With the Filter button above the list, the user can change the content of the list by adding or removing properties, e.g. all dinucleotide properties or the dinucleotide or trinucleotide contents.
The second column of the list displays the diagram color of each property. This color can be changed by right-clicking on the corresponding color. With the y_max text field above the diagram, the user can restrict the maximum of the y-axis.
The menu allows the PSS to be saved either as a data file or as a picture of the diagram. With the option Display->Graph features, the appearance of the diagram can be customized.
4.2 Motif search
DiProGB offers a tool for motif search. The search can be done both in the primary nucleotide sequence and in the sequence graph. The motif search is started via the menu (6) Tools->Search->Motifs. A motif sequence can be selected by uploading it from a file, by marking a sequence in the sequence graph and copying it into the Motif Search window, or by typing it directly into that window. For a search in the letter-based nucleotide sequence, the motif can also contain letters other than A, T, G and C. These non-standard nucleotides are translated according to the PHYLIP translation table (shown in section 1.1).
After pressing the Start search button, a new window with various parameter options appears. In that window the user can decide whether the search is performed for the positive strand, the negative strand (complementary motif) or both strands. It is also possible to distinguish between linear and circular sequences and between the search for direct motifs, reverse motifs or both. Finally, one can choose the required percentage of sequence identity.
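A search with a minimum percentage of sequence identity can be sketched as a simple sliding-window comparison. The sketch below is an assumption about the general idea, not DiProGB's code: the function name and parameters are hypothetical, positions are 0-based, and the strand and orientation options would be handled by searching a complemented or reversed copy of the motif in the same way.

```python
def find_motif(seq, motif, min_identity=100.0, circular=False):
    """Return (first, last, identity in %) for every window matching the motif.

    For a circular sequence the text is extended by the first len(motif) - 1
    bases, so hits may run past the end of the linear sequence.
    """
    m = len(motif)
    text = seq + seq[:m - 1] if circular else seq
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        matches = sum(a == b for a, b in zip(window, motif))
        identity = 100.0 * matches / m
        if identity >= min_identity:
            hits.append((start, start + m - 1, identity))
    return hits


print(find_motif("AAATGCAAA", "ATGC"))                # exact hit
print(find_motif("AAATGGAAA", "ATGC", 75.0))          # one mismatch allowed
print(find_motif("GCAAAT", "ATGC", circular=True))    # hit across the origin
```

This naive scan needs time proportional to the sequence length times the motif length, which is acceptable for a sketch but slower than dedicated string-matching algorithms.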
The motif search option is also available for the sequence graph. In this case the motif is translated into a so-called motif graph. The motif graph is then searched within the sequence graph, applying a shifting window of size one to both. The dinucleotide property values of the two graphs can also be binned to reduce the number of different values from 16 to the number of bins. For 100% match identity the search is done in linear time using the Z algorithm of Gusfield (Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, 1997, pp. 7).
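The exact search with the Z algorithm can be sketched as follows. The code is a generic textbook-style version, not taken from DiProGB; the function names and the use of None as a separator are choices made for this example. Because it works on lists of arbitrary items, it applies equally to letters and to binned property values.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])   # reuse earlier comparisons
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1                            # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def exact_matches(text, pattern):
    """0-based start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    # None acts as a separator that never equals a letter or a bin number.
    combined = list(pattern) + [None] + list(text)
    z = z_array(combined)
    return [i - m - 1 for i in range(m + 1, len(combined)) if z[i] == m]


print(exact_matches("ATGCATGATGC", "ATG"))   # search in the letter sequence
print(exact_matches([1, 2, 3, 1, 2], [1, 2]))  # search in binned graph values
```

Each position is compared explicitly at most a constant number of times on average, which is what gives the linear running time.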
For both search options, hits are shown in the right column, indicating the first and the last sequence position in the genome and the fraction of matching positions in %. When one or more motif entries in the hit list are selected (left mouse click), the locations of the motifs are marked in the sequence graph by vertical red lines. The motif search results can be either added to a GenBank file or written to a separate text file. The motif search can also be started by retrieving a subsequence with the Get sequence button in the popup menu and pressing the Search button in the window displaying the selected sequence.
4.3 Repeat search
Repeats play an important role in the assembly and evolution of whole genomes, in the binding of DNA-binding proteins, and generally in gene regulation. Their sizes range from a few bases up to several megabases. In humans, for example, nearly 30 hereditary disorders result from an increase in the number of copies of simple repeats in genomic DNA. These DNA repeats have unusual structural features that disrupt the cellular replication, repair and recombination machineries (review article: Expandable DNA repeats and human disease, Nature 447, 932-940 (2007)).
Larger repeats are often degenerate and ‘hidden’ in the corresponding DNA properties, i.e. the repeat only becomes visible when it is encoded with the right DNA property. DiProGB also makes it possible to find such ‘hidden’ repeats. Larger repeats can often be found visually in the corresponding sequence graph. The existence of such repeats can be confirmed by using the implemented tools for motif search or the repeat search algorithms. The repeat search tool in DiProGB focuses on the search for supermaximal repeats, a widely used definition of direct repeats.
Definition: A maximal repeat m in a string S is a substring of S such that substrings amb and cmd occur in S for some a != c AND b != d (a,b,c,d are letters from the used alphabet; at most four copies of any maximal repeat). A maximal repeat can be a proper substring of another maximal repeat. A supermaximal repeat is a maximal repeat that does not occur as a substring of any other maximal repeat.
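The definition can be made concrete with a brute-force Python sketch for short strings. It deliberately does not use suffix arrays and is far from linear time; it relies on the observation that a repeated substring is supermaximal exactly when none of its one-letter extensions (to the left or to the right) is itself repeated. The function name is hypothetical.

```python
from collections import defaultdict


def supermaximal_repeats(s):
    """Map each supermaximal repeat of s to its 0-based start positions."""
    positions = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            positions[s[i:j]].append(i)   # record every substring occurrence
    repeated = {w: p for w, p in positions.items() if len(p) >= 2}

    alphabet = set(s)
    result = {}
    for w, starts in repeated.items():
        # If some one-letter extension is repeated, w lies inside a longer repeat.
        extendable = any((a + w) in repeated or (w + a) in repeated
                         for a in alphabet)
        if not extendable:
            result[w] = starts
    return result


print(supermaximal_repeats("ACGTTACGTA"))
```

In the example string, ACGT and TA are the only supermaximal repeats; shorter repeats such as CG or T all occur inside one of them.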
A linear space and time algorithm based on suffix arrays (M.I. Abouelhoda et al., Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms 2, 2004, 53-86) is implemented for the repeat search. It is started with the Tools->Search->Repeats option in the main window menu (6). (Super)maximal repeats can be searched either in the primary sequence or in the sequence graph. In the latter case the shifting window size (to smooth the sequence graph) and the number of bins (to restrict the number of different values) can be specified. The results are shown in a table. The first column shows the repeat motif (the nucleotide sequence or, for a search on the sequence graph, the set of bins). The second column contains a score calculated as the product of the repeat motif size and the number of occurrences in the specified range. The third column contains the motif length and the fourth column the number of occurrences (in the searched string). The fifth column indicates the % sequence similarity between the motifs of one repeat. It is always 100% for a search in the letter-based sequence but typically lower for a repeat search in the sequence graph. The sixth column (Start positions) displays the first positions of all motifs of one repeat. The seventh column (Copy distances) lists all distances between two adjacent copies of one repeat. A distance is measured as the starting position of the second copy minus the end position of the first copy. This means that overlapping copies have a negative distance. The eighth column shows the repeat type.
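The derived table columns can be sketched in a few lines of Python. The function and key names are hypothetical, and one detail is an assumption: the end position of a copy is taken as the position just after its last base, so that directly adjacent copies have distance 0 and overlapping copies a negative distance.

```python
def repeat_row(motif, starts):
    """Score, length, occurrences and copy distances for one repeat."""
    length = len(motif)
    starts = sorted(starts)
    # Assumed convention: end position = start + length (one past the last base).
    distances = [b - (a + length) for a, b in zip(starts, starts[1:])]
    return {
        "motif": motif,
        "score": length * len(starts),   # motif size times number of occurrences
        "length": length,
        "occurrences": len(starts),
        "start_positions": starts,
        "copy_distances": distances,
    }


print(repeat_row("ACGT", [0, 5]))   # copies separated by one base
print(repeat_row("ATA", [0, 2]))    # overlapping copies, negative distance
```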
DiProGB can search for different types of repeats:
|Forward repeat||ATGC … ATGC|
|Reverse repeat||ATGC … CGTA|
|Complemented repeat||ATGC … TACG|
|Reverse complemented repeat||ATGC … GCAT|
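The four repeat types in the table can be expressed as simple string transformations. The following Python sketch uses a hypothetical helper that names the relation between two copies; for some motifs several types coincide (for ATAT, the reverse and the complement are both TATA), in which case the first matching type is reported.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_type(first, second):
    """Name the relation between two repeat copies, or None if unrelated."""
    complement = first.translate(_COMPLEMENT)
    if second == first:
        return "Forward repeat"
    if second == first[::-1]:
        return "Reverse repeat"
    if second == complement:
        return "Complemented repeat"
    if second == complement[::-1]:
        return "Reverse complemented repeat"
    return None


for copy in ("ATGC", "CGTA", "TACG", "GCAT"):
    print("ATGC ...", copy, "->", repeat_type("ATGC", copy))
```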
To get the nucleotide sequence of a repeat the user can double right click on the corresponding entry in the first column. The sequences can then be used to perform a motif search with a similarity < 100%. As in the motif search, selection of entries of the results table (left mouse click) indicates the location of the repeats in the sequence graph by vertical lines. The user can save the results into a text file by clicking on Save.
DiProGB also allows the search for ‘degenerated repeats’, i.e. repeats with similarity < 100% (due to previous mutations). It is based on the idea that such mutated repeats still contain groups of smaller exact repeats.
a1   b1      a2    b2   (Motif start positions)
ATGxxTCT ... ATGxxxTCT
The user has to specify two parameters:
1. Max. difference between 2 first positions: |a1-b1|
2. Max. difference between 2 distances: ||a1-a2|-|b1-b2||
The corresponding repeat type is called ‘cluster’ and can best be seen after pressing the button "Open cluster in own window". This table contains one cluster per row. The clusters can again be marked on the sequence graph by clicking the corresponding row.
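The two clustering parameters can be illustrated with a small Python check. This is only a sketch of the stated criteria, not DiProGB's clustering procedure: the function name is hypothetical, each exact repeat is reduced to the start positions of its two copies, and the example string merely mimics the ATGxxTCT ... ATGxxxTCT pattern.

```python
def can_cluster(a, b, max_first_diff, max_distance_diff):
    """Check whether two exact repeats a=(a1, a2) and b=(b1, b2) may be grouped.

    a1, b1 are the start positions of the first copies; a2, b2 those of the
    second copies.
    """
    a1, a2 = a
    b1, b2 = b
    first_diff = abs(a1 - b1)                         # |a1-b1|
    distance_diff = abs(abs(a1 - a2) - abs(b1 - b2))  # ||a1-a2|-|b1-b2||
    return first_diff <= max_first_diff and distance_diff <= max_distance_diff


# Toy sequence: ATG..TCT followed later by ATG...TCT (one extra base in between).
s = "ATGccTCTggggggATGcccTCT"
atg = (s.find("ATG"), s.find("ATG", 1))   # start positions of both ATG copies
tct = (s.find("TCT"), s.find("TCT", 6))   # start positions of both TCT copies

print(atg, tct)
print(can_cluster(atg, tct, max_first_diff=10, max_distance_diff=2))
print(can_cluster(atg, tct, max_first_diff=10, max_distance_diff=0))
```

With a tolerance of 0 for the distance difference, the single inserted base prevents the two exact repeats from being grouped.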
4.4 Fourier analysis
Fourier analysis is a tool to identify periodicities in a signal. Several periodicities that are related to different biological properties (e.g. codon structure, DNA-histone binding, ...) have already been described. DiProGB uses the Fast Fourier Transform (FFT) to calculate the spectrum for a given part of the sequence graph. After opening the tool (main menu (6) Tools->Fast Fourier Transf), the FFT is performed and by default the frequency spectrum for the first 1024 (=2^10) values of the displayed sequence graph is shown. The sample length of the data taken for the FFT can be changed by the user (it must be a power of two). The chosen part of the sequence graph is indicated by a horizontal red line (FFT sample) below the graph in the main window and always starts with the first value. The shifting window size is by default taken from the main window. The FFT is automatically recalculated if the user changes the shifting window size in the main window. The user can also enter a new shifting window size that is applied before the FFT and fix this value with the fixed button. The diagram offers two alternative displays of the same information: the Frequency spectrum and the Length spectrum. The latter shows the periodicity intensities for the different lengths contained in the chosen sequence. The Frequency spectrum shows the same peaks, but on an x-axis where the lengths are divided by the total length of the chosen sequence (‘frequency’). In each case, the maximum amplitude is shown in the upper right corner of the diagram. When the fraction of the genome shown in the main window is changed, both diagrams are updated automatically. This makes it possible to compare the spectra of different parts of the genome or to screen a whole genome for a particular periodicity. Moreover, the left and right mouse buttons zoom in and out of the diagram, respectively.
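The relation between the two displays can be sketched with a small radix-2 FFT in pure Python. This is an illustrative sketch, not DiProGB's implementation: the function names are hypothetical, subtracting the mean before the transform is an assumption, and the axes are assumed to be k/N (cycles per position) for the frequency display and N/k (period length in positions) for the length display.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("sample length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectra(values):
    """Return (frequency, period length, power) for k = 1 .. N/2."""
    n = len(values)
    mean = sum(values) / n
    coeffs = fft([v - mean for v in values])   # remove the constant offset
    return [(k / n, n / k, abs(coeffs[k]) ** 2) for k in range(1, n // 2 + 1)]


# Toy signal with a period of 8 positions, sampled at 64 positions.
signal = [math.sin(2 * math.pi * i / 8) for i in range(64)]
peak = max(spectra(signal), key=lambda row: row[2])
print("frequency:", peak[0], "period length:", peak[1])
```

A codon-related periodicity would show up in the same way as a peak near a period length of 3 positions.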
For further analysis the FFT spectrum, comprising all calculated complex values, can be saved in a file (Save as -> Data file -> FFT spectrum (all complex values)). The values of the currently shown diagram can be saved using Save as -> Data file -> Power spectrum (shown diagram) and a picture of the diagram can be generated using Save as -> Bitmap.