Extended Analysis Tools

BOLD includes core and extended tools to analyze specimen and sequence data.

The most significant improvement to the Extended Analysis Tools in version 3.6 of BOLD is that the Alignment Browser now allows users to edit and save sequences.

New functionality: Complete Deletion

BOLD now handles Complete Deletion for ambiguous bases and gaps. This option is available on the parameters page for the Distance Summary, Alignment Viewer, and Barcode Gap Summary.

Analysis Parameters for Complete Deletion

Illustration of the new deletion handling option on an analysis parameter page.
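As a minimal sketch of what Complete Deletion means, the following Python example removes every alignment column in which any sequence has a gap or an ambiguous base. It assumes the conventional definition of complete deletion; the function name and the choice to treat every symbol other than A, C, G, and T as ambiguous are illustrative assumptions, not BOLD's internal implementation.

```python
# Assumption: only A, C, G and T count as unambiguous; gaps, N and other
# IUPAC codes all cause the column to be dropped.
UNAMBIGUOUS = set("ACGT")


def complete_deletion(alignment):
    """Drop every column that has a gap or ambiguous base in any sequence.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if every sequence has an unambiguous base there.
    keep = [
        i for i in range(length)
        if all(seq[i].upper() in UNAMBIGUOUS for seq in alignment)
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


aligned = ["ACGTAC", "AC-TAC", "ACGNAC"]
print(complete_deletion(aligned))  # ['ACAC', 'ACAC', 'ACAC']
```

In this example the third column is removed because of the gap and the fourth because of the N, so all remaining comparisons use the same set of sites.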

Email Analysis results to allow for parallel workflows

To run multiple analyses in parallel, use the option to have the results emailed when each analysis is finished. Results can be stored for up to 4 weeks, saved for future comparison, and shared with collaborators as links. This option is available on the parameters page for most of the BOLD analysis tools.

Illustration of the option to email results from analysis on the tool parameters page

Publishing Results

When the “Expand” icon appears next to a graph, the graph can be expanded to a larger, higher-quality version suitable for use in publications.

Illustration of the Expand icon to the right of a graph.

Distance Summary

It is desirable for barcodes to show very low sequence divergence within a species, with significantly higher sequence divergence at higher taxonomic levels. The Distance Summary tool gives a report of sequence divergence between barcode sequences at the conspecific and congeneric levels.

Parameters

Various distance models and alignment algorithms are available as parameters, as well as options to filter out sequences based on sequence length or sequence issues.

Results

Comparisons are performed between the given taxonomic levels, with the frequency plotted as shown below. One of the visualizations is normalized by species to remove the effect of sampling bias. Details for the comparisons done at the level of species, genus, and family are available by clicking on the links in the top right corner.

Distance Summary page
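The kind of comparison this tool reports can be sketched in Python as follows. The example computes an uncorrected p-distance for every pair of aligned sequences and groups the values into conspecific and congeneric comparisons. The record keys (`species`, `genus`, `seq`), the function names, the placeholder taxon names, and the use of p-distance rather than one of BOLD's selectable distance models are assumptions made for illustration.

```python
from itertools import combinations

VALID = set("ACGT")


def p_distance(a, b):
    """Proportion of differing sites, skipping sites where either
    sequence has a gap or an ambiguous base."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in VALID and y in VALID:
            compared += 1
            differing += x != y
    return differing / compared if compared else None


def distance_summary(records):
    """Summarize pairwise distances within species and within genus.

    `records` is a list of dicts with the hypothetical keys
    'species', 'genus' and 'seq' (an aligned sequence).
    """
    levels = {"within species": [], "within genus": []}
    for r1, r2 in combinations(records, 2):
        d = p_distance(r1["seq"], r2["seq"])
        if d is None:
            continue
        if r1["species"] == r2["species"]:
            levels["within species"].append(d)
        elif r1["genus"] == r2["genus"]:
            # Same genus, different species: a congeneric comparison.
            levels["within genus"].append(d)
    return {
        level: {"n": len(ds), "min": min(ds),
                "mean": sum(ds) / len(ds), "max": max(ds)}
        for level, ds in levels.items() if ds
    }


records = [
    {"species": "Aus bus", "genus": "Aus", "seq": "ACGTACGTAC"},
    {"species": "Aus bus", "genus": "Aus", "seq": "ACGTACGTAT"},
    {"species": "Aus cus", "genus": "Aus", "seq": "ACGTTCGAAT"},
]
for level, stats in distance_summary(records).items():
    print(level, stats)
```

A well-behaved barcode dataset would show a much lower mean in the "within species" row than in the "within genus" row.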


Sequence Composition

The frequency of DNA bases, with emphasis on GC-content, can be a useful metric for evolutionary biologists. For example, GC-content within the barcoding region of CO1 has been correlated with GC-content of the entire mitochondrial genome for many species.

Parameters

Various distance models and alignment algorithms are available as parameters, as well as options to filter out sequences based on sequence length or sequence issues. By default, GC percentages are calculated for the overall sequence composition as well as for codon positions 1, 2, and 3, but any of these may be deselected if desired.

Results

The results page provides statistics on the frequency of each base (G, C, A and T) in the selected records and can display histograms for GC content on all codon positions.

Sequence Composition results page
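A minimal Python sketch of the underlying calculation is given below. It reports GC percentage for a whole sequence and for each codon position, assuming the sequence is ungapped and starts at the first position of a codon; the function names and the `frame_offset` parameter are hypothetical and only serve the illustration.

```python
def gc_percent(seq):
    """GC percentage among unambiguous bases, or None if there are none."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return None
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def gc_by_codon_position(seq, frame_offset=0):
    """GC percentage overall and at codon positions 1, 2 and 3.

    Assumes `seq` is ungapped and that the reading frame starts at
    index `frame_offset` (an illustrative parameter).
    """
    coding = seq[frame_offset:]
    result = {"overall": gc_percent(coding)}
    for pos in range(3):
        # Every third base, starting at this codon position.
        result[f"codon position {pos + 1}"] = gc_percent(coding[pos::3])
    return result


for name, value in gc_by_codon_position("ATGGCCAAT").items():
    print(f"{name}: {value:.1f}%")
```

For the nine-base example, four of nine bases are G or C overall, while the third codon position has two of three.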


Barcode Gap Summary

The Barcode Gap Summary presents users with an examination of the distance to the nearest neighbour for each of the species in the list of selected specimens.

Parameters

Various distance models and alignment algorithms are available as parameters, as well as options to filter out sequences based on sequence length or sequence issues.

Results

Distances are highlighted if the nearest neighbour is less than 2% divergent, or when the distance to the nearest neighbour is less than the intra-specific distance. Warnings presented by this tool may be summarized by clicking on the link in the top right corner of the Barcode Gap results page.

Nearest Neighbour Barcode Gap Analysis Results
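The highlighting rule described above can be sketched in Python as follows. For each species, the example finds the maximum intra-specific distance and the distance to the nearest neighbour in another species, then flags the species when either condition from the Results text holds. The input structures, function name, specimen identifiers, and distance values are hypothetical; the 2% threshold comes from the description above.

```python
def barcode_gap(species_of, distances, threshold=0.02):
    """Per-species maximum intra-specific and nearest-neighbour distance.

    `species_of` maps a specimen id to its species name.
    `distances` maps a pair (id_a, id_b) to a pairwise distance.
    """
    report = {
        sp: {"max_intra": None, "nn_distance": None, "nn_species": None}
        for sp in set(species_of.values())
    }
    for (a, b), d in distances.items():
        sp_a, sp_b = species_of[a], species_of[b]
        if sp_a == sp_b:
            row = report[sp_a]
            if row["max_intra"] is None or d > row["max_intra"]:
                row["max_intra"] = d
        else:
            # The pair is a nearest-neighbour candidate for both species.
            for sp, other in ((sp_a, sp_b), (sp_b, sp_a)):
                row = report[sp]
                if row["nn_distance"] is None or d < row["nn_distance"]:
                    row["nn_distance"] = d
                    row["nn_species"] = other
    for row in report.values():
        nn, intra = row["nn_distance"], row["max_intra"]
        # Flag when the neighbour is closer than the threshold, or closer
        # than the most divergent member of the same species.
        row["flagged"] = nn is not None and (
            nn < threshold or (intra is not None and nn < intra)
        )
    return report


species_of = {"s1": "Aus bus", "s2": "Aus bus", "s3": "Aus cus", "s4": "Aus dus"}
distances = {
    ("s1", "s2"): 0.004, ("s1", "s3"): 0.015, ("s2", "s3"): 0.018,
    ("s1", "s4"): 0.090, ("s2", "s4"): 0.095, ("s3", "s4"): 0.100,
}
for species, row in sorted(barcode_gap(species_of, distances).items()):
    print(species, row)
```

Species represented by a single specimen have no intra-specific distance, so in this sketch they can only be flagged by the threshold rule.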


Accumulation Curve

An Accumulation Curve of standardized DNA barcodes and related features provides a clear, transparent, and reproducible estimate of the diversity and sampling efficiency of areas or collections. This tool also allows users to quickly compare sampling efficiency at multiple regions by multiple taxonomic levels.

Parameters

Each curve is a plot of the number of species, genera, subfamilies, and/or BINs as a function of the number of samples. The Extra Info field can also be plotted, for example to graph morphotypes. As the tool allows for multiple graphs, it can help a researcher determine which geographic regions are producing fewer new groups (creating multiple graphs by country, province, or region), or which taxonomic group is plateauing (creating multiple graphs by phylum, class, order, family, or subfamily). The Extra Info field can also be used to investigate the efficiency of sampling protocols, progress in FAO regions, etc.

Sampling order can be randomized. For a large dataset, a higher degree of smoothing through more iterations may be optimal; however, this will take longer to calculate. Order of submission can also be chosen to visualize the impact of sampling efforts.

Results

A steep slope indicates that a large fraction of the diversity remains to be discovered. A curve that flattens to the right indicates that a reasonable number of individual samples have been collected and that more intensive sampling is likely to yield only a few additional groups.

Accumulation Curve results page
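The effect of randomized sampling order and of the number of iterations can be sketched in Python as follows. The example shuffles the samples repeatedly, counts how many distinct groups have been seen after each sample, and averages the counts across iterations. The function name, the `seed` parameter, and the sample labels are assumptions for illustration; BOLD's own smoothing procedure may differ.

```python
import random


def accumulation_curve(labels, iterations=100, seed=None):
    """Mean number of distinct groups seen after each sample.

    `labels` holds one group label (e.g. a species or BIN) per sample.
    The sampling order is reshuffled on every iteration and the running
    counts are averaged, which smooths the curve.
    """
    rng = random.Random(seed)
    totals = [0] * len(labels)
    for _ in range(iterations):
        order = list(labels)
        rng.shuffle(order)
        seen = set()
        for i, label in enumerate(order):
            seen.add(label)
            totals[i] += len(seen)
    return [t / iterations for t in totals]


samples = ["sp1"] * 6 + ["sp2"] * 3 + ["sp3"]
curve = accumulation_curve(samples, iterations=500, seed=1)
print([round(v, 2) for v in curve])
```

The first point is always 1 and the last point always equals the total number of groups; more iterations give a smoother curve in between at the cost of longer computation.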


Alignment Browser

Managing sequence alignments and base calls is a critical step in any barcode analysis. To avoid the inconvenience of importing sequences into third-party software for analysis and editing, BOLD provides an integrated alignment browser that includes many features popular in other packages. Multiple alignment options, such as the MUSCLE and Kalign algorithms, as well as colourization options, are also available.

Sequence Editing

In the newest version of BOLD, the updated Alignment Browser supports direct editing to the database. Users can select sequences or single bases then right click to see editing options. Once edited, the entire session can be submitted to upload the edited sequences to their records.

Parameters

Various distance models and alignment algorithms are available as parameters, as well as options to filter out sequences based on sequence length or sequence issues.

Alignment Browser page


Diagnostic Characters

The Diagnostic Character analysis provides a means to examine nucleotide or amino acid polymorphism between sets of sequences that are grouped by taxonomic or geographic labels. More specifically, this tool identifies consensus bases from each group, compares them to those from the remaining sequences in other groups, then characterizes how unique each consensus base is. The purpose of this tool is to categorize consensus bases by their diagnostic potential, using the following categories:

Characterizations in Diagnostic Characters tool (* base is either nucleotide or a residue)
Abbreviation	Name	Meaning
D	Diagnostic	At this position in the MSA, the base* is found only in one group.
DP	Diagnostic or Partial	Due to the presence of ambiguous base(s) in other groups, this base* may be classified as P if the same character also appears in some but not all sequences in other groups OR D if it does not appear at all.
P	Partial Character	At this position in the MSA, this base is found in all sequences in this one group, however it is also found in some but not all sequences in other groups.
PU	Partial or Uninformative Character	Due to the presence of both ambiguous bases and this base of interest in all sequences in at least one other group, this base* can be either partial or uninformative, depending on how many ambiguous bases in the other group are truly the same as the base in question.
I	Invalid Character	Only ambiguous bases are present in all sequences in other groups. Since D, P and U are all possible, nothing can be said about this base, hence it is declared as invalid.
U	Uninformative Character	More than one group shares this consensus base. This base holds no diagnostic ability and cannot be used in any subsequent diagnoses.

Parameters

Since this tool only performs the analysis on the set of sequences selected by the user, the result is greatly affected by the initial data and the analysis parameters. Even the smallest change in the initial sequences, filtering options, or the analysis parameters can cause the consensus sequences in each group and hence the diagnostic potential to be different between analyses. As a result, the interpretation of each analysis is absolutely dependent on all the factors combined. In general, having more sequences per group will provide a more accurate diagnosis of each group, as it reduces the problem caused by small sample size.

Algorithm

Sequence alignment of all the sequences serves as the starting point of this analysis. The alignment algorithm is one of the options the user can specify.
Based on the grouping attribution, sequences are separated into various sets.
Consensus sequences within each sequence set are collected.
For each group, the consensus bases are examined one by one and compared to the bases found in all the remaining sequences. Based on the number and percentage of occurrences of the consensus base in other groups (see table above for definitions), the diagnostic potential of that base for the current group is determined.
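The steps above can be sketched in Python for the three unambiguous categories (D, P, and U). This is a simplified illustration under stated assumptions: a consensus base is taken to be a base shared by all sequences in a group at that position, any position containing a gap or ambiguous base is skipped (so DP, PU, and I are not handled), and positions are reported as zero-based indexes. The function names and example sequences are hypothetical and do not represent BOLD's implementation.

```python
VALID = set("ACGT")


def classify_position(own_column, other_columns):
    """Classify one alignment position for the focal group.

    `own_column` lists the focal group's bases at this position;
    `other_columns` maps each other group to its bases at this position.
    Returns (consensus_base, code) or None when no call is made.
    """
    everything = list(own_column) + [
        b for col in other_columns.values() for b in col
    ]
    if any(b not in VALID for b in everything):
        return None  # ambiguity handling (DP, PU, I) is outside this sketch
    if len(set(own_column)) != 1:
        return None  # no base shared by every sequence in the focal group
    base = own_column[0]
    if any(all(b == base for b in col) for col in other_columns.values()):
        return base, "U"  # another group has the same consensus base
    if any(base in col for col in other_columns.values()):
        return base, "P"  # seen in some, but not all, sequences elsewhere
    return base, "D"      # found only in the focal group


def diagnostic_characters(groups):
    """`groups` maps a group label to a list of aligned sequences."""
    length = len(next(iter(groups.values()))[0])
    results = {}
    for name, seqs in groups.items():
        calls = {}
        for i in range(length):
            others = {
                g: [s[i] for s in ss]
                for g, ss in groups.items() if g != name
            }
            call = classify_position([s[i] for s in seqs], others)
            if call:
                calls[i] = call
        results[name] = calls
    return results


groups = {"A": ["ACGT", "ACGA"], "B": ["TCGT", "TCAT"]}
for name, calls in diagnostic_characters(groups).items():
    print(name, calls)
```

In this example, position 0 is diagnostic for both groups, position 1 is uninformative because both groups share C, and the remaining calls are partial because the base also appears in one sequence of the other group.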

Diagnostic Characters examples

Diagnostic Characters results page


BIN Discordance Report

The Barcode Index Number (BIN) module analyzes new COI sequences and assigns them to an existing or a new BIN. Please visit the BIN documentation for more details. Besides generating BIN pages, this system acts as a rapid check of the validity of taxonomic designation on specimen records.

The BIN Discordance Report facilitates this check by comparing the taxonomy on selected records against all others in the BINs they are associated with.

Results

The results are sorted by the degree of conflict, displaying those records in BINs where there is a phylum level conflict first (likely the result of cross-contamination) down to species level conflicts. Users can select and retrieve records from this page to examine ancillary data, comment, tag, or edit the taxonomy where there is a confirmed error.

The report also lists records that are in BINs that contain no taxonomic discordance (see the Concordant BINs tab in the results page), as well as records that are in BINs that contain no other sequences (see the Singletons tab).

BIN Discordance Report page
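The sorting and grouping described above can be sketched in Python as follows. The example groups records by BIN, finds the highest taxonomic rank at which the members of each BIN disagree, and separates discordant BINs, concordant BINs, and singletons. The record keys, rank list, function name, BIN identifiers, and taxon names are hypothetical, and the sketch treats the supplied records as the full membership of each BIN, whereas BOLD compares the selected records against all other records in the BIN.

```python
# Ranks ordered from the most severe conflict to the least severe.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def bin_discordance(records):
    """Group records by BIN and report the highest rank in conflict.

    Each record is a dict with a 'bin' key plus any of the RANKS keys.
    """
    by_bin = {}
    for record in records:
        by_bin.setdefault(record["bin"], []).append(record)

    report = {"discordant": [], "concordant": [], "singletons": []}
    for bin_id, members in by_bin.items():
        if len(members) == 1:
            report["singletons"].append(bin_id)
            continue
        # First rank at which more than one distinct name is present;
        # missing names are ignored rather than counted as conflicts.
        conflict = next(
            (rank for rank in RANKS
             if len({m.get(rank) for m in members if m.get(rank)}) > 1),
            None,
        )
        if conflict is None:
            report["concordant"].append(bin_id)
        else:
            report["discordant"].append((bin_id, conflict))
    # Phylum-level conflicts first, species-level conflicts last.
    report["discordant"].sort(key=lambda item: RANKS.index(item[1]))
    return report


records = [
    {"bin": "BIN-1", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus bus"},
    {"bin": "BIN-1", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus cus"},
    {"bin": "BIN-2", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus dus"},
    {"bin": "BIN-2", "phylum": "Chordata", "genus": "Xus", "species": "Xus yus"},
    {"bin": "BIN-3", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus bus"},
    {"bin": "BIN-3", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus bus"},
    {"bin": "BIN-4", "phylum": "Arthropoda", "genus": "Aus", "species": "Aus eus"},
]
print(bin_discordance(records))
```

Here the phylum-level conflict in BIN-2 is listed before the species-level conflict in BIN-1, mirroring the order of the report.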

