Prokaryotic Genome Analysis Tool (PGAT) Help Documentation


Overview

The Prokaryotic Genome Analysis Tool (PGAT) is designed to facilitate comparative analysis of closely related bacterial genomes under study. Because PGAT is web-based, it enables multiple researchers in remote locations to easily access the tool. A novel algorithm was developed to identify the genes present in a set of genomes, and to map orthologs between the genomes, also identifying pseudogenes. PGAT's web interface facilitates browsing of genes, and comparison between the different genomes. Features are available for finding genes with certain functions, and determining the presence and absence of genes among specific genomes. Select researchers can also update gene annotations and modify gene ortholog mappings. The system requires a curator to review all annotation proposals in order to enforce consistency and help coordinate the work done by multiple researchers. A history of annotation modifications is recorded and easily viewable through the tool. Users can access the tool through any standard web browser and are not required to install any client software onto their computers.

Several public PGAT instances developed for analyzing and comparing the genomes of particular bacterial species are readily available for researchers. The public may browse the genomes and use the search and comparison tools. Researchers contributing to the annotation and mapping of orthologs must sign into PGAT with a secure login in order to access the annotation update features.


Tutorial

To get started using PGAT and sample some of the tool's features, try out the tutorial. The tutorial will lead you through a gene search, retrieval of an ortholog table for comparing genomes, and inspection of the genes that make up a pathway, among other PGAT features.


Genome List

A PGAT tool is developed for comparative analysis of bacterial organisms belonging to a given genus. We have developed tools for multiple sets of organisms. For each genus, we selected a set of strains to include in the tool. The genome list page lists the genomes selected, along with genome details including:

Some users may be interested in comparing only a subset of the genomes supported through the tool. For example, a researcher may be studying Burkholderia pseudomallei, but not Burkholderia mallei. Users can therefore select only a subset of genomes using the checkboxes in the Genome List page to limit which genomes are used in comparative tools offered throughout PGAT.

In order to identify genes among the selected set of genomes, we have developed an algorithm to identify the full set of distinct genes and then map orthologs within each genome. For each gene, a "type gene" is designated in one of the genomes to serve as a reference. The genome "levels" indicate the order of genomes that was used to assign type genes. When possible, published annotations are utilized for mapping to locus tags, gene names and functional descriptions.

From the Genome List page, navigate down to a specific gene as viewed within the Gene (POSON) Details Page.


Gene (POSON) Details Page

Overview

A POSON is a potentially coding sequence of nucleotides between stop codons in a particular translation frame. Thus, a POSON is an open reading frame (ORF) in the original sense. We use the term "POSON" to avoid the ambiguity in the use of the term "ORF" that sometimes is used to refer to only the translated portion of an open reading frame. For POSONs that have been designated as coding, the details page displays the location, gene name and description, and provides links to tools for comparing orthologs in other strains. The page also displays a variety of information about a given POSON, including results from a variety of bioinformatics sequence analysis programs, a graphic visualizing genome features and analysis results for the surrounding genome regions, and tools to view the current and past annotation assignments. The following sections describe these features in greater detail.
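
The stop-to-stop definition above can be illustrated with a short Python sketch. It is not PGAT's algorithm: the function name, the 0-based half-open coordinates, and the treatment of sequence ends (a span at the edge of the sequence is kept even though it is bounded by a stop codon on one side only) are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_to_stop_spans(seq, frame):
    """Return (start, end) spans between stop codons in one forward frame.

    frame is 0, 1 or 2; coordinates are 0-based and half-open (assumption).
    Reverse-strand frames can be scanned by passing the reverse complement.
    """
    seq = seq.upper()
    spans = []
    span_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > span_start:          # skip empty spans between adjacent stops
                spans.append((span_start, i))
            span_start = i + 3          # next span begins after the stop codon
    # Keep the trailing stretch of complete codons after the last stop.
    last_full = frame + ((len(seq) - frame) // 3) * 3
    if last_full > span_start:
        spans.append((span_start, last_full))
    return spans

example = "ATGAAATAACCCGGGTAGTT"
frame0 = stop_to_stop_spans(example, 0)
frame1 = stop_to_stop_spans(example, 1)
```

In this example, frame 0 contains two spans separated by the TAA codon, while frame 1 begins with a TGA codon and then runs uninterrupted to the end of the sequence.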

Visualization Graphic

The visualization graphic displays the currently selected POSON at the center with its name (either POSON number, locus tag or gene name) displayed in red. The region surrounding the selected POSON is also displayed. The location of the genome region being displayed is indicated by the coordinates in the ruler at the bottom of the image. POSONs are displayed as bars: solid bars indicate a POSON with CDS status (a coding gene), and open bars indicate POSONs that have not been designated coding (e.g. non-coding, a pseudogene, IS element or status unknown). Each POSON is displayed in its appropriate frame of translation, with those located on the forward strand (+1, +2, and +3) shown above those on the reverse strand (-1, -2, and -3).

Some of the results of the analysis pipeline (e.g. COG, PFAM, BLAST) may also be displayed below the POSONs in different colors. Refer to the legend shown at the right of the image for details about the coding for the results. The legend also indicates how other genome features may be depicted in the graphic. POSONs are displayed as color-coded bars above (for forward strand) and below (for reverse strand) the 6-frame translation visualization of the POSONs. CDS Type genes are navy blue, CDS mapped genes are colored light blue, pseudogenes are colored red, and non-coding genes are colored gray.

The graphic can also be useful for navigating through the genome. Click on any POSON to jump to the POSON details page for that POSON. You can also zoom in and out by selecting the zoom level at the right of the image. The zoom levels range from 0.5kb to 20.0kb.

Annotation Summary

The Annotation Summary for a POSON is displayed directly below the visualization graphic. The annotation summary displays the currently accepted annotation for the POSON, including:

If the gene name, gene name aliases, POSON status or start site has been updated by an annotator and the update is currently awaiting curation, a * appears next to the value.

If the POSON is not designated as a type gene, this section also includes some information about the type in the Pan Genome Reference Gene section. Details include the strain in which the type gene was designated, the locus tag, and sequence.

Genomic Comparison

The Genomic Comparison section (dark blue) includes a table with the number of orthologous genes and pseudogenes found in the database. Links to various tools to facilitate comparison of orthologs among the different genomes include:

Visual Alignment

The Visual Alignment feature draws a graphic of the genomic regions surrounding a set of orthologs. Orthologs in the region are color coded so users can quickly assess synteny.

Ortholog Table

The ortholog table lists the orthologs for a gene found in any of the genomes the user selected from the Genome List page.

SNP Table

The SNP table lists all the SNPs identified in the gene based on the Muscle Multiple Sequence Alignment for orthologs in the genomes selected in the Genome List page.
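
As a rough illustration of how SNP positions can be read off a multiple sequence alignment, the Python sketch below scans alignment columns for more than one distinct base. The input layout (a dictionary from genome name to aligned sequence), the 1-based positions, and the decision to ignore gap characters are assumptions; PGAT's own rules are not described here.

```python
def snp_columns(aligned):
    """Return [(position, {genome: base})] for columns with more than one base.

    aligned maps a genome name to its aligned sequence; all sequences must
    have equal length and use '-' for gaps. Positions are 1-based alignment
    columns, and gap-only differences are not reported (assumptions).
    """
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    snps = []
    for pos in range(lengths.pop()):
        column = {genome: seq[pos].upper() for genome, seq in aligned.items()}
        bases = {base for base in column.values() if base != "-"}
        if len(bases) > 1:
            snps.append((pos + 1, column))
    return snps

alignment = {"genomeA": "ATGC-A", "genomeB": "ATGCTA", "genomeC": "ATTCTA"}
snps = snp_columns(alignment)
```

Here only column 3 is reported: column 5 differs only by a gap in genomeA, which this sketch does not treat as a SNP.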

FASTA Files

The FASTA File tool generates either an amino acid or nucleotide FASTA file of the orthologs from the selected genomes.

Muscle (Multiple Sequence Alignment)

This link populates the input field of PGAT's Muscle tool interface with the amino acid or nucleotide sequence in FASTA format. The user can then submit the sequence to retrieve the multiple sequence alignment as determined by Muscle.

Bioinformatics Predictions and Other Tools

POSON

Details about the predicted potential ORF, including the predicted start and end positions within the genome. If the POSON is on the reverse strand, the start coordinate will be larger than the end coordinate. The amino acid length is calculated based on these predicted start and end positions. Other details include the frame of translation and calculated GC content.
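
The coordinate convention above (start larger than end on the reverse strand) can be sketched in Python as follows. The 1-based inclusive coordinates, the function names, and the choice to count every complete codon in the span toward the amino acid length are assumptions for illustration, not a description of PGAT's pipeline.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def poson_summary(genome, start, end):
    """Summarize a span given 1-based inclusive coordinates (assumption).

    start > end is interpreted as a reverse-strand feature.
    """
    strand = "+" if start <= end else "-"
    low, high = min(start, end), max(start, end)
    region = genome[low - 1:high].upper()
    if strand == "-":
        region = reverse_complement(region)
    gc_count = region.count("G") + region.count("C")
    return {
        "strand": strand,
        "nucleotide_length": len(region),
        "amino_acid_length": len(region) // 3,  # complete codons only
        "gc_content": gc_count / len(region),
        "sequence": region,
    }

genome = "AAATGCCCGGGTTT"
forward = poson_summary(genome, 4, 12)
reverse = poson_summary(genome, 12, 4)
```

Both calls cover the same nine nucleotides; the second is read from the opposite strand, so its sequence is the reverse complement while the length and GC content are unchanged.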

Glimmer [Glimmer website]

Glimmer is a system developed at the University of Maryland's Center for Bioinformatics and Computational Biology (CBCB) for finding genes in microbial DNA. We use Glimmer in our pipeline analysis to help predict whether a POSON is actually a coding gene. The Glimmer results displayed in PGAT include the program's prediction for the gene start site which may or may not be the same as the POSON's start position. The Glimmer predicted start site is often several bases downstream from the POSON start. The Glimmer score is also displayed to help determine the status of a POSON.

POSON Translation

The POSON Translation section displays the amino acid sequence for the POSON. The translation is performed by our analysis pipeline using the codon usage table for bacterial genomes. The amino acid corresponding to the currently assigned start site annotation is highlighted in red.

COG [COG NCBI website]

Clusters of Orthologous Groups of proteins (COGs) are phylogenetic classifications of proteins that are maintained by NCBI. These classifications were generated by comparing protein sequences found in the genomes of organisms from multiple major phylogenetic lineages. Our analysis pipeline performs RPS-BLAST queries to align POSON amino acid sequences with the proteins in the COG database. Alignment results are displayed in PGAT, including the name, description and category of the COG, along with the alignment scores, begin and end positions. The COG number is also displayed as a link to an NCBI page listing details about the COG. COG alignments are also often represented graphically in the image as a thin bar under the POSON, enabling researchers to quickly identify POSONs of interest.

COG analyses can be helpful for determining values for several annotation fields including gene description, name or functional class.

PFAM [PFAM website]

Pfam is a database of multiple alignments of protein domains or conserved protein regions. The alignments represent evolutionarily conserved structure that has implications for the protein's function. Pfam alignments can be used to determine whether a potential protein might belong to an existing protein family, even if the homology is weak. Pfam analyses are run on the POSONs in our pipeline and the results are displayed in the POSON details page. A link to the Pfam database is provided from these results for researchers to get further details about the predicted domain alignments. The Pfam alignments are also often represented graphically in the image as a thin bar under the POSON, enabling researchers to quickly identify POSONs of interest.

PSORT-B [PSORT-B website]

PSORTb is a bacterial subcellular localization prediction tool maintained by the Brinkman Laboratory at Simon Fraser University in British Columbia, Canada. We run the standalone version of PSORTb in our analysis pipeline for the protein sequence of each POSON. The results include the confidence value for each of the following localization sites:

If one of the sites has a score of 7.5 or greater, this site is designated the Final Prediction and its score is designated the Final Score. If two sites have high scores, a flag of "This protein may have multiple localization sites" also appears in the Final Prediction field. PGAT also displays the PSORTb results for HMMTop, which predicts the number of transmembrane helices within the sequence, and for Signal, which predicts the presence of a signal peptide. PSORTb results can be useful for researchers to determine the Subcellular Localization annotation value for a given POSON.
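
The 7.5 cutoff can be expressed as a small Python sketch. The site names in the example are the localization sites PSORTb reports for Gram-negative bacteria and are used only as sample data; returning "Unknown" when no site reaches the cutoff is an assumption, and the multiple-localization flag is not modeled because the text above does not define what counts as a high score.

```python
FINAL_PREDICTION_CUTOFF = 7.5

def final_prediction(site_scores, cutoff=FINAL_PREDICTION_CUTOFF):
    """Return (site, score) for the top-scoring site if it meets the cutoff.

    site_scores maps a localization site name to its confidence value.
    ("Unknown", None) is returned when no site qualifies (assumption).
    """
    site, score = max(site_scores.items(), key=lambda item: item[1])
    if score >= cutoff:
        return site, score
    return "Unknown", None

confident = final_prediction({
    "Cytoplasmic": 9.26,
    "CytoplasmicMembrane": 0.24,
    "Periplasmic": 0.48,
    "OuterMembrane": 0.01,
    "Extracellular": 0.01,
})
ambiguous = final_prediction({"Cytoplasmic": 4.0, "Periplasmic": 6.0})
```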

TMHMM [TMHMM website]

TMHMM is a program developed at the Center for Biological Sequence Analysis at the Technical University of Denmark for predicting transmembrane helices in proteins. We run the standalone version of TMHMM in our analysis pipeline for the protein sequence of each POSON. PGAT displays the number of predicted transmembrane helices, whether the n-terminus is predicted to be found inside or outside the membrane, and a reliability score.

SignalP [SignalP website]

SignalP is a program developed at the Center for Biological Sequence Analysis at the Technical University of Denmark for predicting the presence and location of signal peptide cleavage sites in amino acid sequences. We run the standalone version of SignalP in our analysis pipeline for the protein sequence of each POSON. PGAT displays the results for the Hidden Markov Model (HMM) method of prediction for SignalP. Included in the results displayed in PGAT are the prediction of whether the protein has a signal peptide ('Y' for signal peptide, 'N' for no signal peptide), the probability score, and the cleavage site with the highest probability. The cleavage site position indicates the first residue in the mature protein. SignalP results can assist researchers with determining the Subcellular Localization annotation to assign to a POSON.

BLAST [NCBI BLAST website]

BLAST is a tool for determining sequence similarities between genomes that is commonly used for comparative sequence analysis. Our pipeline analysis includes running NCBI's standalone BLAST tool (blastall) to find sequence similarities between the protein sequence of each POSON and the proteins of other genomes. PGAT will typically include protein BLAST results for queries against the non-redundant (nr) database as well as for the Transport Classification Database (TCDB). We often additionally run BLAST queries against other genomes of interest for the particular genome under study, such as those belonging to the same bacterial genus.

PGAT displays summary statistics for the top four BLAST hits for a POSON against a given BLAST database. To inspect additional results, users can follow the link to the pairwise alignment view, which includes the full set of results. The summary statistics displayed in the main POSON details page include the alignment begin, end and length for both the query and subject gene, as well as the e-value, score, % identity, and % positive values. Additional information is given for the subject gene, such as the name, organism, and description. The name will link to the NCBI protein viewer results for the given accession value.

If users want to perform a BLAST query against genomes other than the pre-selected genomes, they may select any of the genomes available in the integrated PGAT BLAST tool. This tool supports both nucleotide and protein BLAST.

View Surrounding Sequence Tool

The 'View Surrounding Sequence' tool enables users to view the nucleotide sequence upstream and downstream from the POSON sequence. The tool takes as input the POSON number, the number of upstream nucleotides (from the 5' end), and the number of the downstream nucleotides (from the 3' end) to display. Options are also available for two different output formats.

The first output format will display the nucleotide sequence in the region specified by the user along with the amino acid translation. If you specify an upstream nucleotides value that is not a multiple of 3, the tool will automatically offset the display of the translation such that it is still in the same frame as the selected POSON. The second output format will display just the nucleotide sequence in FASTA format. The output can be easily saved into a FASTA file on the local computer by selecting to save the page from the web browser in a Text Format.
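
The frame adjustment described for the first output format can be sketched in Python. This is an illustration under stated assumptions rather than the tool's code: the function name is hypothetical, the region is assumed to begin with the requested upstream nucleotides followed immediately by the POSON, and the codon table uses the standard amino acid assignments (which the bacterial table shares).

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_in_poson_frame(region, upstream):
    """Translate a displayed region so codons stay aligned with the POSON.

    region starts with `upstream` nucleotides followed by the POSON. The
    translation starts at upstream % 3 so that the POSON's first codon falls
    on a codon boundary. Unrecognized codons are shown as 'X'.
    """
    offset = upstream % 3
    region = region.upper()
    protein = [
        CODON_TABLE.get(region[i:i + 3], "X")
        for i in range(offset, len(region) - 2, 3)
    ]
    return offset, "".join(protein)

two_upstream = translate_in_poson_frame("GG" + "ATGAAATAA", 2)
four_upstream = translate_in_poson_frame("ACCC" + "ATGAAATAA", 4)
```

With two upstream nucleotides the translation skips both and begins at the POSON itself; with four, it skips one nucleotide and translates one upstream codon before reaching the POSON.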

BLAST Tool

A BLAST tool is integrated into PGAT to facilitate performing BLAST queries, particularly for the nucleotide or amino acid sequence for the current POSON, or to BLAST against a BLAST database formatted particularly for the PGAT tool for the genome under study. Users simply select a BLAST database from a dropdown list of pre-selected databases, specify whether to perform a nucleotide or protein BLAST query, select the format for the output (graphical, text pairwise, or text tabular), and enter the sequence to BLAST. The sequence to BLAST can be easily set to either the amino acid or nucleotide sequence of the POSON, or CDS if this is a coding sequence, using a set of radio buttons.

The graphical output of results will display the query sequence as a line at the top. Bars of varying color, indicating levels of percent identity, are drawn below this line in positions indicating the position and length of the alignment. The identifiers for each hit are displayed at the left of the bar. Details about each hit are also displayed in a table below the graphic. If the identifier of a hit is a valid GI number, a link is supplied to open the NCBI viewer for the protein. The text pairwise option displays the exact text output from the BLAST program for pairwise alignments, and similarly the text tabular displays the exact output from BLAST tabular results.

Additional BLAST databases can be queried against using the Advanced BLAST Query tool. If this link is selected, a new window will open displaying the Advanced BLAST options.


Updating Annotation

Annotators can submit proposals for updating the gene name or description of a gene. The start site indicating the location of the gene within the genome can also be updated. Reasoning for each proposed modification should be supplied in the available comment boxes to help curators decide whether to accept a proposal or not. Once a proposal is made, it is highlighted in yellow and tagged with status 'under review'. If a curator approves the update, it will be highlighted in brown and tagged with status 'most recently approved'. Acceptance by a curator will also officially update the annotation as shown in the Annotation Summary section at the top of each POSON page, as well as in any page displaying lists of POSONs.

Details about each annotation field and tips for using the analysis pipeline results and available PGAT tools to help determine well supported values are given below.

Start Site

The start site indicates the predicted translation start coordinate of a type gene. This value may or may not be the start position of the POSON. The coordinate value is used to indicate the sequence location of a CDS feature in a Genbank file or gene in a .ptt file. Start sites for orthologs in other genomes may be automatically updated when modifying the type gene start site using PGAT tools available to curators.

Gene Name

The Gene name annotation field indicates the name of a coding gene. In general, we have used the gene name specified in the genbank file of the type genome as default.

Description

The Description annotation field provides a description of the gene which could be a functional gene description or simply a gene class family. This field may be referred to as the 'product' in a Genbank file or .ptt file.

Type Gene Note

The Type Gene note is not considered part of the gene annotation but rather serves as a method for annotators to share important observations or ideas about the gene.

KO Number

The KO number is the KEGG Ortholog number assigned to the gene. KO numbers are originally mapped to type gene POSONs using KEGG's mapping of KO numbers to genes in a genome based on locus tags. Annotators can modify these mappings as needed. To unmap a type gene from a KO number, specify a blank KO number.

POSON start site

The POSON start site indicates the predicted translation start coordinate of a gene in a given genome. This value may or may not be the start position of the POSON. The coordinate value is used to indicate the sequence location of a CDS feature in a Genbank file or gene in a .ptt file. Tools for automatically updating POSON start sites for all orthologs when modifying the start site for the type gene are available to curators.

POSON note

POSON note is not considered part of the annotation but rather serves as a method for annotators to share important observations or ideas about the current POSON.


Search Options

Overview

Search for genes based on specified text found in the gene name or description. You can also specify genes by POSON number or locus tag, or by predicted COG category or GO term. Gene lists from search results can be saved and modified, and then used for comparative analysis among different genomes.

Gene Search

Search for genes among all the genomes in a given PGAT instance by specifying a text string to look for in selected annotation fields, such as gene name and description. You may also list poson numbers or locus tags to find information about, and limit results by predicted COG categories or GO terms. Output options include displaying search results in a web table, a tab delimited text file, or FASTA file (amino acid or nucleotide sequences supported). For the web table and tab delimited text file output formats, options for fields to include in the output are provided.

The search results page shows the search text and fields selected, any options specified, and the list of genes that match the criteria. A search results list can be saved for future reference and for additional genome comparison queries. Once a list of genes is saved, genes can be added or removed from the list. To save the results from a search, enter a Search Name and click Save. Searches you save will only be available to you. Saved Searches can be retrieved through the Saved Searches tab.

Saved Searches

Any gene searches you have saved are listed in the Saved Searches tab. The name given to a search can be modified here, or a search can also be deleted here.

Orthologs

The Orthologs search option will generate an ortholog table indicating which genes are found in different genomes. Select any number of genomes available in the PGAT database to include in the analysis. If you have saved searches, you may also select to include only the genes in particular search results lists in the ortholog table. If no saved searches are selected, all genes are used in the analysis.

Pathways

View counts of genes in each KEGG pathway that are found in any selected number of genomes. KEGG pathways are organized into categories that can be expanded and collapsed. To view the results in a given category, click on the expand icon (+). An expanded KEGG pathway category will list all the KEGG pathways that fall in the category. The entry for each pathway includes the name, title, link to the pathway on the KEGG website, and the count of genes that make up the pathway. The total number of genes found in any of the PGAT genomes is also indicated to provide a reference point for comparison with the counts in each individual genome. The counts for each individual genome are also given, and are bolded if they do not match the total number of genes indicated. Pseudogene counts are shown in parentheses.

Click on any of the gene counts to display a graphic pathway map highlighting the genes present in the pathway for a particular genome. The graphic colors orthologs blue, pseudogenes, gene and genes that are not found in the genome pink. A list of the genes in the pathway is also displayed, including details such as KEGG Orthology (KO) number, gene name, description, and locus tag.

Presence and Absence

Find genes that are present in a subset of genomes in the PGAT database, and absent in another subset. An option to include pseudogenes in the analysis is available. Similar to the orthologs query, you can also specify any number of saved search gene lists to limit the analysis to. The results list will display the genes satisfying the input criteria, specifying the locus tag (or poson number) for the genes in the genomes they are present in.
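
A presence and absence query of this kind can be sketched in Python over a hypothetical ortholog table. The data layout (gene, then genome, then a pair of locus tag and pseudogene flag) and the interpretation of the pseudogene option (a pseudogene counts as present only when the option is enabled) are assumptions for illustration.

```python
def is_present(entry, include_pseudogenes):
    """entry is (locus_tag, is_pseudogene), or None when there is no ortholog."""
    if entry is None:
        return False
    _locus_tag, is_pseudogene = entry
    return include_pseudogenes or not is_pseudogene

def presence_absence(table, present_in, absent_in, include_pseudogenes=False):
    """Return {gene: {genome: locus_tag}} for genes matching the query."""
    hits = {}
    for gene, by_genome in table.items():
        in_all = all(
            is_present(by_genome.get(g), include_pseudogenes) for g in present_in
        )
        in_any_excluded = any(
            is_present(by_genome.get(g), include_pseudogenes) for g in absent_in
        )
        if in_all and not in_any_excluded:
            hits[gene] = {g: by_genome[g][0] for g in present_in}
    return hits

# Hypothetical ortholog table with made-up locus tags.
table = {
    "geneA": {"g1": ("G1_0001", False), "g2": ("G2_0010", False)},
    "geneB": {"g1": ("G1_0002", False)},
    "geneC": {"g1": ("G1_0003", False), "g2": ("G2_0030", True)},
}
strict = presence_absence(table, ["g1"], ["g2"])
with_pseudogenes = presence_absence(table, ["g1"], ["g2"], include_pseudogenes=True)
```

In this sketch geneC is reported only when pseudogenes are excluded, because its copy in g2 is a pseudogene.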

Genome Coordinates

Find genes based on their location in a selected genome. The search options include:


SNPs

The PGAT SNPs feature lists all the SNPs for selected genes that appear for the subset of genomes selected in the Genome List page. The list includes gene, gene name, description, SNP position, mutations for each genome, and a flag indicating if the SNP is non-synonymous. Access PGAT's SNP features by selecting the SNPs menu item in the top navigation bar.
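
The non-synonymous flag can be illustrated with a short Python sketch that compares the amino acids encoded by the reference and variant codons. The function name is hypothetical, and the codon table uses the standard amino acid assignments; how PGAT itself computes the flag is not described here.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def is_nonsynonymous(ref_codon, alt_codon):
    """True when the two codons encode different amino acids (or a stop)."""
    return CODON_TABLE[ref_codon.upper()] != CODON_TABLE[alt_codon.upper()]

silent = is_nonsynonymous("GAA", "GAG")      # Glu -> Glu
missense = is_nonsynonymous("GAA", "GAT")    # Glu -> Asp
nonsense = is_nonsynonymous("TGG", "TGA")    # Trp -> stop
```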

Preprocessed Files

Links to two preprocessed files are made available at the top of the page.

  1. Burkholderia pseudomallei
  2. Burkholderia mallei

These files contain the SNPs across all genes. The rules used to create these files are:

  1. For complete genomes, a single ortholog (exactly one, and no more) must be present for a gene to be listed
  2. For draft genomes, an ortholog may not exist, or multiple orthologs may exist (paralogs).
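
The two rules can be read as a filter on per-genome ortholog counts. The Python sketch below is one such reading; the function name and the input layout are hypothetical.

```python
def gene_is_listed(ortholog_counts, complete_genomes):
    """Apply the preprocessed-file rules to one gene.

    ortholog_counts maps genome name to the number of orthologs of the gene.
    Every complete genome must have exactly one ortholog; draft genomes
    (any genome not named in complete_genomes) are unrestricted.
    """
    return all(ortholog_counts.get(g, 0) == 1 for g in complete_genomes)

complete = ["complete1", "complete2"]
kept = gene_is_listed({"complete1": 1, "complete2": 1, "draft1": 0}, complete)
dropped = gene_is_listed({"complete1": 2, "complete2": 1, "draft1": 1}, complete)
```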

SNP query Interface

Select from the list of available genomes to show SNPs for in the SNP output table. You may limit the genes for which SNPs are shown by selecting one of your saved queries (see Search Options for how to save query gene lists). If no saved query is selected, SNPs found in any gene will be displayed.

You can also select whether to include genes with no ortholog or multiple orthologs (paralogs) in some of the selected genomes. By default, only genes with a single ortholog in all selected genomes are displayed (relevant for phylogenetic analyses). Finally, you may select to view the results in a web table in your browser or to download a text formatted (tab-delimited) file. For large result sets, the text format option is recommended.

KEY

KEY for notations used in SNP tables:


Sequin

Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. To use PGAT's tools supporting Sequin, select the Sequin link in the top navigation bar. To view or generate Sequin files, select a genome from the dropdown menu and click Go. A table with the chromosomes or contigs will appear with links to any Sequin files already existing on the PGAT server. There are three steps to generate the necessary Sequin files for submission to NCBI.

  1. Create Locus tags: Using the assigned locus tag prefix for the chromosome, PGAT sequentially assigns locus tags to genes, pseudogenes, tRNA and rRNA in the order they appear on the chromosome.
  2. Create tbl file: The .tbl file is used by the GenBank program tbl2asn to generate all the necessary files for submission. PGAT generates the file in the correct format, specifying gene, pseudogenes, tRNA and rRNA locations, names, and products.
  3. Run tbl2asn: Runs the NCBI tbl2asn commandline tool using as input the .tbl and pre-generated template file to generate the necessary files for submission, including a Sequin (.sqn) file, a Genbank formatted file (.gbf), and some output validation files.
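
The first two steps can be sketched in Python. This is a simplified illustration, not PGAT's generator: the feature dictionaries, the four-digit numbering, the example prefix and sequence identifier are assumptions, and pseudogenes are written here as a gene feature carrying a pseudo qualifier, which is one common way to represent them in the tab-delimited feature table format read by tbl2asn.

```python
def assign_locus_tags(features, prefix, width=4):
    """Number features in chromosome order, e.g. PREFIX_0001 (format assumed)."""
    ordered = sorted(features, key=lambda feat: min(feat["start"], feat["end"]))
    return [
        dict(feat, locus_tag=f"{prefix}_{n:0{width}d}")
        for n, feat in enumerate(ordered, start=1)
    ]

def to_tbl(seq_id, features):
    """Render features as feature-table text: a location line, then qualifiers."""
    lines = [f">Feature {seq_id}"]
    for feat in features:
        lines.append(f"{feat['start']}\t{feat['end']}\tgene")
        if feat.get("name"):
            lines.append(f"\t\t\tgene\t{feat['name']}")
        lines.append(f"\t\t\tlocus_tag\t{feat['locus_tag']}")
        if feat["kind"] == "pseudogene":
            lines.append("\t\t\tpseudo")
            continue
        # CDS, tRNA and rRNA get a second feature with the product name.
        lines.append(f"{feat['start']}\t{feat['end']}\t{feat['kind']}")
        if feat.get("product"):
            lines.append(f"\t\t\tproduct\t{feat['product']}")
    return "\n".join(lines) + "\n"

# Hypothetical features; a reverse-strand feature has start > end.
features = [
    {"start": 900, "end": 400, "kind": "CDS", "name": "dnaA",
     "product": "chromosomal replication initiator protein"},
    {"start": 10, "end": 300, "kind": "tRNA", "product": "tRNA-Ala"},
]
tagged = assign_locus_tags(features, "ABC")
tbl_text = to_tbl("contig1", tagged)
```

The resulting text would then be passed, together with the template file, to tbl2asn as described in step 3.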

Tools

PGAT includes several tools to help with annotation tasks, tracking annotation changes, and comparing sequences.

Scan Annotation

Scan the annotations of genes for potential errors. Select from a list of error types to search for, including duplicated gene names, improperly capitalized gene names, non-alphanumeric gene names, gene names of unusual length, and blank descriptions.
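
The error types listed above can be sketched as simple checks in Python. The specific rules are assumptions chosen for illustration (for example, that a properly capitalized name is three lowercase letters followed by uppercase letters or digits, as in dnaA, and that a usual length is 3 to 6 characters); the criteria PGAT applies are not stated here.

```python
import re
from collections import Counter

def scan_annotations(genes, min_len=3, max_len=6):
    """Return [(locus_tag, problem)] for a list of gene dictionaries.

    Each gene has 'locus_tag', 'name' and 'description' keys (assumed layout).
    """
    name_counts = Counter(g["name"] for g in genes if g["name"])
    problems = []
    for g in genes:
        tag, name = g["locus_tag"], g["name"]
        if not g["description"].strip():
            problems.append((tag, "blank description"))
        if not name:
            continue
        if name_counts[name] > 1:
            problems.append((tag, "duplicated gene name"))
        if not name.isalnum():
            problems.append((tag, "non-alphanumeric gene name"))
        elif not re.fullmatch(r"[a-z]{3}[A-Z0-9]*", name):
            problems.append((tag, "improperly capitalized gene name"))
        if not min_len <= len(name) <= max_len:
            problems.append((tag, "gene name of unusual length"))
    return problems

# Hypothetical annotations.
genes = [
    {"locus_tag": "T1", "name": "dnaA", "description": "replication initiator"},
    {"locus_tag": "T2", "name": "dnaA", "description": "hypothetical protein"},
    {"locus_tag": "T3", "name": "RecA", "description": ""},
    {"locus_tag": "T4", "name": "abc-1", "description": "transporter"},
]
report = scan_annotations(genes)
```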

Annotation History

Browse updates to gene annotations. Specify a date range when annotation modifications (by an annotator) or reviews (by a curator) were made.

Muscle

Run the Muscle software using the PGAT server on a user-defined set of nucleotide or amino acid sequences to generate a sequence alignment.

Advanced Blast

Run BLAST directly on the PGAT server to identify sequence homologs.


Curators Only Page

Overview

The Curators Only Page provides curators with the tools to view annotation proposals awaiting review and to approve or reject them.

Reviewing Annotation Proposals

Annotation proposals for POSONs, pseudogenes, IS elements, tRNAs, rRNAs, and other RNAs are displayed in separate sections. A separate table cell is displayed for each annotation field for a given POSON with an annotation update. On the left side of the table cell is the POSON number and the annotation field updated. The POSON number is also a link to view the POSON Details page for the POSON. The annotation history for this field is displayed, with each line in the history representing one update proposal and its review status. The most recently approved annotation value is displayed in brown. Current proposals awaiting review are displayed in yellow and have buttons available for approving or rejecting the proposal.

Once an annotation proposal is approved or rejected, the page is updated to show the new status (either 'most recently approved' if approved, or 'not approved' if rejected). The color of the line will also be updated. The window showing the POSON status will also be refreshed to reflect the results of the curated review.

Pseudogene, IS element and RNA proposal reviews include curated reviews for adding or deleting these features in addition to updates. Different colors are similarly used to indicate when a feature addition or deletion is waiting curated review. Just as for curated reviews for POSONs, the page will update with new status information and color coding to indicate the review results.

Updating Annotations

The Curators Only page also gives curators the opportunity to easily submit a proposal update. If, for example, a curator examines an annotation update proposal and determines that the true annotation should be something slightly different from both the proposal and the most recently approved annotation, then the curator can use this feature. Just as for annotators, curators should also provide evidence justifying the update in the comment field. After a curator proposal is submitted, it is treated just like any other proposal. The curator must then approve or reject their own proposed update.

Viewing Options

When a POSON (or other feature) link is selected or a proposal is approved or rejected, a window opens displaying the contents of the POSON details page for that POSON. By default, this window appears within the same browser window as the list of annotation proposals awaiting review. This viewing mode works well for wide computer screens; on narrower screens, however, the details may get cut off. To adjust the viewing mode for such screens so that the POSON detail pages are shown within a new browser window instead of within the same window, select the option at the top of the page.

Curation Sessions

After a curator approves or rejects a proposal, the page will update to reflect the change. The color of the line displaying the proposal will change from yellow to either brown to indicate approval, or white to indicate rejection. The buttons for Approve and Reject will also disappear. PGAT was designed to keep all reviewed proposals displayed in this page until the curator is finished looking at all of the proposals. In this manner, curators can easily keep track of proposals they have just reviewed. Once a curator is satisfied with all of the annotation updates, he or she can clear the curation session by selecting the 'Clear Curator Session' button at the bottom of the page. All reviewed proposals will then be cleared from the page.


Comments

Users are welcome to provide feedback about PGAT using the Comments page. Use comments to request gene annotation value updates and ortholog mapping modifications, or to simply report bugs or request features in the tool. When leaving a comment, please enter your name, institution, and a subject line along with your comment. All comments are listed at the bottom of the comment page in reverse chronological order, and will be visible to the public.


© University of Washington 2016