[Special]: Development of a Web Based Database for Orphan Genes of the Apicomplexa
About this Post
The following post comprises a final year Honours project and as such is quite lengthy, at almost 10,000 words. If you are looking for something particular within the article, I would recommend using Ctrl-F to bring up your browser’s search tool. This will allow you to type in a word or phrase and your browser will scan the entire page for you. For those interested in ApiBLAST.vetsci.co.uk: due to server limitations, users outside of a Liverpool university network will not be able to access anything beyond the home page. If you would like to see ApiBLAST in action, I would recommend viewing this video: http://apiblast.vetsci.co.uk/how-to.swf . Thank you for your interest.
Abstract
The parasite Neospora caninum contributes to a large number of cattle deaths worldwide and has therefore become an important target for control (Dubey, Lindsay 1996). N. caninum infected cattle appear symptom free until pregnancy, at which point the dormant N. caninum reverts to an actively dividing tachyzoite stage; the mechanism of this reactivation remains unknown (Trees, Williams 2005). This mechanism is not observed in other species of the Apicomplexa, with the exception of Toxoplasma gondii, where similar reactivation occurs in the immunocompromised, such as during pregnancy in sheep. It is therefore likely that non-homologous or ‘orphan’ genes of N. caninum may be responsible. Developments in bioinformatics mean it is now possible to quickly locate these orphan genes, which may be involved in reactivation, and to analyse their functions using freely available software on the internet. However, with no direct means to both store and search potentially involved genes with the flexibility required, both a website and a MySQL database were created to support this investigation. The website, ApiBLAST (ApiBLAST.vetsci.co.uk), is an efficient tool to quickly locate, filter and analyse orphan genes. The database stores information on homology, derived from e-values determined by BLAST, and on subcellular localisation and structure, determined by a number of tools maintained by the Center for Biological Sequence Analysis, including SignalP (Bendtsen et al. 2004), TargetP (Emanuelsson et al. 2000) and TMHMM (Krogh et al. 2001). Twelve candidate genes were isolated from over 15,000; their functions are not currently fully understood and they are thus possibly of interest for future investigation. ApiBLAST has great potential and flexibility as a bioinformatics tool, which this study has demonstrated by the successful identification of the twelve candidate genes.
Acknowledgements
I would like to thank Dr. Andy Jones for his advice and guidance for the duration of this project. It was with his assistance that I was able to learn a number of new skills (such as many of the computing languages used in this study) which I can continue to use even as the project reaches its close.
I would also like to express thanks to all members of the Post-genomic Bioinformatics group at Liverpool who also offered continued advice and were always available to offer support.
Contents
- Abstract
- Acknowledgements
- Introduction
- Neospora caninum
- Orphan Genes
- BLAST
- Determining Subcellular Localisation
- Perl & MySQL
- User Interface
- Development Methods & Protocols
- Development Environment
- Creating the Database
- Running BLAST
- Initial Development of the ApiBLAST Website
- Storing Data Using MySQL
- Retrieving Results from BLAST
- Determining Subcellular Localisation Using the SignalP Protocol
- Final Amendments to the ApiBLAST Website and MySQL Database
- Method of Analysis
- Results
- NCLIV_010400 – IPR016196
- NCLIV_010610 – IPR001727
- TGME49_113760 – IPR001841
- ApiBLAST MySQL Database
- The ApiBLAST Website
- Discussion
- References
- Appendices
- Appendix 1 – Species Included in the Complete Database on the ApiBLAST Website
- Apicomplexa
- Kinetoplastidia
- Microsporidia
- Model Organisms
- Appendix 2 – Sample BLAST Perl Script
- Appendix 3 – Home Page of ApiBLAST
- Appendix 4 – Sample Perl/HTML Used to Create the ApiBLAST Home Page
- Appendix 5 – The CSS Styles Applied to ApiBLAST
- Appendix 6 – Sample MySQL Language Structure
- Appendix 7 – Sample Perl Script Required to Connect to MySQL
- Appendix 8 – Query Page of ApiBLAST
- Appendix 9 – Results from a SignalP Query
- Appendix 10 – Results Page from an NCBI BLAST Search for NCLIV_010400
Introduction
The development and refinement of bioinformatics tools over recent years has dramatically changed the way we can deal with large amounts of data. Huge datasets can now be processed relatively quickly by the processing power of the typical computer. With this in mind, this project will involve the search for a handful of proteins from over 225,000 known Apicomplexa proteins as listed at the NCBI website (NCBI 2011a).
Neospora caninum causes neosporosis in cattle and can subsequently cause abortion. Because of this, neosporosis poses a major economic and welfare threat across the globe (Dubey, Lindsay 1996). One characterising feature of N. caninum infected cattle is that the parasite lies dormant until pregnancy, at which point it becomes active again and is responsible for either abortion or the birth of an N. caninum infected calf. The method by which N. caninum reactivates during pregnancy is currently unknown. This project will include a bioinformatic study into the possible proteins, and their genes, which could be responsible for the reversion to pathogenicity of N. caninum. To aid this study, a website (‘ApiBLAST’) will be developed alongside the search, acting as both a database in which results can be stored and a tool from which a number of bioinformatic processes can be run. ApiBLAST will be developed with the end user in mind, thus being aesthetically pleasing, informative and easy to use. ApiBLAST is available at http://ApiBLAST.vetsci.co.uk.
Neospora caninum
Neospora caninum is classified within the phylum Apicomplexa and is morphologically similar to the organism Toxoplasma gondii (Trees, Williams 2005). Worldwide, N. caninum is responsible for a large proportion of cattle abortions (Dubey, Lindsay 1996), posing significant economic and welfare losses with around 13% of all UK cattle abortions attributable to neosporosis (Dubey 1999).
Transmission of N. caninum can occur either endogenously or exogenously. Exogenous transmission of N. caninum is propagated by the definitive host, typically canids such as coyotes and dogs (McAllister et al. 1998). The definitive host sheds oocysts of the parasite in its faeces; on farmland (or other areas where cattle may be kept) this can lead to possible contamination of food stores and water sources especially if no preventative measures are imposed to prevent canid access. In such an event, the oocyst-contaminated food or water may be consumed by cattle; this can lead to either abortion (McAllister et al. 1996) or persistent infection with N. caninum (Trees, Williams 2005).
Should cattle become persistently infected, there is the constant possibility of transplacental parasite transmission from dam to foetus. Following consumption of oocysts the parasite develops to its bradyzoite stage, a dormant stage of its lifecycle where it remains quiescent in neurological and muscular cysts. However, during pregnancy the dormant bradyzoite develops into its active tachyzoite stage. During this phase of its lifecycle the parasite undergoes rapid division and will actively invade other tissues, including placental and foetal tissue. Because the infection originates from a reactivated parasite already within the dam, this is known as endogenous transplacental transmission (Trees, Williams 2005).
If transplacental transmission occurs early in gestation, then the calf is aborted. If transmission occurs later in gestation, it is more likely the calf will be born with a persistent infection of N. caninum. An overview of this process can be found in figure 1.
The mechanisms involved in the reactivation of quiescent bradyzoites are currently unknown. By locating orphan genes within the Apicomplexa, specifically N. caninum, it may be possible to establish which proteins and thus which genes are responsible for the reactivation of the quiescent bradyzoites during pregnancy.
Orphan Genes
Orphan genes are those without homologues in other species. In the case of N. caninum reactivation during pregnancy, this mechanism is not observed elsewhere within the Apicomplexa although similar mechanisms are observed within T. gondii (T. gondii is therefore coupled with N. caninum during this search for orphan genes). Because of this, the theory is that searching for genes without homologues (i.e. orphan genes – those with a novel genetic structure unique to T. gondii and N. caninum), will help identify candidate genes involved in the reactivation of bradyzoites during pregnancy.
BLAST
Locating these orphan genes will rely heavily upon the use of bioinformatics. The core search will require the use of BLAST (Basic Local Alignment Search Tool) (Altschul et al. 1990). BLAST is a reliable and relatively fast tool used to perform sequence comparisons of nucleotides or proteins. By comparing the query against often vast databases, BLAST is able to determine the similarity of a protein or nucleotide sequence to those within the database. BLAST attempts to locate the highest scoring pair of identical length segments from two sequences by assigning them a score which alters depending on the number of matches, mismatches and gaps. This score is known as the Maximal Segment Pair score (MSP score) and provides a measure of local similarity for any pair of sequences.
Using just a single query with BLAST can generate hundreds of results. A simple way to determine which result is the most similar to your query is to compare the E value (Expectation Value). The smaller the E value is, the more significant the match is, and therefore in a table of results, often the match or ‘hit’ with the smallest E value is accepted as the most similar hit. It is safe to assume this because the E value decreases exponentially as the score increases (NCBI 2011b).
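To make the relationship between score and E value concrete, the following is a minimal sketch of the standard Karlin-Altschul formula, E = K·m·n·e^(−λS). The default values of λ and K are those commonly quoted for gapped BLOSUM62 scoring and are assumptions here, as are the sequence and database sizes; BLAST itself also applies length corrections that this sketch leaves out.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S).

    lam and k are assumed statistical parameters for the scoring system;
    query_len (m) and db_len (n) are lengths in residues.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative, made-up sizes: a 300-residue query against 5 million residues.
weak_match = expect_value(raw_score=40, query_len=300, db_len=5_000_000)
strong_match = expect_value(raw_score=120, query_len=300, db_len=5_000_000)
```

Each additional point of raw score multiplies E by e^(−λ), which is why a strong alignment has an E value many orders of magnitude below that of a weak one.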
To locate the orphan genes, however, significantly low E values will not be of use. Instead, higher values, which indicate dissimilarity, will be used: at such values any similarity observed is unlikely to be caused by homology and is more likely to be a chance (false positive) match.
Simply revealing orphan genes of N. caninum and T. gondii will not be enough to determine the genes involved in the reactivation of quiescent N. caninum bradyzoites during cattle pregnancy. The results must be investigated further to determine their localisation and function within the cell. This must then be interpreted to select candidate genes which may be involved in the reactivation mechanism.
Determining Subcellular Localisation
The Center for Biological Sequence Analysis (CBSA) has developed a number of tools accessible via the web which allow for the prediction of the subcellular localisation of proteins. The procedure for using these protein subcellular localisation prediction methods is described in the protocol outlined by Olof Emanuelsson et al. (Emanuelsson et al. 2007). This protocol was selected for use with our results from BLAST as it is hypothesised that the proteins involved in the reactivation mechanism will likely contain signal peptides, be involved in the signalling process or form transmembrane helices. This was suggested because it is thought that hormonal changes during pregnancy may be involved in the reactivation of the parasite. These changes could be detected by the expressed proteins of the candidate genes, which must therefore be able to communicate with extracellular processes. As mentioned above, the CBSA protocol can predict the localisation of proteins and will thus aid the determination of whether our candidate genes and their proteins are responsible for, or involved in, the reactivation of the quiescent bradyzoites.
The candidate genes, and thus their proteins, revealed by using BLAST will undergo this protocol, which requires that each protein sequence is initially passed through the TargetP 1.1 server. TargetP 1.1 predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP) (Center for Biological Sequence Analysis 2011). Following this, the sequences are passed through SignalP, which predicts the presence and location of signal peptide cleavage sites in amino acid sequences. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models (Bendtsen et al. 2004). Finally, the protocol suggests the use of the TMHMM server v2.0 (Krogh et al. 2001), which uses an algorithm to detect whether the sequence is likely to contain transmembrane helices or not.
Perl & MySQL
The development of a web-based database from scratch requires prior knowledge of a number of computer programming languages. Languages used in the development of this web-based database included perl and MySQL. Perl is a general purpose, dynamic programming language developed in 1987; it is a versatile language with over 21,000 additional modules available to enhance functionality (Hansen 2011). MySQL is an open source relational database management system, developed in 1995. Such a system allows the storage of data in the form of tables (in a similar manner to standard spreadsheet software) as well as the relationships between values within the tables. MySQL and perl are often used together because they integrate well with each other; the final website relies heavily on this.
User Interface
For ease and simplicity of use, web pages must also be used in conjunction with MySQL and perl, thus requiring knowledge of the languages HTML and CSS. Having simple to understand and easy to use webpages on top of the perl and MySQL scripts gives the user a much less intimidating route to retrieving data from the databases.
A typical method of data retrieval would be like that depicted in figure 2 below. The user enters the web address of the site where, upon arrival, they are greeted by informative webpages. If the user wishes to pull information out of the database, they can make and submit their selections from the webpage. Upon doing so, the information entered by the user is inserted into a perl script. One of the script’s roles is to create a statement which can be sent to MySQL in a form which it both understands and can use to return relevant information to the user. Integration between MySQL and perl then creates a webpage displaying the information the user requested.
Development Methods & Protocols
Development Environment
ApiBLAST was developed under Mac OS X 10.6.7 running: MySQL (version 5.5.9), perl (version 5.10.0) [Necessary modules: DBD-mysql-4.018, DBI-1.616, CGI.pm-3.54 plus all prerequisites], NCBI Blast standalone (legacy version 2.2.24), TextWrangler (Bare Bones Software Inc. Version 3.5.3) and CSSEdit (macrabbit Version 2.6.1).
Creating the Database
To determine which genes could be responsible for the reactivation of N. caninum during pregnancy, a database of apicomplexan genes was required. The theory is that orphan genes (i.e. those with potential responsibility) of N. caninum will not be found elsewhere within the Apicomplexa. To locate such genes, a database of all the known and sequenced genomes of the Apicomplexa was curated from multiple online sources, including ToxoDB (The EuPathDB Project Team 2011) and GeneDB (Sanger Institute 2010). A number of model organisms were also added to the database, as suggested by the National Institutes of Health (NIH) (Francis 2011). See Appendix 1 for the complete list of species used in the database. For each individual species, the entire sequenced proteome was downloaded; this comprised every gene ID and the protein sequence which it encodes. Species proteomes were stored in FASTA format (see box 1) and compiled into a single file.
A separate database containing only N. caninum and T. gondii genes was created alongside the complete database. For these two key species, more information was gathered than the gene ID and predicted protein sequence stored for the other species:
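The compilation step can be sketched as follows. This is an illustration rather than the code used in the project: the function names are hypothetical, and the sketch assumes that the first whitespace-delimited token of each FASTA header is the gene ID.

```python
def parse_fasta(text):
    """Yield (gene_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the gene ID is the first token after ">".
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_proteomes(proteomes):
    """Combine several FASTA texts into one, rejecting duplicate gene IDs."""
    seen, records = set(), []
    for text in proteomes:
        for gene_id, sequence in parse_fasta(text):
            if gene_id in seen:
                raise ValueError(f"duplicate gene ID: {gene_id}")
            seen.add(gene_id)
            records.append(f">{gene_id}\n{sequence}")
    return "\n".join(records) + "\n"
```

Rejecting duplicate IDs is a design choice made for this sketch; it guards against the same proteome being downloaded twice from different sources.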
Table 1 – Overview of the information gathered on Neospora caninum and Toxoplasma gondii
Information | Summary |
Gene ID | Gene ID for each protein sequence |
Molecular Weight | Weight of protein sequence in Daltons |
Isoelectric Point | pH at which molecule carries no net charge |
SignalP Scores | Scores given by SignalP |
SignalP Peptide | Predicted signal peptide sequence |
Annotated GO Function | Annotated gene ontology of protein function |
Annotated GO Process | Annotated gene ontology of protein process |
Annotated GO Component | Annotated gene ontology of protein component |
Predicted GO Function | Predicted gene ontology of protein function |
Predicted GO Process | Predicted gene ontology of protein process |
Predicted GO Component | Predicted gene ontology of protein component |
Predicted Protein Sequence | Protein sequence of the gene |
Running BLAST
BLAST is available to download as a standalone executable which can be run from any PC (NCBI 2011c). The advantage of running BLAST locally (as opposed to on a web server hosted by NCBI, for example) is that there is greater flexibility over its usage. In this case, version blast-2.2.24 was installed locally. Once installed, it became possible to run BLAST using the two previously created databases (N. caninum & T. gondii and the complete database), thus allowing the comparison of every gene in the N. caninum & T. gondii database against every gene of the complete database.
To run BLAST locally, however, it was necessary to create a perl script to link the executable to user input. This allowed for the selection of a FASTA format database (the complete database, from which BLAST then creates the appropriate format) and the query (N. caninum & T. gondii database). The perl script was also required to inform the BLAST executable which program to use (either blastn for nucleotide sequences or blastp for protein sequences); as this project compared protein sequences against one another, the ‘blastp’ program was used. Further options must also be defined in the script, including the output format, the output location and, more importantly, the e-value. See Appendix 2 – Sample BLAST Perl Script for the detailed script used.
In the script, an e-value cut-off of ‘1’ was defined; this meant that results with an e-value of up to 1 (a relatively high value) would be returned in the output. As a result, the initial BLAST search revealed over 58,000 N. caninum ‘hits’ (i.e. proteins of the N. caninum proteome which, when compared and aligned with proteins of the complete database during the running of BLAST, gave an e-value of up to 1) and almost twice as many T. gondii hits. Of all these hits, however, only a few hundred would be relevant, i.e. have a large enough e-value to be considered dissimilar from the rest of the Apicomplexa.
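As an illustration of the options such a wrapper has to set, the sketch below builds command lines for the legacy `formatdb` and `blastall` programs. It is not the script from Appendix 2: the file names are hypothetical, and the tabular output format (`-m 8`) is an assumption about what was used.

```python
def formatdb_command(fasta_path):
    """Command that indexes a protein FASTA file for legacy BLAST."""
    return ["formatdb", "-i", fasta_path, "-p", "T"]  # -p T: protein input


def blastp_command(query_path, db_path, out_path, evalue=1.0):
    """Command that runs blastp and keeps hits up to the given e-value."""
    return [
        "blastall",
        "-p", "blastp",      # protein query against protein database
        "-i", query_path,    # query sequences in FASTA format
        "-d", db_path,       # database previously indexed with formatdb
        "-e", str(evalue),   # e-value cut-off
        "-m", "8",           # tabular output (assumed format)
        "-o", out_path,      # output location
    ]


# Hypothetical file names, for illustration only.
index_cmd = formatdb_command("complete.fasta")
search_cmd = blastp_command("neospora_toxoplasma.fasta", "complete.fasta", "hits.tsv")
# With legacy BLAST installed, each list could be passed to
# subprocess.run(cmd, check=True) to execute it.
```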
Initial Development of the ApiBLAST Website
The BLAST perl script was the first script to be integrated into the ApiBLAST website, forming the main page. The home page of ApiBLAST can be seen in full in Appendix 3 – Home Page of ApiBLAST. The structure of the website was built at a relatively early stage of the project to allow later data to be added with ease. ApiBLAST consists of four main pages: Home (shown in appendix 3), Databases (which describes the species within the database, as in appendix 1), About (which gives a little information about the project and how to use the website) and Query. The Query page was developed later.
An example of the code required to create webpages using HTML can be found in Appendix 4 – Sample Perl/HTML Used to Create the ApiBLAST Home Page. Traditional webpages are coded in HTML alone; however the majority of webpages in ApiBLAST are a combination of HTML & perl. This allows webpages to perform more complicated tasks (such as dynamically receive content from a MySQL server and then display the information in a table).
Aesthetics of a website which is to be presented to other users can be just as important as functionality. The webpages comprising ApiBLAST were styled using another computer language, CSS (cascading style sheets), which grants control over the looks of the webpages. The full code required for the styling of ApiBLAST can be found in Appendix 5 – The CSS Styles Applied to ApiBLAST.
With the structure and aesthetics of ApiBLAST in place and the BLAST script created, the next step in development of the website was to link the BLAST script to the webpage. A summary of how this works can be found in figure 3.
In this process, there is no need to contact a MySQL server: the databases are hosted locally as FASTA files and the query is entered by the user. The website uses a modified version of the BLAST script in appendix 2 (the script in appendix 2 is for use on a local machine at the command prompt), hence the slight differences.
The process outlined in figure 3 is as follows:
- A user defines the genes and their protein sequences to be queried and enters them (in FASTA format) into the sequence box
- The database against which these are to be compared and the e-value are also defined by the user – No further user input is required
- Clicking the ‘BLAST sequence’ button initiates a perl script, ‘blast.pl’
- This script uses the perl CGI module to take user submitted information on the webpage and import it temporarily into the perl script
- The script then executes the blastp (protein vs. protein) program on the server machine i.e. where the website is hosted
- Blastp compares the protein sequence of every gene entered in the query box against the proteome of every species in the database
- If a sequence has a score which yields an e-value of 1 or less (e-value is dependent on the MSP score and the length of the protein sequences) the queried protein, the protein match within the database and the e-value are recorded in a temporary file
- Blastp continues to run until every queried protein has been compared against every protein within the database
- The results are returned in a temporary HTML page which is displayed in the user’s browser
Using this method, it was possible to query the T. gondii and N. caninum proteomes against the complete database of species (see appendix 1). The results of this were then stored in a MySQL database.
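Before the results can be stored, each recorded hit has to be reduced to the three values mentioned above: the queried protein, the matching protein and the e-value. The sketch below assumes the hits were written in legacy BLAST tabular format (twelve tab-separated columns with the e-value in the eleventh); the original output format is not stated, so this is an assumption, and the names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    evalue: float


def parse_tabular(text):
    """Parse tabular BLAST output into a list of Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        # Column 1: query ID, column 2: subject ID, column 11: e-value.
        hits.append(Hit(fields[0], fields[1], float(fields[10])))
    return hits
```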
Storing Data Using MySQL
With >90% of the hits obtained from the initial BLAST search being irrelevant to the aim of this project, a quick and efficient way to both store and filter the results was required. For a smaller scale task, simple spreadsheet software may have been sufficient. However, in this case a more robust tool was required, so the relational database management software MySQL was used. An overview of MySQL can be found in figure 4.
A MySQL database (ApiBLAST) was created and populated with a number of tables. The tables within the database were used to hold all the information created during the course of the project. At this initial stage 4 tables were created: BLAST, Complete, Neospora and Toxoplasma; a further table (SignalP) was added later. For an overview of the contents of each table, refer to figure 5. The MySQL database was hosted locally during development and moved to a remote server once complete, allowing access via the web interface from any computer.
To create the MySQL database and tables as per the structure in figure 5, the MySQL language was used; examples can be found in Appendix 6 – Sample MySQL Language Structure. Once the tables were created, it was possible to import the results of the BLAST search into the corresponding table (the tables Complete, Neospora and Toxoplasma had been populated prior to this).
Once data is stored within MySQL, it is possible to access it via the web using a combination of perl (via the CPAN modules DBI and DBD::mysql), HTML and MySQL. It is possible to build a form using HTML which allows the user to send a query to the MySQL database, mediated by a perl script. The perl script again uses the CGI module to retrieve the user’s query from the form on the webpage, thus enabling querying of the MySQL database from a remote machine and without the need to use the command prompt. All MySQL database queries are performed from the ‘Query’ page of ApiBLAST (see Appendix 8 – Query Page of ApiBLAST).
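The create-then-import sequence can be sketched as below. Two assumptions apply: Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a server, and the column names are hypothetical because the real structure is defined in figure 5 and Appendix 6.

```python
import sqlite3


def create_blast_table(conn):
    """Create a table for BLAST hits (hypothetical column names)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS BLAST (
            query_id TEXT NOT NULL,  -- N. caninum or T. gondii gene ID
            hit_id   TEXT NOT NULL,  -- matching gene in the complete database
            evalue   REAL NOT NULL   -- e-value reported by BLAST
        )
        """
    )
    # An index on evalue keeps range filters fast on large tables.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_blast_evalue ON BLAST (evalue)")


def import_hits(conn, rows):
    """Insert (query_id, hit_id, evalue) rows and return the table size."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO BLAST (query_id, hit_id, evalue) VALUES (?, ?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM BLAST").fetchone()[0]
```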
The process of performing a MySQL query (outlined earlier in figure 4) is as follows:
- The user inputs a query via a form on a webpage (for examples of the types of forms included in the ApiBLAST website, see Appendix 8 – Query Page of ApiBLAST)
- Clicking the ‘Perform Query’ button initiates the relevant query perl script (for example ‘blastquery.pl’); an example script used to connect to a MySQL database is shown in Appendix 7 – Sample Perl Script Required to Connect to MySQL
- The perl CGI module retrieves the information entered by the user in the web form and inputs this temporarily into the perl script
- The same script then uses the DBD::mysql and DBI modules to connect remotely to the MySQL database
- Perl then formulates a MySQL language-coherent query from the information passed to it by the CGI module
- This query is executed in MySQL, which returns the relevant data. This data is passed back to perl
- HTML embedded in the perl script formulates a webpage which displays the results of the MySQL query in a clear, tabulated manner to the user’s web browser
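The middle and final steps of this process, turning form input into a query and turning rows into a table, can be sketched as follows. The form field names, table name and column names are hypothetical. The `?` placeholders follow the style used by Perl's DBI and by Python's `sqlite3`; other database drivers may use a different placeholder syntax.

```python
from html import escape


def build_evalue_query(form):
    """Turn submitted form fields into a SQL string plus bound parameters."""
    min_e = float(form.get("min_evalue", 0.0))
    max_e = float(form.get("max_evalue", 1.0))
    if min_e < 0 or max_e < min_e:
        raise ValueError("invalid e-value range")
    # User input is bound as parameters, never pasted into the SQL text.
    sql = (
        "SELECT query_id, hit_id, evalue FROM BLAST "
        "WHERE evalue BETWEEN ? AND ? ORDER BY evalue DESC"
    )
    return sql, (min_e, max_e)


def render_table(rows):
    """Render result rows as an HTML table, escaping every cell."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

Binding values as parameters rather than concatenating them into the statement is the design choice worth copying, since the query text then cannot be altered by whatever a user types into the form.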
Retrieving Results from BLAST
With all the relevant data stored in a MySQL database and a website infrastructure which allows access to the data, it was then possible to retrieve relevant results from the initial BLAST search via the web. By navigating to the ‘Query’ page of ApiBLAST it is possible to filter the results of the BLAST search by selecting minimum and maximum e-values to display. A higher e-value can be interpreted as a weaker homology between the queried gene (which would belong to either N. caninum or T. gondii) and the hit gene (which would likely be from another species within the Apicomplexa). Considering this, the results of the initial BLAST search, which numbered many thousands, were reduced to a few hundred by selecting only the results with an e-value of 0.7 and above. The resulting genes could thus be considered orphan genes because, according to their e-values, they display no detectable homology to other genes within the Apicomplexa.
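The filter described above works on individual hits. A stricter reading, sketched here as an assumption and not as the method used on the website, keeps a query gene only when its best hit outside N. caninum and T. gondii still has an e-value at or above the threshold. Recognising the two species by gene ID prefix is also an assumption made for this sketch.

```python
def orphan_candidates(hits, threshold=0.7, own_prefixes=("NCLIV_", "TGME49_")):
    """Return query IDs whose best foreign hit has e-value >= threshold.

    hits is an iterable of (query_id, hit_id, evalue) tuples.
    """
    best = {}
    for query_id, hit_id, evalue in hits:
        if hit_id.startswith(own_prefixes):
            continue  # ignore hits within the two species of interest
        best[query_id] = min(evalue, best.get(query_id, float("inf")))
    return sorted(q for q, e in best.items() if e >= threshold)
```

Note that a query with no foreign hits at all never appears in the result, because the BLAST output only lists hits up to the e-value cut-off; such genes would need to be recovered separately.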
Determining Subcellular Localisation Using the SignalP Protocol
Olof Emanuelsson et al. (Emanuelsson et al. 2007) from the CBSA have outlined a protocol for determining the subcellular localisation of an expressed protein; this procedure was adapted and used in the subsequent steps. Following the protocol enables the determination of whether a protein is a signal protein or not and, because it is a prediction, it also indicates how reliable that prediction is. Determination of subcellular localisation is important as it provides major clues towards the function and characteristics of a protein (Emanuelsson et al. 2007).
The procedure mentioned above was performed on each gene from the list of relevant results retrieved by filtering the initial BLAST results. It is described below:
- For eukaryotic sequences such as those of N. caninum and T. gondii, ‘TargetP’ (TargetP 1.1 – http://www.cbs.dtu.dk/services/TargetP/) (Emanuelsson et al. 2000) must be used prior to SignalP
- Insert the protein sequence of the gene(s) in question into the corresponding text box on the TargetP website. It is not necessary to change any of the settings; the authors even advise against it
- Clicking the submit button runs TargetP on the inserted protein sequence(s) (Multiple sequences can be submitted by inserting them all into a FASTA file)
- Once TargetP has completed it will return a predicted localisation which can be either:
- M – Protein located mitochondrially
- S – Protein has a signal peptide
- _ – Protein is localised elsewhere within the cell
- TargetP will also display a ‘reliability class’ (RC), ranging from 1-5; this value indicates how confident TargetP is in its prediction. 1 denotes a prediction which is highly reliable and yields almost zero false positives, whilst 5 denotes a prediction which is unreliable and may include many false positives
- After processing with TargetP, it is necessary to use SignalP (SignalP 3.0 – http://www.cbs.dtu.dk/services/SignalP/) (Bendtsen et al. 2004). Navigate to the SignalP page and select the organism group of eukaryotes, leave the other settings at their defaults and paste in or upload the protein sequences as with TargetP. When done, hit submit to begin processing
- For each protein sequence submitted, 2 sets of results will be returned; a ‘SignalP-NN’ (neural network) result and a ‘SignalP-HMM’ (hidden Markov models) result. For this project, the SignalP-HMM results can be ignored
- For SignalP-NN, four important scores are given which contribute to the reliability of whether a protein is a signal protein or not, these are:
- S-Score – an estimated probability (0-1) of a given position within the protein belonging to a signal peptide. This information is normally shown graphically, so only the maximum S value was recorded in the results
- C-Score – the estimated probability of the position being the first in the mature protein. Again, this information is normally shown graphically so only the maximum value was recorded
- Y-Score – the geometric average of the C-score and a smoothed slope of the S-score. Simply, this gives the best estimate of where the signal peptide is cleaved. Again this is typically shown graphically so only the maximum value was recorded
- D-Score – used to discriminate between signal proteins and non-signal proteins. It is calculated from the mean of the Y and S score and gives better discrimination between SPs and Non-SPs than the Y or S score alone
- The final step is to determine whether the protein is an integral membrane protein; this can be done using a transmembrane α-helix predictor. For this procedure, TMHMM 2.0 (http://www.cbs.dtu.dk/services/TMHMM/) (Krogh et al. 2001) was used
- In a similar manner to TargetP and SignalP beforehand, the protein sequence(s) should be pasted or uploaded into the corresponding box. The default settings are suitable; hit submit to retrieve the results
- When the job is finished, the results page will give the number of predicted transmembrane helices for each protein sequence
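The D-score described in the list above is a simple average, which the following sketch makes explicit. As far as I am aware, SignalP 3.0 averages the maximum Y-score with the mean S-score over the predicted signal peptide; the decision cut-off is deliberately left to the caller, since the appropriate value depends on the SignalP version and organism group.

```python
def d_score(y_max, s_mean):
    """Average of the maximum Y-score and the mean S-score."""
    return (y_max + s_mean) / 2.0


def predicts_signal_peptide(y_max, s_mean, cutoff):
    """True when the D-score reaches a caller-supplied cut-off."""
    return d_score(y_max, s_mean) >= cutoff
```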
Final Amendments to the ApiBLAST Website and MySQL Database
With all the information now gathered, it was possible to put together the final pieces of the website and create the remaining MySQL table. To complete the ApiBLAST database, a final MySQL table ‘SignalP’ was created and populated with the results from the CBSA protocol described earlier (as shown in figure 5). Using the connectivity of HTML, perl and MySQL again, another query tool was added to the ‘Query’ page (this can be seen in Appendix 8 – Query Page of ApiBLAST on the far right). This allows a user to query the data in the MySQL table ‘SignalP’ in a number of ways, including: selecting a minimum and maximum RC, selecting the localisation of results (i.e. M, S or _) or defining the minimum number of transmembrane helices. The form used to query this information is shown in figure 6.
This query tool is thus very powerful as it can return results which fit the criteria of possible orphan genes involved, by some means, with the latent reactivation of N. caninum bradyzoites. By making the following selection a handful of candidate genes are returned:
- RC – Between 1 and 3 i.e. reliable predictions
- Localisation – Signal peptide
- Number of transmembrane helices – At least 1
Upon doing this, the results are returned in the web browser as seen in Appendix 9 – Results from a SignalP Query. It is now possible to analyse these genes further using other bioinformatics tools available on the web.
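The selection above amounts to a single filtered query against the ‘SignalP’ table. The sketch below illustrates the idea using Python's built-in sqlite3 module as a stand-in for MySQL; the column names (gene_id, location, rc, tmh) and the sample rows are assumptions made for illustration and are not the actual ApiBLAST schema or data.

```python
import sqlite3


def query_signalp(conn, rc_min, rc_max, location, min_tmh):
    """Return rows matching the RC range, predicted location and minimum TMH count."""
    sql = (
        "SELECT gene_id, location, rc, tmh FROM signalp "
        "WHERE rc BETWEEN ? AND ? AND location = ? AND tmh >= ? "
        "ORDER BY gene_id"
    )
    # Placeholders keep user-supplied form values out of the SQL text
    return conn.execute(sql, (rc_min, rc_max, location, min_tmh)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signalp (gene_id TEXT PRIMARY KEY, location TEXT, rc INTEGER, tmh INTEGER)"
)
conn.executemany(
    "INSERT INTO signalp VALUES (?, ?, ?, ?)",
    [
        ("gene_1", "S", 2, 1),  # meets every criterion
        ("gene_2", "S", 5, 2),  # prediction too unreliable (RC outside 1-3)
        ("gene_3", "M", 1, 3),  # predicted mitochondrial, not signal peptide
        ("gene_4", "S", 1, 0),  # no transmembrane helix
    ],
)
rows = query_signalp(conn, 1, 3, "S", 1)
print(rows)
```

Loosening the criteria is then only a matter of passing different arguments, which mirrors changing the values in the form.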
Method of Analysis
The results returned from the SignalP query were essentially the best candidates, from all T. gondii and N. caninum genes, for involvement in the reactivation of latent bradyzoites during cattle pregnancy. Not only do they show no homology to other genes of the apicomplexa, as shown by their relatively high e-values of 0.7 and above, they also appear to have an appropriate structure and subcellular localisation according to the SignalP results.
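The e-value criterion can be illustrated with a short sketch that reads BLAST output in CSV form (as produced by -outfmt 10, where the eleventh column is the e-value) and keeps only the query genes whose best hit has an e-value of 0.7 or above. Treating the best hit as the deciding value is an assumption made here for illustration, and the gene names and figures are made up. Query genes with no hits at all do not appear in BLAST output and would need to be handled separately.

```python
import csv
import io


def best_evalues(csv_text):
    """Map each query ID to the smallest e-value among its hits."""
    best = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 11:
            continue  # skip malformed or empty lines
        query, evalue = row[0], float(row[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def orphan_candidates(csv_text, threshold=0.7):
    """Return query IDs whose best hit is no better than the threshold."""
    return sorted(q for q, e in best_evalues(csv_text).items() if e >= threshold)


# Illustrative input only (made-up identifiers and values)
sample = (
    "geneA,hit1,98.5,200,3,0,1,200,1,200,2e-50,400\n"
    "geneA,hit2,30.0,50,35,0,1,50,1,50,0.9,20.1\n"
    "geneB,hit3,25.0,40,30,0,5,44,9,48,0.82,19.6\n"
)
orphans = orphan_candidates(sample)
print(orphans)
```

Here geneA is excluded because one of its hits is highly significant, even though it also has a weak hit.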
To determine the possible functions of these genes, a number of websites provide tools for investigating proteins further. Only 12 genes were returned from the SignalP query, so it was possible to carry out this analysis by hand.
The NCBI BLAST website (available at http://blast.ncbi.nlm.nih.gov/Blast.cgi) (Altschul et al. 1997) should be visited first; here it is possible to enter the protein sequence of the gene and determine more information about it. By using a protein BLAST (or blastp) with the default settings, the top hit on the results page will be the gene of the protein sequence entered. Although this reveals no new information, clicking the gene name will reveal much more detail about it; an example of this is shown in Appendix 10 – Results Page from an NCBI BLAST Search for NCLIV_010400.
As well as the standard BLAST program, PSI-BLAST (Position-Specific Iterated BLAST) can be used; a hosted version is provided by the EBI (available at http://www.ebi.ac.uk/Tools/sss/psiblast/). PSI-BLAST is much more sensitive in locating distant evolutionary relationships and as such can detect weak, though biologically relevant, homologies (Altschul et al. 1997). This makes it useful for locating protein families of significance.
Finally, the European Bioinformatics Institute (EBI) hosts a tool known as IntAct (available at http://www.ebi.ac.uk/intact/main.xhtml) (Aranda et al. 2010). IntAct allows the user to search their database by gene name to see if any protein interaction data is available. By searching IntAct using the results from the SignalP query performed on ApiBLAST, it would be possible to see if there were any interactions between the expressed proteins of the genes and pregnancy related hormones, for example oestrogen or progesterone.
The findings from these analyses were compiled and placed into table 2.
Results
Only 248 orphan genes were discovered from over 15,000 initial genes of T. gondii and N. caninum. Of these (98 belonging to N. caninum and 150 belonging to T. gondii) the ApiBLAST website was successfully able to locate 12 genes which met the criteria of the hypothesised reactivation gene. These 12 candidate genes can be found in table 2 below:
Table 2 – The twelve candidate genes which are returned when querying the ApiBLAST database using the following criteria; ‘Predicted Location’ – Signal peptide, ‘Prediction Reliability’ – Min. 1 & Max. 3 and ‘Number of transmembrane Helices’ – Min. 1
Gene ID | Gene Definition | Protein Family
NCLIV_010400 | Hypothetical Protein | IPR016196
NCLIV_009890 | Putative Transmembrane Domain-Containing Protein |
NCLIV_010610 | Hypothetical Protein | IPR001727
NCLIV_011630 | Hypothetical Protein |
NCLIV_011380 | Putative TB2/DP1, HVA22 domain-containing protein | IPR004345
NCLIV_010350 | Conserved Hypothetical |
NCLIV_011230 | Conserved Hypothetical |
NCLIV_009800 | Conserved Hypothetical |
TGME49_109990 | Conserved Hypothetical |
TGME49_111720 | Heat Shock Protein | IPR001023
TGME49_113080 | Hypothetical Protein |
TGME49_113760 | Hypothetical Protein | IPR001841
The majority of these genes encode hypothetical or conserved hypothetical proteins; these are proteins which are predicted to exist but whose function remains unknown, i.e. ideal candidates for this investigation. Whilst the remaining proteins have either a defined function or a defined protein family, they do not have to be disregarded as candidates. For example, it could be thought that the heat shock protein encoded by TGME49_111720 is unlikely to be involved in reactivation; however, it is possible that the description line of the protein is incorrect. It is also possible that the protein has functions other than those listed in either the description line or the descriptions of the protein families to which it belongs.
Three of the hypothetical proteins have protein families listed in the EBI InterPro database (available at http://www.ebi.ac.uk/interpro/) which can help determine their function.
NCLIV_010400 – IPR016196
NCLIV_010400 contains a ‘Major Facilitator Superfamily’ (MFS) domain (Pao, Paulsen & Saier Jr. 1998). MFS proteins form one of the two largest families of membrane transporters. They are capable of transporting small solutes in response to chemiosmotic ion gradients. Transporters of the MFS family can function as uniporters, antiporters or symporters (EBI 2011c).
The MFS domain is also found in glycerol-3-phosphate transporter of Escherichia coli, which transports glycerol-3-phosphate into the cytoplasm and inorganic phosphate into the periplasm (Huang et al. 2003). The E. coli proton/sugar transporter lactose permease (LacY) also carries this domain, and acts to couple lactose and H+ translocation (Abramson et al. 2003)(Mirza et al. 2006).
NCLIV_010610 – IPR001727
This protein family is currently uncharacterised; however, regions of similarity are found in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Mus musculus and Synechocystis sp. (strain PCC 6803) (EBI 2011a).
TGME49_113760 – IPR001841
Members of the ‘Zinc finger, RING-type’ protein family contain relatively small protein motifs with multiple finger-like protrusions; RING-type domains typically coordinate zinc ions (Klug 1999; EBI 2011b).
ApiBLAST MySQL Database
To show how much data was collected and processed, the final amount of data in each MySQL table has been included; this data can be seen in table 3 below. Each row of data is the equivalent of a single gene. T. gondii has more entries than expected as the proteomes of multiple strains were included in a single table.
Table 3 – The amount of data stored in each table of the ApiBLAST MySQL database, where the number of rows indicates the number of gene entries and the data length indicates the total length of the data stored in that table
Table Name | Total Number of Rows | Total Length of Data
Blast | 119,837 | 9,977,856
Complete | 1,067,204 | 778,043,392
Neospora | 7,945 | 8,929,280
Toxoplasma | 33,411 | 25,755,648
SignalP | 248 | 49,125
The ApiBLAST Website
The development of the ApiBLAST website is essentially the greatest result from this investigation. Each feature of the website is briefly defined below:
- Home (http://138.253.35.110/API-BLAST/cgi-bin/index.pl)
  - A protein sequence can be entered on the home page and a BLAST query then performed, with the ability to alter both the database and e-value threshold
- Databases (http://138.253.35.110/API-BLAST/databases.html)
  - A list of the entire species content of each database used by the BLAST tool on the home page
- About (http://138.253.35.110/API-BLAST/about.html)
  - Displays information relevant to the project
- Query (http://138.253.35.110/API-BLAST/query.html)
  - Search Blast Results
    - Allows the filtering of results held in the ‘blast’ MySQL table by their e-value – useful for determining homology between genes
  - Search the database
    - Gene ID Keywords
      - Look for keywords in gene names – useful for finding specific genes
    - Description Keywords
      - Look for keywords in gene descriptions – useful for finding only certain genes e.g. fluorescent returns all genes which are described as fluorescent
    - Sequence Motifs
      - Look for motifs within the entire predicted protein sequence of a gene
  - Query the molecular weight
    - Return results from either the ‘toxoplasma’ or ‘neospora’ tables and filter the results by their molecular weight
  - Search genes specific to T. gondii and N. caninum
    - Returns information about T. gondii and N. caninum genes, as this information is excluded from the ‘Search the Database’ section; the available information to be queried is the same as that listed in table 1
  - Search the SignalP results
    - Returns information from the SignalP MySQL table; information can be filtered by predicted location, the reliability of the prediction and the number of transmembrane helices
As can be seen, the ‘Query’ page can perform a wealth of tasks, processing and filtering the contents of the MySQL database in a number of ways. Figure 7 gives an overview of the query page, showing the type of response delivered when filling in the different forms on the page. Each form links to a different Perl script; this means there are seven different Perl scripts executable from the query page. Below is a description of each segment within figure 7:
a. Query the BLAST results from the ‘blast’ MySQL table; this table contains the results from the BLAST performed on N. caninum & T. gondii genes vs. the genes of all species within the ‘complete’ MySQL table. Information can be filtered by defining the minimum and maximum e-value. Returned information includes:
Query Gene ID | Gene ID of ‘Hit’ | E-value of the Match
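Filtering the BLAST results by a minimum and maximum e-value corresponds to a simple range query. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; the table layout (query_id, hit_id, evalue) and the sample rows are assumptions for illustration only.

```python
import sqlite3


def hits_in_range(conn, min_evalue, max_evalue):
    """Return BLAST hits whose e-value lies within the given inclusive range."""
    sql = (
        "SELECT query_id, hit_id, evalue FROM blast "
        "WHERE evalue BETWEEN ? AND ? ORDER BY evalue"
    )
    return conn.execute(sql, (min_evalue, max_evalue)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blast (query_id TEXT, hit_id TEXT, evalue REAL)")
conn.executemany(
    "INSERT INTO blast VALUES (?, ?, ?)",
    [
        ("q1", "h1", 1e-30),  # strong match, excluded by the range below
        ("q1", "h2", 0.75),
        ("q2", "h3", 0.9),
    ],
)
weak_hits = hits_in_range(conn, 0.7, 1.0)
print(weak_hits)
```

Storing the e-value as a numeric column, rather than as text, is what allows the range comparison to behave correctly for values written in scientific notation.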
b. Search by gene ID within the ‘complete’ MySQL table i.e. all the model organisms, apicomplexa, microsporidia and kinetoplastidia, thus excluding N. caninum & T. gondii. Returned information includes:
Gene ID | Protein Description | Predicted Protein Sequence
c. Search by protein description of genes within the ‘complete’ MySQL table as above, locate keywords within the description for example: ‘length’, ‘hypothetical’ or ‘fluorescent’. Returned information includes:
Gene ID | Protein Description
d. Search by the predicted protein sequence of genes within the ‘complete’ MySQL table as above. Returned information includes:
Gene ID | Protein Description | Predicted Protein Sequence
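The keyword and motif searches described in b., c. and d. are all substring matches against a single column. A minimal sketch of such a search is given below, again using sqlite3 as a stand-in for MySQL. The column names (gene_id, description, sequence) and the sample rows are hypothetical, and the escape handling shown is one possible approach rather than a description of how the ApiBLAST scripts work.

```python
import sqlite3

SEARCHABLE = {"gene_id", "description", "sequence"}  # hypothetical column names


def escape_like(term):
    """Escape LIKE wildcards so that user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_column(conn, column, term):
    """Return (gene_id, description) rows where the column contains the term."""
    if column not in SEARCHABLE:
        raise ValueError(f"cannot search column: {column}")
    sql = (
        f"SELECT gene_id, description FROM complete "
        f"WHERE {column} LIKE ? ESCAPE '\\' ORDER BY gene_id"
    )
    return conn.execute(sql, (f"%{escape_like(term)}%",)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complete (gene_id TEXT, description TEXT, sequence TEXT)")
conn.executemany(
    "INSERT INTO complete VALUES (?, ?, ?)",
    [
        ("g1", "hypothetical protein", "MKTLLV"),
        ("g2", "heat shock protein", "MAAKLV"),
        ("g3", "conserved hypothetical protein", "MKKLLA"),
    ],
)
by_keyword = search_column(conn, "description", "hypothetical")
by_motif = search_column(conn, "sequence", "KLV")
print(by_keyword, by_motif)
```

Escaping matters for gene IDs in particular, since an underscore (as in NCLIV_010400) would otherwise be treated as a single-character wildcard.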
e. Query the molecular weight of the proteins expressed by N. caninum or T. gondii genes from the data stored in the ‘toxoplasma’ and ‘neospora’ MySQL tables. Returned information includes:
Gene ID | Molecular Weight
f. Because N. caninum & T. gondii are excluded from the queries made by b., c., and d. they can be searched using this alternative form which searches the data in the ‘toxoplasma’ and ‘neospora’ MySQL tables. The form searches by gene ID and has a number of variables by which information is displayed. In the form it is possible to select which information is displayed by using the checkboxes. If all checkboxes are selected, the following column headers are displayed:
I. Gene ID
II. Molecular Weight
III. Isoelectric Point
IV. Signal P Scores
V. Signal P Peptide
VI. Annotated GO Function
VII. Annotated GO Process
VIII. Annotated GO Component
IX. Predicted GO Function
X. Predicted GO Process
XI. Predicted GO Component
XII. Predicted Protein Sequence
The user can select to display any variation of the above in their results table
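One way to support this kind of checkbox-driven output is to map each checkbox to a known column and build the SELECT statement only from that list. The sketch below shows the idea; the checkbox names, column names and function are all hypothetical, since the actual form fields and MySQL columns used by ApiBLAST are not reproduced in this post.

```python
# Hypothetical mapping from checkbox names to column names
CHECKBOX_COLUMNS = {
    "mw": "molecular_weight",
    "pi": "isoelectric_point",
    "go_function": "annotated_go_function",
    "sequence": "protein_sequence",
}
ALLOWED_TABLES = {"toxoplasma", "neospora"}


def build_select(table, checked):
    """Build a SELECT statement that returns only the ticked columns."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    # Gene ID is always shown; unrecognised checkbox names are ignored
    columns = ["gene_id"] + [CHECKBOX_COLUMNS[c] for c in checked if c in CHECKBOX_COLUMNS]
    return f"SELECT {', '.join(columns)} FROM {table} WHERE gene_id LIKE ?"


sql = build_select("neospora", ["mw", "sequence", "not_a_column"])
print(sql)
```

Because column and table names cannot be passed as query placeholders, checking them against a fixed list keeps arbitrary form input out of the SQL text.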
g. The final section of the ‘Query’ page allows users to search the MySQL table ‘SignalP’. This table contains all the results from analysing the genes in the ‘blast’ MySQL table with SignalP, TargetP and TMHMM. Results can be filtered by their ‘RC’ values (i.e. how reliable the prediction is), their ‘Predicted Location’ and their number of transmembrane helices (TMHs). Returned information includes:
Gene ID | Predicted Location | RC | SignalP C-Score | SignalP Y-Score | SignalP S-Score | SignalP D-Score | No. Predicted TMHs
Discussion
An initial set of over 15,000 genes was reduced to 12 candidate genes, which is a substantial reduction. For the majority of these genes, however, little is known about them or their possible relation to N. caninum reactivation during pregnancy. IntAct, one of the websites suggested for analysing the candidate genes, returned no results for any of their expressed proteins; this is probably due to a lack of information on these genes in the IntAct database at present.
This project has shown that there are many possibilities for further investigation as a result of finding the 12 candidate orphan genes. Around half of the 12 candidate genes have functions which are unaccounted for; it was ApiBLAST which successfully located these genes.
What ApiBLAST has demonstrated is that, by using the protocol listed in the method, it is possible to refine information available from a number of online bioinformatics tools and use the information in a new way. Before the creation of ApiBLAST, any suggestion that the hypothesised reactivation genes existed would have been based entirely on theory. ApiBLAST has utilised the wealth of information available on the internet and refined it in a manner which reinforces the hypothesis that the reactivation genes may in fact exist.
The results from this study are also by no means static; the development of the ApiBLAST website means data can be queried any number of times, each search with different criteria. The development of the ApiBLAST database and website could be regarded as of greater importance than the findings of the study. Whilst the development of the website was geared towards answering the question ‘Which genes are responsible for the reactivation of Neospora caninum during cattle pregnancy?’, this does not constitute the sole use of the website. Also, with very simple additions of extra scripts and data, the possibilities for ApiBLAST could be greatly enhanced. For example, more species could be added to the database, more ways to search the data could be created, or greater flexibility and automation could be added. For instance, the CBSA offers standalone versions of SignalP, TargetP and TMHMM. Integrating these standalone versions into ApiBLAST would add much greater functionality and should be relatively simple to implement.
ApiBLAST has given a reliable prediction of a handful of genes suitable for further study; among these could be one or more candidate genes whose protein is involved in the reversion of N. caninum to a rapidly dividing state during cattle pregnancy. This may open the path to the development of a potential vaccine which could save the cattle industry great losses. Sifting through the entire proteomes of N. caninum and T. gondii, or even just N. caninum alone, in an attempt to manually locate a gene responsible for reactivation would have been an incredible time sink. In this sense, by narrowing over 15,000 genes down to 12, ApiBLAST has reduced the number of genes to be examined by roughly 1,250-fold. Even if the requirements for the candidate genes were loosened slightly (by altering the values used in the form shown in figure 6), for instance by including all genes in the query regardless of reliability coefficient or number of TMHs, this would return only 28 candidate genes, thus still offering a major (roughly 500-fold) reduction in the amount of time which would be spent searching for these genes manually amongst the complete proteomes.
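As a quick check of the fold reductions, the figures quoted in the text (a starting set of roughly 15,000 genes, narrowed to 12 candidates, or 28 with loosened criteria) can be divided directly. The sketch assumes that manual search time scales with the number of genes to be examined, and uses 15,000 as a round starting figure.

```python
def fold_reduction(total_genes, candidates):
    """Ratio of the starting gene count to the number of candidates returned."""
    return total_genes / candidates


total = 15_000  # approximate starting number of genes quoted in the text
strict = fold_reduction(total, 12)   # criteria used in this study
relaxed = fold_reduction(total, 28)  # any RC, any number of TMHs
print(f"{strict:.0f}-fold, {relaxed:.0f}-fold")  # prints "1250-fold, 536-fold"
```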
The ease with which such a search can be performed on ApiBLAST is also notable. Returning exactly the same results as those obtained by this study takes no longer than 30 seconds; see this video: http://apiblast.vetsci.co.uk/how-to.swf.
The ApiBLAST database (i.e. the MySQL tables in which all the gene data and results for the ApiBLAST website are stored) is never seen by the user, as it works ‘in the background’, but is another accomplishment of this project. The size of the database is shown in table 3, which demonstrates the effort put into compiling all of the data. This effort is what allows users of the ApiBLAST website to perform searches in a matter of seconds rather than the many hours it would take to source, compile and search the data on a local machine.
In conclusion, this investigation has led to the creation of a system which, in its current state, is easy to use, efficient and powerful. It has the ability to save vast amounts of time and perform a wide array of queries in a matter of seconds. It is built upon a rich database of raw genetic data as well as hand-refined results obtained from processing via well-respected online bioinformatics tools such as those run by the CBSA. To show how effective ApiBLAST is, and how it can be applied in modern bioinformatics, requires only a description of what was achieved during this project: from over 15,000 genes of N. caninum and T. gondii, 12 genes (fitting the hypothesised criteria of genes responsible for the reversion of N. caninum to an active tachyzoite stage during pregnancy) were located – a huge refinement.
References
Abramson, J., Smirnova, I., Kasho, V., Verner, G., Kaback, H.R. & Iwata, S. 2003, “Structure and mechanism of the lactose permease of Escherichia coli”, Science, vol. 301, no. 5633, pp. 610-615.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. 1990, “Basic local alignment search tool”, Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. 1997, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs”, Nucleic acids research, vol. 25, no. 17, pp. 3389-3402.
Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, M., Ghanbarian, A.T., Kerrien, S., Khadake, J., Kerssemakers, J., Leroy, C., Menden, M., Michaut, M., Montecchi-Palazzi, L., Neuhauser, S.N., Orchard, S., Perreau, V., Roechert, B., van Eijk, K. & Hermjakob, H. 2010, “The IntAct molecular interaction database in 2010”, Nucleic acids research, vol. 38, no. SUPPL.1, pp. D525-D531.
Bendtsen, J.D., Nielsen, H., Von Heijne, G. & Brunak, S. 2004, “Improved prediction of signal peptides: SignalP 3.0”, Journal of Molecular Biology, vol. 340, no. 4, pp. 783-795.
Center for Biological Sequence Analysis 2011, 24/02/2011-last update, TargetP 1.1 Server. Available: http://www.cbs.dtu.dk/services/TargetP/ [2011, 04/28].
Dubey, J.P. 1999, “Neosporosis in cattle: Biology and economic impact”, Journal of the American Veterinary Medical Association, vol. 214, no. 8, pp. 1160-1163.
Dubey, J.P. & Lindsay, D.S. 1996, “A review of Neospora caninum and neosporosis”, Veterinary parasitology, vol. 67, no. 1-2, pp. 1-59.
EBI 2011a, IPR001727 Uncharacterised protein family UPF0016. Available: http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001727 [2011, 04/20].
EBI 2011b, IPR001841 Zinc finger, RING-type. Available: http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001841 [2011, 04/20].
EBI 2011c, IPR016196 Major facilitator superfamily domain, general substrate transporter. Available: http://www.ebi.ac.uk/interpro/IEntry?ac=IPR016196 [2011, 04/20].
Emanuelsson, O., Brunak, S., von Heijne, G. & Nielsen, H. 2007, “Locating proteins in the cell using TargetP, SignalP and related tools”, Nature Protocols, vol. 2, no. 4, pp. 953-971.
Emanuelsson, O., Nielsen, H., Brunak, S. & Von Heijne, G. 2000, “Predicting subcellular localization of proteins based on their N-terminal amino acid sequence”, Journal of Molecular Biology, vol. 300, no. 4, pp. 1005-1016.
Francis, S.C. 2011, Model Organisms for Biomedical Research [Homepage of NIH], [Online]. Available: http://www.nih.gov/science/models/ [2011, 04/20].
Hansen, A.B. 2011, The Perl Programming Language. Available: http://www.perl.org/ [2011, 04/30].
Huang, Y., Lemieux, M.J., Song, J., Auer, M. & Wang, D.-N. 2003, “Structure and mechanism of the glycerol-3-phosphate transporter from Escherichia coli”, Science, vol. 301, no. 5633, pp. 616-620.
Klug, A. 1999, “Zinc finger peptides for the regulation of gene expression”, Journal of Molecular Biology, vol. 293, no. 2, pp. 215-218.
Krogh, A., Larsson, B., Von Heijne, G. & Sonnhammer, E.L.L. 2001, “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes”, Journal of Molecular Biology, vol. 305, no. 3, pp. 567-580.
McAllister, M.M., Dubey, J.P., Lindsay, D.S., Jolley, W.R., Wills, R.A. & McGuire, A.M. 1998, “Dogs are definitive hosts of Neospora caninum”, International journal for parasitology, vol. 28, no. 9, pp. 1473-1478.
McAllister, M.M., Huffman, E.M., Hietala, S.K., Conrad, P.A., Anderson, M.L. & Salman, M.D. 1996, “Evidence suggesting a point source exposure in an outbreak of bovine abortion due to neosporosis”, Journal of Veterinary Diagnostic Investigation, vol. 8, no. 3, pp. 355-357.
Mirza, O., Guan, L., Verner, G., Iwata, S. & Kaback, H.R. 2006, “Structural evidence for induced fit and a mechanism for sugar/H+ symport in LacY”, EMBO Journal, vol. 25, no. 6, pp. 1177-1183.
NCBI 2011a, 26/04/2011-last update, Apicomplexa – Protein Results. Available: http://www.ncbi.nlm.nih.gov/protein?term=apicomplexa [2011, 26/04].
NCBI 2011b, 27/04/2011-last update, BLAST Frequently Asked Questions. Available: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect [2011, 04/27].
NCBI 2011c, 02/01/2011-last update, Download BLAST Software. Available: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download [2011, 04/20].
Pao, S.S., Paulsen, I.T. & Saier Jr., M.H. 1998, “Major facilitator superfamily”, Microbiology and Molecular Biology Reviews, vol. 62, no. 1, pp. 1-34.
Sanger Institute 2010, GeneDB. Available: http://www.genedb.org/Homepage [2011, 04/20].
Trees, A.J. & Williams, D.J.L. 2005, “Endogenous and exogenous transplacental infection in Neospora caninum and Toxoplasma gondii”, Trends in parasitology, vol. 21, no. 12, pp. 558-561.
Appendices
Appendix 1 – Species Included in the Complete Database on the ApiBLAST Website
Apicomplexa
Cryptosporidium hominis
Cryptosporidium muris
Cryptosporidium parvum
Plasmodium berghei
Plasmodium chabaudi
Plasmodium falciparum
Plasmodium knowlesi
Plasmodium vivax
Plasmodium yoelii
Eimeria tenella
Theileria annulata
Kinetoplastidia
Leishmania braziliensis
Leishmania infantum
Leishmania major
Leishmania mexicana
Trypanosoma brucei
Trypanosoma congolense
Trypanosoma cruzi
Trichomonas vaginalis
Trypanosoma vivax
Microsporidia
Encephalitozoon cuniculi
Encephalitozoon intestinalis
Enterocytozoon bieneusi
Model Organisms
Felis catus
Gallus gallus
Bos taurus
Canis familiaris
Tursiops truncatus
Drosophila melanogaster
Gorilla gorilla
Cavia porcellus
Equus caballus
Homo sapiens
Mus musculus
Sus scrofa
Oryctolagus cuniculus
Rattus norvegicus
Danio rerio
Arabidopsis thaliana
Caenorhabditis elegans
Dictyostelium discoideum
Daphnia pulex
Neurospora crassa
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Xenopus tropicalis
Appendix 2 – Sample BLAST Perl Script
use strict;
use warnings;

#########################################
## Convert FASTA Files to BLAST format ##
#########################################
print "Enter the name of the .fasta file from which to create a BLAST database:\t";
my $fileName = <STDIN>;    # name of the file which is going to be converted into BLAST format
chomp $fileName;
my $fastaFile = $fileName . ".fasta";

unless (-e $fastaFile . ".pin") {    # -e checks if a database index already exists, thus eliminating the need to recreate it
    if (-e $fastaFile) {             # only runs makeblastdb if the .fasta file exists
        # calls the BLAST+ makeblastdb program to create the needed database files
        my $call = 'C:\"Program Files"\NCBI\blast-2.2.24+\bin\makeblastdb.exe -in ' . $fastaFile;
        system($call);
    }
    else {
        print "That file does not exist";
        closePause();
        exit;
    }
}
else {
    print "\n That database already exists, continuing with blast...";
}
print "\n";
pause();

##########################################################################
## Run the blastp using a user entered query against the above database ##
##########################################################################

### Asks for the file which will be queried:
print "Enter the name of the .fasta file to be queried against the '$fastaFile' database:\t";
my $tmp_input = <STDIN>;
chomp $tmp_input;
my $input = $tmp_input . ".fasta";

### Creates a new file for the output:
print "\nWhere should the output file be saved? (e.g. output.csv):\t";
my $userOut = <STDIN>;
chomp $userOut;
if (-e $userOut) { die "File already exists, please try again with a different file name \n"; }
open(my $user_fh, '>', $userOut) || die "Cannot create file $userOut : $!";
close($user_fh);
print "This program may take a while to run, please wait...\n\n";

### Performs blastp
if (-e $input) {
    my $output         = $userOut;
    my $database       = $fastaFile;
    my $blast_location = 'C:\"Program Files"\NCBI\blast-2.2.24+\bin\blastp.exe';
    # -outfmt 6 gives tab-separated output, 10 gives CSV; -evalue sets the e-value cut-off
    my $command = "$blast_location -query $input -db $database -window_size 0 -evalue 1 -out $output -outfmt 10";
    system($command);

    ### Displays blastp results within the terminal (cmd)
    open(my $blast_fh, '<', $output) || die "Cannot open $output : $!";
    while (my $line = <$blast_fh>) {
        chomp $line;
        my @temp  = split(/,/, $line);
        my $query = $temp[0];
        my $hit   = $temp[1];
        my $eval  = $temp[10];    # the eleventh column of the default CSV layout is the e-value
        if ($eval < 0.1) {
            print "Queried Protein: $query \t Similar Protein:$hit\t e-value:$eval\n";
        }
    }
    close($blast_fh);
}
else {    # exits if the query file name is incorrect
    print "Incorrect database source";
    closePause();
    exit;
}
print "\nThe program ran successfully! '$userOut' has been created in the current directory. \a";
closePause();

### Subroutines (pause and closePause) not shown ###
Appendix 3 – Home Page of ApiBLAST
Appendix 4 – Sample Perl/HTML Used to Create the ApiBLAST Home Page
#!/usr/bin/perl
use strict;
print "Content-type: text/html\n\n";
print <<EOM;
<html>
<head>
<title>ApiBLAST</title>
<link rel="stylesheet" type="text/css" href="/reset.css"/>
<link rel="stylesheet" type="text/css" href="/style.css"/>
</head>
<body>
<div id="container">
<a href="http://liv.ac.uk"><img id="logo" src="/logo.png" /></a>
<div id="nav">
<ul class="blue">
<li><a href="/cgi-bin/index.pl" title="home" class="current"><span>home</span></a></li>
<li><a href="/databases.html" title="databases"><span>databases</span></a></li>
<li><a href="/about.html" title="about"><span>about</span></a></li>
<li><a href="/query.html" title="query"><span>query</span></a></li>
</ul>
</div>
<div id="intro"><br/>
<h1>ApiBLAST</h1>
<p>Use BLAST to search this protein database. Enter your search below in fasta format and select the correct applicable values. This will then perform a BLAST against our database </p><br/></div>
<h2>Enter sequence:</h2>
<p class="desc">Copy and paste your protein sequence into the box below. The sequence should preferably be in <a href="http://en.wikipedia.org/wiki/FASTA_format">FASTA file format</a>. You should then select from the options below to refine your search, or just hit the BLAST sequence button at the bottom of the page, to begin using the default values.</p><br/><br/>
<form id="form" action='blast.pl' method='get'>
<textarea class="textarea" name='seq' style="width:718px" rows=8></textarea><br/><br/>
<input id="blastbutton" type='submit' value='BLAST with Defaults' />
<br/><hr/><br/><br/>
<h2>Select Database:</h2>
<p class="desc">There are a number of predefined databases to select from, for more information about what species are present in each database, please refer to the <a href="/databases.html">database page.</a></p><br/>
<p id="radio"><input id="radio" type="radio" name="db" value="/library/webserver/CGI-Executables/databases/complete.fasta" checked /> Complete Database</p>
<p id="radio"><input id="radio" type="radio" name="db" value="/library/webserver/CGI-Executables/databases/apicomplexaDB.fasta" /> Apicomplexa </p>
<p id="radio"><input id="radio" type="radio" name="db" value="/library/webserver/CGI-Executables/databases/apicomplexa+.fasta" /> Apicomplexa+ </p>
<p id="radio"><input id="radio" type="radio" name="db" value="/library/webserver/CGI-Executables/databases/Neosporacaninumselection.fasta" /> Neospora Caninum </p>
<p id="radio"><input id="radio" type="radio" name="db" value="/library/webserver/CGI-Executables/databases/test.fasta" /> Test </p><br/><br/><br/><br/><hr/><br/><br/>
<h2>Select e-value Threshold:</h2>
<p class="desc">The e-value is a value given to a protein ‘hit’ which describes how similar it is to the protein query. A smaller value indicates a higher relationship with 0 meaning both proteins are the same. See more on <a href="http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect"> e-values here </a>. Set the threshold below to prevent protein hits with higher e-values being included in your results.</p><br/>
<input id="textfield" type="text" name="eval" value="0.1" size="3" /><br/><br/><br/><br/>
<input id="blastbutton" type='submit' value='BLAST Sequence' />
</form>
<br/><br/><br/><div id="footer"><p id="footertext"><a href="http://vetsci.co.uk">© James Watts</a></p></div><br/>
</div>
</body>
</html>
EOM
;
Appendix 5 – The CSS Styles Applied to ApiBLAST
h3, h4, h5, p{
color: #424242;
font-family: "Helvetica Neue", Arial, Helvetica, Geneva,sans-serif;
letter-spacing: 0.5px;
padding-left: 17px;
line-height: 30px;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
#smallp {
font-size: 12px;
line-height: 15px;
padding: 5px 55px;
margin-right: 15px;
text-align: justify;
font-style: italic;}
.desc {
color: #727373;
font-family: "Helvetica Neue", Arial, Helvetica, Geneva,sans-serif;
letter-spacing: 0.5px;
padding: 0 20px;
text-align: justify;
line-height: 30px;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
.desc a, #smallp a, p a {
text-decoration: none;
color: #3e3f3f;}
.desc a:hover, #smallp a:hover , p a:hover {
text-decoration: none;
color: red;}
body {
background-color: #f2f1ed;
text-align: center;}
h1 {
color: #424242;
font-family: "Helvetica Neue", Arial, Helvetica, Geneva,sans-serif;
font-size: 40px;
letter-spacing: 0.5px;
padding: 15px 0 17px 20px;
line-height: 30px;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
#nav {
clear: both;}
#nav ul {
padding: 0px;
margin: 10px 0;
list-style: none;
float: left;
width: 750px;
border-bottom: 2px #969696 solid;
border-top: 2px #969696 solid;}
#nav ul li {
float: left;
display: inline; /*double margin IE6*/
margin: 10px 50px;
color: #969696;}
#nav ul li a {
text-decoration: none;
float:left;
color: #818383;
cursor: pointer;
font: 400 20px/22px "Helvetica Neue", sans-serif;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
#nav ul li a:hover span {
text-decoration: none;
float:left;
color: red;
cursor: pointer;}
#nav ul li a span {
margin: 0 10px 0 -10px;
padding: 1px 8px 5px 18px;
position: relative; /*To fix IE6 problem (not displaying)*/
float:left;}
h2 {
color: #424242;
letter-spacing: 0.5px;
text-shadow: #e1e1e1 2px 2px 5px;
padding: 15px 0 10px 17px;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);
font: bold 20px/30px "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;}
#logo {
padding: 15px;}
#form {
padding-left: 0px;}
#blastbutton {
font-family: Helvetica, sans-serif;
font-size: 14px;
color: #424242;
padding: 5px 10px;
margin-top: 10px;
display: block;
margin-right: auto;
margin-left: auto;
background: -moz-linear-gradient(
top,
#ffffff 0%,
#ffffff 50%,
#d6d3ce);
background: -webkit-gradient(
linear, left top, left bottom,
from(#ffffff),
color-stop(0.50, #ffffff),
to(#d6d3ce));
border-radius: 10px;
-moz-border-radius: 10px;
-webkit-border-radius: 10px;
border: 1px solid #CDC9C9;
-moz-box-shadow:
1px 1px 3px rgba(000,000,000,0.5),
inset 0px 0px 3px rgba(255,255,255,1);
-webkit-box-shadow:
1px 1px 3px rgba(000,000,000,0.5),
inset 0px 0px 3px rgba(255,255,255,1);
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
#blastbutton:hover {
font-family: Helvetica, sans-serif;
font-size: 14px;
color: red;
padding: 5px 10px;
margin-top: 10px;
display: block;
margin-right: auto;
margin-left: auto;
background: -moz-linear-gradient(
top,
#ffffff 0%,
#ffffff 50%,
#d6d3ce);
background: -webkit-gradient(
linear, left top, left bottom,
from(#ffffff),
color-stop(0.50, #ffffff),
to(#d6d3ce));
border-radius: 10px;
-moz-border-radius: 10px;
-webkit-border-radius: 10px;
border: 1px solid #CDC9C9;
-moz-box-shadow:
1px 1px 3px rgba(000,000,000,0.5),
inset 0px 0px 3px rgba(255,255,255,1);
-webkit-box-shadow:
1px 1px 3px rgba(000,000,000,0.5),
inset 0px 0px 3px rgba(255,255,255,1);
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
#info {
padding-top: 20px;
padding-left: 50px;}
#intro {
clear: both;
padding-right: 15px;}
hr {
width: 748px;}
#container {
-moz-box-shadow: 0px 0px 20px #333;
-webkit-box-shadow: 0px 0px 20px #333;} /* source line is truncated; colour assumed to match the -moz- rule above */
.textarea {
margin: 0 17px;}
#radio {
font-size: 18px;
line-height: 35px;}
#textfield {
margin-left: 356px;}
#footertext {
background-color: #e4dee1;
color: #424242;
text-align: center;}
#footertext a {
text-decoration: none;
color: #424242;
font-size: 8px;}
#footertext a:hover {
text-decoration: none;
color: red;
font-size: 8px;}
#datalist li {
font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;
color: #424242;
margin-left: 40px;
line-height: 25px;}
.tier1{
font-size: 25px;
padding: 15px 0px;
font-weight: bold;}
.tier2 {
text-indent: 20px;
font-size: 20px;
padding: 5px 0px;}
.tier3 a, h1 a {
text-decoration: none;
color: #424242;}
.tier3 a:hover, h1 a:hover {
color: red;
text-shadow:
0px -1px 0px rgba(000,000,000,0.2),
0px 1px 0px rgba(255,255,255,1);}
.tier3 {
text-indent: 50px;
list-style-type: disc;
list-style-position: inside;}
/* Tables */
table {
border: 4px solid #424242;
table-layout: fixed;
overflow: hidden;
background-color: #fff;
width: 100%;}
th {
text-align: center;
font-family: helvetica;
padding: 10px;
color: #424242;
border: 2px solid #424242;
background-color: #fff;
word-wrap: break-word;
white-space: normal;}
#tableDesc {
text-align: left;
font-family: helvetica;
padding: 10px;
color: #424242;
border: 2px solid #424242;
background-color: #fff;
word-wrap: break-word;
white-space: normal;}
#tableSequence {
text-align: left;
font-size: 10px;
font-family: helvetica;
padding: 10px;
color: #424242;
border: 2px solid #424242;
background-color: #fff;
word-wrap: break-word;
white-space: normal;}
td {
overflow: hidden;
vertical-align: middle;
font-family: helvetica;
padding: 10px;
text-align: left;
color: #424242;
border: 2px solid #424242;
background-color: #fff;}
tr {
vertical-align: middle;
font-family: helvetica;
}
#table {
margin: 10px 25px;
width: 95%;}
Appendix 6 – Sample MySQL Language Structure
###Creating a database
create database apiblast;
###Creating tables
create table toxoplasma (Gene_ID varchar(20), Molecular_Weight integer(15), Isoelectric_Point varchar(100), SignalP_Scores varchar(100), SignalP_Peptide varchar(200), Annotated_GO_Function text, Annotated_GO_Process text, Annotated_GO_Component text, Predicted_GO_Function text, Predicted_GO_Process text, Predicted_GO_Component text, Predicted_Protein_Sequence text);
create table complete (Gene_ID varchar(70), Description text, Sequence text);
create table signalp (Gene_ID varchar (40), Target_P_Prediction varchar(3), Target_P_RC int(1), Signal_P_C real, Signal_P_Y real, Signal_P_S real, Signal_P_D real, No_Predicted_TMH int(2));
###Loading data into a MySQL table
load data local infile "/Library/WebServer/CGI-Executables/databases/complete.csv" into table complete fields terminated by ',' lines terminated by '\n';
###Altering Tables
alter table toxoplasma modify column molecular_weight integer(15);
alter table complete modify column Gene_ID varchar(70);
###An example query (variables prefixed with $ are supplied by the user)
SELECT * FROM signalp WHERE Target_P_Prediction = '$location' AND No_Predicted_TMH >= $tmh AND Target_P_RC BETWEEN $lower AND $upper;
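The example query above places user-supplied values directly into the SQL text. A minimal sketch of the same filter with bound parameters is given here. It is written in Python, with the standard-library sqlite3 module standing in for MySQL so that it runs without a server. The sample rows and the find_candidates helper are invented for illustration and are not ApiBLAST data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (an assumption made so
# the sketch runs anywhere); the columns mirror the signalp table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE signalp (
           Gene_ID TEXT, Target_P_Prediction TEXT, Target_P_RC INTEGER,
           Signal_P_C REAL, Signal_P_Y REAL, Signal_P_S REAL, Signal_P_D REAL,
           No_Predicted_TMH INTEGER)"""
)

# Made-up rows for illustration only; these are not real predictions.
rows = [
    ("gene_A", "S", 1, 0.9, 0.8, 0.95, 0.85, 2),
    ("gene_B", "M", 2, 0.1, 0.1, 0.20, 0.15, 0),
    ("gene_C", "S", 4, 0.5, 0.4, 0.60, 0.50, 3),
    ("gene_D", "S", 2, 0.7, 0.6, 0.80, 0.70, 0),
]
conn.executemany("INSERT INTO signalp VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def find_candidates(conn, location, tmh, lower, upper):
    """Hypothetical helper: return gene IDs matching the form's four filters."""
    sql = (
        "SELECT Gene_ID FROM signalp "
        "WHERE Target_P_Prediction = ? "
        "AND No_Predicted_TMH >= ? "
        "AND Target_P_RC BETWEEN ? AND ? "
        "ORDER BY Gene_ID"
    )
    # The driver binds each value separately from the SQL text, so user
    # input is never interpreted as SQL.
    return [r[0] for r in conn.execute(sql, (location, tmh, lower, upper))]


print(find_candidates(conn, "S", 1, 1, 3))
```

Because the values travel separately from the statement, a string such as S' OR '1'='1 is compared literally against the column and matches nothing. Perl's DBI supports the same ? placeholder style through prepare and execute.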
Appendix 7 – Sample Perl Script Required to Connect to MySQL
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use DBD::mysql;
use CGI;
###################################
### Receive CGI from Query Form ###
###################################
print "Content-type: text/html\n\n";
my $query = CGI->new;
my $location = $query->param('location');
my $upper = $query->param('upper');
my $lower = $query->param('lower');
my $tmh = $query->param('tmh');
########################
### Connect to MySQL ###
########################
my $dsn = "dbi:mysql:apicomplexa:localhost";
my $dbconnect = DBI->connect($dsn, "root", "xbwvko1990")
or die "Unable to connect: $DBI::errstr\n";
#my $dsn = "dbi:mysql:apiblast:138.253.35.110";
#my $dbconnect = DBI->connect($dsn, "root", "proteome")
#or die "Unable to connect: $DBI::errstr\n";
############
### HTML ###
############
### HTML OMITTED ###
################################
### Puts Query Into a Table ###
################################
# Placeholders (?) let DBI quote the user-supplied values safely
my $sql = "SELECT * FROM signalp WHERE Target_P_Prediction = ? AND No_Predicted_TMH >= ? AND Target_P_RC BETWEEN ? AND ?";
my $query_handle = $dbconnect->prepare($sql);
$query_handle->execute($location, $tmh, $lower, $upper);
# Result columns get their own variables so the form's $tmh is not masked
my ($gene, $targetp, $rc, $pc, $py, $ps, $pd, $tmh_count);
$query_handle->bind_columns(\$gene, \$targetp, \$rc, \$pc, \$py, \$ps, \$pd, \$tmh_count);
while($query_handle->fetch()) {
print "<tr> <td>$gene</td> <td id=\"tableDesc\">$targetp</td> <td>$rc</td> <td>$pc</td> <td>$py</td> <td>$ps</td> <td>$pd</td> <td>$tmh_count</td> </tr>";
}
######################
### Clean Up HTML ###
######################
### HTML OMITTED ###
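The script above writes each fetched value straight into a table cell. As a hedged illustration of the row-building step, the sketch below escapes every value before placing it in the markup. It is written in Python using only the standard library; render_row is a hypothetical helper, the sample values are invented, and the use of class="tableDesc" in place of the repeated id is an assumption that would need a matching rule in the stylesheet.

```python
from html import escape


def render_row(row):
    """Hypothetical helper: build one <tr> from a sequence of column values."""
    cells = []
    for index, value in enumerate(row):
        # class="tableDesc" on the second cell is an assumption; the original
        # script uses id="tableDesc", which would repeat on every row.
        attr = ' class="tableDesc"' if index == 1 else ""
        # escape() converts &, <, > and quotes so values cannot inject markup.
        cells.append(f"<td{attr}>{escape(str(value))}</td>")
    return "<tr> " + " ".join(cells) + " </tr>"


# Invented sample values in the column order of the signalp table.
print(render_row(("gene_A", "S", 1, 0.9, 0.8, 0.95, 0.85, 2)))
```

Escaping matters most for free-text columns such as gene descriptions, where characters like < or & can legitimately appear in the data.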