Document Type

Thesis open access


In recent years, the amount of digital data that we produce has increased exponentially. This flood of information, often referred to as “big data,” is creating both opportunities and challenges in all areas of life. In the domain of biology, technology has enabled us to sequence the genomes of humans and many other organisms, but we are far from understanding the biological roles played by all of these genes. The Gene Ontology seeks to address this problem by annotating genes to terms describing biological processes, molecular functions, and cellular components. However, the ontology’s manual curators cannot keep up with the rate at which information is being discovered and published. Hence, there is a need for computational methods that can rapidly process the biomedical literature and suggest new annotations for verification. This study uses support vector machines to predict Gene Ontology annotations for Saccharomyces cerevisiae (yeast). I tested the usefulness of two types of literature features: co-occurrence of gene names in articles, and co-occurrence in abstracts of gene names with keywords taken from GO term definitions. My results demonstrate that support vector machines using literature co-occurrence data as features can predict GO annotations with high accuracy. In many cases where simple gene-gene co-occurrence does not work well, better results can be obtained using gene-keyword co-occurrence. I found that a very simple text mining strategy — identifying words that occur in only one GO term definition — was an effective way of choosing keywords. Although predictions based on gene-gene co-occurrence and those based on gene-keyword co-occurrence were highly correlated, there are terms for which one set of predictions was significantly more accurate than the other. I was able to combine the two sets of predictions effectively using a voting scheme in which gene-gene predictions were weighted at 70% and gene-keyword predictions at 30%.
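As a rough illustration of two steps mentioned above, the following minimal sketch shows how keywords unique to a single term definition might be selected and how two sets of predictions might be combined with 70%/30% weights. It is not the thesis implementation. The function names, the toy definitions, the use of signed classifier scores, and the decision threshold of zero are all assumptions made for the example; the abstract specifies only the weights and the "one definition only" rule.

```python
import re
from collections import Counter


def unique_definition_words(definitions):
    """Map each term to the words found in its definition and in no other.

    `definitions` is a dict of term ID -> definition text (hypothetical input
    format). Tokenization here is a simple lowercase letter-run split, which
    is an assumption, not the thesis's method.
    """
    tokenized = {
        term: set(re.findall(r"[a-z]+", text.lower()))
        for term, text in definitions.items()
    }
    # Count in how many definitions each word appears.
    doc_freq = Counter(word for words in tokenized.values() for word in words)
    return {
        term: {word for word in words if doc_freq[word] == 1}
        for term, words in tokenized.items()
    }


def weighted_vote(gene_gene_score, gene_keyword_score,
                  w_gene_gene=0.7, w_gene_keyword=0.3, threshold=0.0):
    """Combine two classifier scores with fixed weights.

    Scores are assumed to be signed values (for example, SVM decision values)
    where positive means "annotate". Returns the combined score and the
    resulting yes/no prediction.
    """
    combined = w_gene_gene * gene_gene_score + w_gene_keyword * gene_keyword_score
    return combined, combined > threshold


if __name__ == "__main__":
    # Toy definitions, invented for illustration; not real GO term text.
    toy_definitions = {
        "TERM:A": "transport of ions across a membrane",
        "TERM:B": "transport of proteins into the nucleus",
    }
    print(unique_definition_words(toy_definitions))
    print(weighted_vote(1.0, -2.0))
```

With these weights, a gene-gene prediction prevails unless the gene-keyword prediction disagrees with a score more than 7/3 times as large in magnitude, which is one way to read the heavier weighting of gene-gene co-occurrence.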