Date of Award


Document Type

Thesis campus only


Computer Science

First Advisor

Matthew Hibbs

Second Advisor

Roberto Hasfura


Diabetes is a chronic disease characterized by high blood sugar over a prolonged period, and one in every ten Americans has it. Neural networks have gained attention in large-scale genetic research because of their ability to model non-linear relationships. However, data imbalance, the disproportion between the number of disease samples and the number of healthy samples, can decrease prediction accuracy. In this project, we tackle the data imbalance problem when predicting diabetes from genotype SNP data and phenotype data provided by UK BioBank. The dataset is highly skewed toward healthy samples, which outnumber diabetes samples by a ratio of 20 to 1. We build a phenotype neural network and a genotype neural network, each of which applies sampling techniques to counter the data imbalance before the data are fed to the network. We find that the phenotype neural network outperforms the genotype neural network and achieves 90% accuracy. We conclude that, on our dataset, undersampling counters the data imbalance better than oversampling, and that phenotype data predict diabetes better than genotype data. We also identify the key phenotype and genotype features that contribute most to our models' predictions.
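
The two sampling strategies compared in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which the abstract does not describe; it assumes plain random undersampling and random oversampling with replacement on a binary label (1 for diabetes, 0 for healthy). The function names `undersample` and `oversample` and the toy data are hypothetical, chosen only to mirror the 20-to-1 skew.

```python
import random


def _split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    # Keep every minority sample and an equally sized random subset
    # of the majority class (drawn without replacement).
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    # Draw extra minority samples with replacement to close the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    kept = majority + minority + extra
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 5 diabetes samples and 100 healthy samples, a 20-to-1 skew.
samples = list(range(105))
labels = [1] * 5 + [0] * 100

under_x, under_y = undersample(samples, labels)
over_x, over_y = oversample(samples, labels)

print(len(under_x), sum(under_y))  # 10 samples, 5 of them diabetes
print(len(over_x), sum(over_y))    # 200 samples, 100 of them diabetes
```

Undersampling discards most of the healthy samples, while oversampling repeats the same few disease samples many times. In either case, resampling would normally be applied to the training split only, so that evaluation still reflects the original class ratio.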