Predicting Tumor Grade | R


About this project

As a biologist, I have always been fascinated by tumor development. A nucleotide is inserted or deleted, starting a chain reaction that lets affected cells keep growing beyond their intended life span, and there you have it: cancer. While there have been incredible advancements in treatment, a cancer diagnosis often means complete financial and personal upheaval for patients and their families. Machine Learning offers an opportunity to reduce the cost of prognostic and diagnostic molecular tests that are out of reach for most patients, as well as a chance for more targeted research.

In this project, I used four machine learning algorithms to predict brain tumor Grade based on the presence or absence of mutations in 15 genes.

Random Forest and Bagging were the most accurate models, followed by Boosting, then a simple Decision Tree.

In addition to the well-established factors of Age and IDH1 mutation, the most predictive Genes were PTEN, ATRX, and CIC.

Data Set

This data set was obtained from the UC Irvine Machine Learning Repository. There are 839 rows, each representing a patient with a brain tumor. The 23 Attributes are tumor Grade, source Project, Case_ID, Gender, Age at diagnosis, Primary diagnosis, Race, and presence/absence of mutations on 15 Genes. The two Grades are Glioblastoma Multiforme (GBM, high grade) and Lower-Grade Glioma (LGG). GBM is a fast-growing brain/spinal cord tumor and is the most common type of primary malignant brain tumor in adults.

Tasci, Erdal, Camphausen, Kevin, Krauze, Andra Valentina, and Zhuge, Ying. (2022). Glioma Grading Clinical and Mutation Features. UCI Machine Learning Repository. https://doi.org/10.24432/C5R62J.

This project was created in the R language within RStudio, mainly using the features of the tidymodels parsnip package. All figures were also created in R, and the code can be found in my GitHub repository. Data was cleaned, engineered for analysis, and split into separate training (75% of the data) and test (25% of the data) sets.
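The 75/25 split described above could be sketched roughly as follows. This is only an illustration on synthetic data: the `glioma` data frame, its columns, and the use of `strata = Grade` are assumptions, not details taken from the project code.

```r
library(rsample)

set.seed(42)

# Synthetic stand-in for the cleaned data (NOT the real glioma data):
# 839 rows, a two-level Grade and a few binary mutation flags.
n <- 839
glioma <- data.frame(
  Grade = factor(sample(c("GBM", "LGG"), n, replace = TRUE)),
  IDH1  = rbinom(n, 1, 0.5),
  PTEN  = rbinom(n, 1, 0.2),
  ATRX  = rbinom(n, 1, 0.25)
)

# 75% training / 25% test. Stratifying on Grade (an assumption here)
# keeps the class ratio similar in both sets.
glioma_split <- initial_split(glioma, prop = 0.75, strata = Grade)
glioma_train <- training(glioma_split)
glioma_test  <- testing(glioma_split)

nrow(glioma_train)
nrow(glioma_test)
```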

Data Exploration

Below are the Genes ordered by how commonly they were found to have mutations in this data set.

The impact of IDH1 mutation was also confirmed by both rpart and Variable Importance plots.

IDH1 mutation and Age are both well-established factors in numerous cancer types. To further explore the role of the other genes in determining tumor Grade, the models were created without those two variables, resulting in a deeper rpart decision tree. Each leaf at the bottom is labeled with a probability value on the left and a percentage of the data set on the right. For example, reading from the bottom right, 24% of the patients had no PTEN mutation but did have an ATRX mutation. Those patients have a 0.88 probability of having a Lower-Grade Glioma.


Classification Tree

A basic Classification Tree, created with the function, is robust to outliers and relatively fast to train.

The ROC Area Under the Curve (AUC), used here as a summary of model performance on the test data, is 86% out of a possible 100%.
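As a rough sketch of how a classification tree and its test-set ROC AUC might be obtained with parsnip and yardstick, consider the example below. The data are synthetic, the row-index split is a simplification, and the object names (`dat`, `tree_fit`, `tree_auc`) are hypothetical, so the resulting AUC has no relation to the value reported above.

```r
library(parsnip)
library(yardstick)

set.seed(1)

# Synthetic stand-in data: Grade depends loosely on a made-up IDH1 flag.
n <- 400
IDH1 <- rbinom(n, 1, 0.5)
PTEN <- rbinom(n, 1, 0.2)
p_lgg <- ifelse(IDH1 == 1, 0.9, 0.2)
dat <- data.frame(
  Grade = factor(ifelse(runif(n) < p_lgg, "LGG", "GBM"), levels = c("GBM", "LGG")),
  IDH1, PTEN
)
train <- dat[1:300, ]
test  <- dat[301:400, ]

# Single classification tree using the rpart engine
tree_fit <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification") |>
  fit(Grade ~ ., data = train)

# Class probabilities on the held-out rows (.pred_GBM and .pred_LGG)
tree_probs <- predict(tree_fit, new_data = test, type = "prob")
tree_probs$truth <- test$Grade

# yardstick treats the first factor level (GBM) as the event by default
tree_auc <- roc_auc(tree_probs, truth = truth, .pred_GBM)
tree_auc
```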


The diagonal of the confusion matrix counts the accurate predictions, while the off-diagonal cells count the false positive and false negative predictions. In this case, the model accurately predicted Grade 81 + 86 = 167 times, while there were 39 + 9 = 48 incorrect predictions on the test set.
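The arithmetic above can be checked in a few lines of base R. The four counts come from the text, but which Grade each cell belongs to is an assumption made only to lay out the matrix.

```r
# Counts from the text. The assignment of rows and columns to GBM/LGG
# is assumed for illustration; only the diagonal/off-diagonal split matters.
cm <- matrix(
  c(81, 9,
    39, 86),
  nrow = 2,
  dimnames = list(Prediction = c("GBM", "LGG"), Truth = c("GBM", "LGG"))
)

correct   <- sum(diag(cm))       # 81 + 86 = 167
incorrect <- sum(cm) - correct   # 39 + 9  = 48
accuracy  <- correct / sum(cm)   # 167 / 215

round(accuracy, 3)  # 0.777
```

In other words, these counts correspond to an accuracy of roughly 78% on the 215 test observations.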

Single decision trees can have high variance and are prone to overfitting. We will next try a few ensemble methods in an effort to improve performance.


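Bagging

Ensemble models combine the predictions of several models, in this case via Bootstrap Aggregation, or Bagging: many trees are each trained on a bootstrap resample of the training data and their predictions are aggregated. The baguette package functions were used for this method.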
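A minimal sketch of bagged trees with baguette might look like the following. The data are synthetic, and `times = 25` (the number of bootstrap resamples) is an arbitrary choice for illustration rather than the setting used in the project.

```r
library(parsnip)
library(baguette)  # supplies the rpart engine for bag_tree()

set.seed(2)

# Synthetic stand-in data (NOT the glioma data)
n <- 400
IDH1 <- rbinom(n, 1, 0.5)
PTEN <- rbinom(n, 1, 0.2)
ATRX <- rbinom(n, 1, 0.25)
p_lgg <- ifelse(IDH1 == 1, 0.9, 0.2)
dat <- data.frame(
  Grade = factor(ifelse(runif(n) < p_lgg, "LGG", "GBM"), levels = c("GBM", "LGG")),
  IDH1, PTEN, ATRX
)
train <- dat[1:300, ]
test  <- dat[301:400, ]

# 25 trees, each fit to a bootstrap resample of the training rows
bag_fit <- bag_tree() |>
  set_engine("rpart", times = 25) |>
  set_mode("classification") |>
  fit(Grade ~ ., data = train)

# Aggregated class predictions on the held-out rows
bag_pred <- predict(bag_fit, new_data = test)
bag_accuracy <- mean(bag_pred$.pred_class == test$Grade)
bag_accuracy
```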

In this case, the ROC AUC, accuracy, precision, recall, and f_measure were all 1, with no incorrect predictions on the test set, as confirmed by the confusion matrix.

CONFUSION MATRIX: all tumor Grades in the test set were predicted correctly

Random Forest

Random Forest is another ensemble of trees trained on bootstrapped samples, with a bit of extra randomness added to the model. Rather than considering all variables together, only a subset of predictors is considered for each split. The majority vote is the final prediction.
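To illustrate the idea, here is a small sketch using parsnip's `rand_forest()`. The ranger engine, `mtry = 2`, and `trees = 500` are assumptions chosen for the example, and the data are synthetic. `mtry` is the number of predictors sampled as candidates at each split.

```r
library(parsnip)

set.seed(4)

# Synthetic stand-in data (NOT the glioma data)
n <- 400
IDH1 <- rbinom(n, 1, 0.5)
PTEN <- rbinom(n, 1, 0.2)
ATRX <- rbinom(n, 1, 0.25)
p_lgg <- ifelse(IDH1 == 1, 0.9, 0.2)
dat <- data.frame(
  Grade = factor(ifelse(runif(n) < p_lgg, "LGG", "GBM"), levels = c("GBM", "LGG")),
  IDH1, PTEN, ATRX
)
train <- dat[1:300, ]
test  <- dat[301:400, ]

# 500 bootstrapped trees; only 2 of the 3 predictors are sampled
# as split candidates at each node (requires the ranger package)
rf_fit <- rand_forest(mtry = 2, trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification") |>
  fit(Grade ~ ., data = train)

# The predicted class aggregates the votes of the individual trees
rf_pred <- predict(rf_fit, new_data = test)
rf_accuracy <- mean(rf_pred$.pred_class == test$Grade)
rf_accuracy
```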

As with bagging, the ROC AUC, accuracy, precision, recall, and f_measure were all 1, with no incorrect predictions on the test set confirmed by the confusion matrix.


Boosting

Boosting is another ensemble method; gradient boosting extends the idea of Adaptive Boosting by using Gradient Descent on a loss function. After the first tree is evaluated, the loss function identifies the observations that are difficult to classify, and each subsequent tree is fit to reduce the remaining error, so that later trees better classify the observations that were not initially well classified. The model was specified with the xgboost engine through tidymodels, and a template grid was created for tuning the hyperparameters. The select_best() function identified the best-performing values, which then filled the parameter placeholders. See my GitHub repository for the full code.
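The tuning workflow described here could be sketched along these lines. Which hyperparameters were tuned, the grid ranges, and the 3-fold resampling are all assumptions for the sake of a short example, and the data are synthetic; the project repository holds the actual code.

```r
library(tidymodels)

set.seed(5)

# Synthetic stand-in training data (NOT the glioma data)
n <- 300
IDH1 <- rbinom(n, 1, 0.5)
PTEN <- rbinom(n, 1, 0.2)
ATRX <- rbinom(n, 1, 0.25)
p_lgg <- ifelse(IDH1 == 1, 0.9, 0.2)
train <- data.frame(
  Grade = factor(ifelse(runif(n) < p_lgg, "LGG", "GBM"), levels = c("GBM", "LGG")),
  IDH1, PTEN, ATRX
)

# tune() marks placeholders to be filled in after tuning
boost_spec <- boost_tree(trees = 100, tree_depth = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

boost_wf <- workflow() |>
  add_formula(Grade ~ .) |>
  add_model(boost_spec)

# Small regular grid; learn_rate's range is given on the log10 scale
boost_grid <- grid_regular(
  tree_depth(range = c(2L, 4L)),
  learn_rate(range = c(-2, -0.5)),
  levels = 2
)

folds <- vfold_cv(train, v = 3, strata = Grade)

tuned <- tune_grid(
  boost_wf,
  resamples = folds,
  grid = boost_grid,
  metrics = metric_set(roc_auc)
)

# Pick the best combination, then fill the placeholders with it
best_params <- select_best(tuned, metric = "roc_auc")
final_wf <- finalize_workflow(boost_wf, best_params)
final_fit <- fit(final_wf, data = train)
best_params
```

Note that `select_best()` only returns the winning parameter values; `finalize_workflow()` is the step that substitutes them for the `tune()` placeholders.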

The ROC AUC for this model on the test set was 87%, with a confusion matrix similar to that of the Classification Tree model.

CONFUSION MATRIX: the model correctly predicted 166 tumor Grades and incorrectly predicted 49

Model Comparison

All model predictions were combined to construct ROC Curves for each model. Each curve visualizes performance as the trade-off between sensitivity and specificity, and is summarized by the model's AUC value, where 1 (Bagging and Random Forest) means every test observation was ranked correctly.
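One way to combine predictions and compute a ROC curve per model with yardstick is sketched below. The two "models" are simply simulated probability scores with hypothetical names (`model_a`, `model_b`), not the fitted models from this project.

```r
library(dplyr)
library(yardstick)

set.seed(3)

# Simulated truth and predicted GBM probabilities from two made-up models
n <- 200
truth <- factor(sample(c("GBM", "LGG"), n, replace = TRUE), levels = c("GBM", "LGG"))
strong <- plogis(ifelse(truth == "GBM", 2, -2) + rnorm(n))
weak   <- plogis(ifelse(truth == "GBM", 0.5, -0.5) + rnorm(n))

# Stack all predictions into one long data frame, labeled by model
all_preds <- bind_rows(
  data.frame(model = "model_a", truth = truth, .pred_GBM = strong),
  data.frame(model = "model_b", truth = truth, .pred_GBM = weak)
)

# One ROC curve (sensitivity and specificity per threshold) for each model
curves <- all_preds |>
  group_by(model) |>
  roc_curve(truth, .pred_GBM)

# One AUC value for each model
aucs <- all_preds |>
  group_by(model) |>
  roc_auc(truth, .pred_GBM)

aucs
```

The `curves` data frame holds one row per threshold and model, which is the form needed to draw the curves on a shared set of axes.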


The Bagging and Random Forest models performed without fault on this data set; however, that would likely not be the case with a larger data set. Of note is that the Bagging and Random Forest models outperformed the Boosting model. This may be attributable to outliers or the high dimensionality of the data. Boosting works particularly well with unbalanced data sets, but the Grades were fairly well balanced in this data. It is also possible that the hyperparameter tuning was not fully optimized, and a Boosted model with other parameters tuned may perform better.

It is also important to consider the nature of these Gene mutations, not just their presence. Future work could examine the molecular profiles available for these mutations to further elucidate their predictive value for prognostic and diagnostic purposes. In conjunction with stacking ensemble methods, this could lead to very strong real-world models and increased accessibility.
