Leveraging Random Forest Machine Learning for Subgroup Classification of Medulloblastoma
Abstract
Medulloblastoma (MB), the most prevalent malignant brain tumour in children, requires precise diagnostic methods because of its heterogeneous molecular subgroups. This study uses Random Forest machine learning models to classify medulloblastoma subgroups from DNA methylation and gene expression data. Using the Gene Expression Omnibus dataset GSE85218, which comprises 763 primary MB samples, the study applies variance threshold feature selection during preprocessing. Models were evaluated on precision, recall, F1 score, and accuracy, with the highest performance observed in models trained on the top 1% most variable features from combined DNA methylation and gene expression data. However, all models performed similarly, suggesting that only targeted gene expression and DNA methylation data are required for an accurate diagnosis. Gene Set Enrichment Analysis (GSEA) identified significant pathways related to neural processes, underscoring the tumour’s impact on neural development and function. Candidate biomarkers were drawn from the features that the model ranked as most important, including possible new biomarkers for subgroup diagnosis.
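The workflow summarised above (variance-based feature selection, Random Forest classification, evaluation on precision, recall, F1 score, and accuracy, and ranking of features by importance) can be illustrated with a minimal sketch. It is not the study's code. It runs on synthetic data rather than GSE85218, and the helper names `top_variance_indices` and `run_sketch`, the 1% cut-off applied by rank, the four-class labels, the train/test split, and all hyperparameters are assumptions chosen only for illustration. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def top_variance_indices(X, fraction=0.01):
    """Return column indices of the `fraction` most variable features (hypothetical helper)."""
    n_keep = max(1, int(round(X.shape[1] * fraction)))
    variances = X.var(axis=0)
    # argsort is ascending, so take the last n_keep entries and reverse them.
    return np.argsort(variances)[-n_keep:][::-1]


def run_sketch(seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features, n_subgroups = 200, 2000, 4

    # Synthetic stand-in for a samples x features matrix (NOT GSE85218 data).
    y = rng.integers(0, n_subgroups, size=n_samples)
    X = rng.normal(0.0, 1.0, size=(n_samples, n_features))
    # Make the first 20 features depend on the subgroup label, so they vary more.
    X[:, :20] += 3.0 * y[:, None]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Select features on the training split only, to avoid information leakage.
    keep = top_variance_indices(X_train, fraction=0.01)

    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train[:, keep], y_train)
    y_pred = model.predict(X_test[:, keep])

    # Macro averaging weights every subgroup equally, whatever its sample count.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

    # Rank the retained features by the forest's impurity-based importance.
    order = np.argsort(model.feature_importances_)[::-1]
    ranked_features = keep[order]
    return metrics, keep, ranked_features


if __name__ == "__main__":
    metrics, keep, ranked_features = run_sketch()
    print(metrics)
    print("Most important feature indices:", ranked_features[:5])
```

In this sketch the top-ranked entries of `ranked_features` play the role of candidate biomarkers. With real data they would be probe or gene identifiers rather than column indices, and impurity-based importances would ideally be cross-checked with another measure such as permutation importance.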