Sciences and Mathematics, College of
Math and Computer Science, Department of
Missing data, whatever its cause, is a major frustration in data analysis. This project compares several machine learning techniques for filling in values missing from a dataset. The dataset is a composite built from the overlapping parts of seven datasets downloaded from Gapminder. Each constituent dataset covers one of seven variables across several years: happiness index, cell phones per capita, democracy score, human development index, internet users per capita, murders per capita, and percent of women in government. The main objective is to develop models that accurately infer the missing values, improving data integrity and enabling better analysis, with the hope that correlations among these variables allow for better imputation.
First, the seven datasets are combined into one that keeps only the countries and years for which all seven variables are available. After the data is normalized, each model can be scored by the average percent by which its predictions miss the true values. Next, we build and evaluate several imputation models: zero-fill, mean-fill, linear regression, second-degree polynomial regression, and K-Nearest Neighbors.
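The workflow described above can be illustrated with a minimal Python sketch. It is not the project's code, and several things in it are assumptions: the toy rows are invented rather than taken from Gapminder, min-max scaling stands in for the unspecified normalization, "percent off" is read as mean absolute error on the normalized scale times 100, evaluation is leave-one-out, and the linear model uses a single predictor column. All function names are hypothetical.

```python
from statistics import mean


def min_max_normalize(rows):
    """Scale each column to [0, 1] (assumed normalization scheme)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def zero_fill(train, query, target):
    """Baseline: always guess zero."""
    return 0.0


def mean_fill(train, query, target):
    """Baseline: guess the mean of the target column in the training rows."""
    return mean(row[target] for row in train)


def linear_fill(train, query, target, predictor=0):
    """Ordinary least squares on one predictor column (a simplification)."""
    xs = [row[predictor] for row in train]
    ys = [row[target] for row in train]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (query[predictor] - mx)


def knn_fill(train, query, target, k=2):
    """Average the target over the k rows closest on the other columns."""
    def dist(row):
        return sum(
            (a - b) ** 2
            for i, (a, b) in enumerate(zip(row, query))
            if i != target
        )
    nearest = sorted(train, key=dist)[:k]
    return mean(row[target] for row in nearest)


def percent_off(rows, target, impute):
    """Hide each row's target in turn, impute it, and average the error."""
    errors = []
    for i, row in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        errors.append(abs(impute(train, row, target) - row[target]))
    return 100 * mean(errors)


# Invented toy data: two perfectly correlated columns, complete rows only.
toy_rows = min_max_normalize([[0, 0], [1, 1], [2, 2], [3, 3]])
scores = {
    f.__name__: percent_off(toy_rows, 1, f)
    for f in (zero_fill, mean_fill, linear_fill, knn_fill)
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}% off")
```

Because the toy columns are perfectly linear, the linear model scores best here by construction; the sketch shows only the shape of the comparison and does not reproduce the study's results.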
Our findings indicate that the linear regression model predicts missing values better than the other methods, with predictions an average of 3.09% off from the true values. This study demonstrates how effective machine learning models can be at imputing missing values in a dataset, especially when multiple variables are correlated.
Cannon, Jaxon, "Filling in Missing Values" (2023). Belmont University Research Symposium (BURS). 196.