Data Preparation and Feature Engineering in High-Dimensional Predictive Modeling
DOI: https://doi.org/10.63125/5txtr530

Keywords: Data Preprocessing, Feature Engineering, Predictive Modeling, Machine Learning, Data Analytics

Abstract
Data preprocessing and feature engineering are critical determinants of predictive modeling effectiveness, particularly in large-scale environments characterized by inconsistencies, missing values, and heterogeneous variable structures. This study examined the impact of structured preprocessing and feature engineering strategies on machine learning performance using a quantitative experimental design applied to a dataset of 12,500 observations and 48 predictor variables. Multiple preprocessing techniques, including missing value imputation, normalization, categorical encoding, feature construction, and feature selection, were evaluated across five supervised learning algorithms using repeated 10-fold cross-validation. Results demonstrated substantial performance gains over the baseline model, with average classification accuracy improving from 71.4% to 84.7%, F1-score increasing from 0.69 to 0.86, and AUC-ROC rising from 0.74 to 0.91. Statistical testing confirmed significant improvements at the 0.05 level, with moderate to large effect sizes observed for feature engineering and selection interventions. These findings provide empirical evidence that comprehensive preprocessing pipelines meaningfully enhance predictive accuracy, model robustness, and analytical reliability, underscoring their importance as a foundational component of predictive analytics workflows.
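The preprocessing steps the abstract names (missing value imputation, normalization, and categorical encoding) can be illustrated with a minimal sketch. The function names and toy data below are hypothetical, not taken from the study; in practice a library pipeline (e.g., scikit-learn) would typically be used instead.

```python
from statistics import mean

def impute_mean(values):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale a numeric column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 indicator list per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Toy columns: one numeric feature with a missing value, one categorical feature.
age = impute_mean([20, None, 40])    # mean of observed values fills the gap
age_scaled = min_max_scale(age)
colour = one_hot(["red", "blue", "red"])
```

Each transformed column would then feed into the downstream learners; in the study's design, the full pipeline was evaluated across five supervised algorithms under repeated 10-fold cross-validation.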
