TY - GEN
T1 - Comparing Predictive Machine Learning Algorithms in Fit for Work Occupational Health Assessments
AU - Charapaqui-Miranda, Saul
AU - Arapa-Apaza, Katherine
AU - Meza-Rodriguez, Moises
AU - Chacon-Torrico, Horacio
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
AB - Some studies have tried to develop predictors for fitness for work (FFW). This study assessed whether factors used in occupational medical practice could predict an individual's fit for work result. We used a Peruvian occupational medical examination dataset of 33347 participants, from which we obtained a reduced dataset of 2650. It was split into two subsets, a training dataset and a test dataset. Using the training dataset, logistic regression, decision tree, random forest, and support vector machine models were fitted, and the important variables of each model were identified. Hyperparameter tuning was an important part of fitting the non-parametric models. The Area Under the Curve (AUC) metric was used for model selection with a 5-fold cross-validation approach. The results show logistic regression to be the best model and the most powerful predictor (AUC = 60.44%, Accuracy = 68.05%). The analysis of the best variables for fitness for work evaluation obtained with the Random Forest approach is also worth noting. These results also reveal that the criteria associated with the workplace and the occupational clinical criteria have a low level of prediction. Further studies should work with imbalanced data and process bigger datasets in order to obtain more robust models.
KW - Data science
KW - Machine learning
KW - Occupational health
UR - http://www.scopus.com/inward/record.url?scp=85084850887&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-46140-9_21
DO - 10.1007/978-3-030-46140-9_21
M3 - Conference contribution
AN - SCOPUS:85084850887
SN - 9783030461393
T3 - Communications in Computer and Information Science
SP - 218
EP - 225
BT - Information Management and Big Data - 6th International Conference, SIMBig 2019, Proceedings
A2 - Lossio-Ventura, Juan Antonio
A2 - Condori-Fernandez, Nelly
A2 - Valverde-Rebaza, Jorge Carlos
PB - Springer
T2 - 6th International Conference on Information Management and Big Data, SIMBig 2019
Y2 - 21 August 2019 through 23 August 2019
ER -