Is it ever recommended to use mean/multiple imputation when using tree-based predictive models?
Every time I build a predictive model and have missing data, I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd sentinel value that will never be seen in practice (even if the variable is unbounded, I can take the exponential of the variable and make the unknown values negative). The main advantage is that the model knows the variable is missing, which is not the case with, say, mean imputation. I can see that this could be disastrous for linear models or neural networks, but tree-based models handle it quite smoothly. I know there is a great deal of literature on missing-data imputation, but when and why would I ever use those methods when dealing with missing data in predictive (tree-based) models? A minimal sketch of the sentinel approach I mean (hypothetical helper and toy data, assuming pandas):
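
```python
import numpy as np
import pandas as pd

def sentinel_impute(df, sentinel=-9999.0, unknown_label="UNKNOWN"):
    """Replace numeric NaNs with an out-of-range sentinel and
    non-numeric NaNs with an explicit 'UNKNOWN' level, so a tree
    can route missing rows into their own branch."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(sentinel)
        else:
            out[col] = out[col].astype(object).fillna(unknown_label)
    return out

# Toy example (made-up data)
df = pd.DataFrame({
    "income": [52_000.0, np.nan, 61_500.0],
    "region": ["north", None, "south"],
})
print(sentinel_impute(df))
```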