Regression with Lots of Categorical Variables












1















I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature.



I am not getting very good R2 at all. I am wondering if, aside from creating dummies, there are any special strategies in these situations related to having so many cat feature.



For examples, are any regressors perhaps more suited to dummy variables?










share|cite|improve this question























  • What do you want to accomplish with this regression?

    – Dimitriy V. Masterov
    Jan 9 at 5:04











  • @DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

    – Odisseo
    Jan 9 at 5:13











  • How much data do you have in terms of observations and non-excluded dummies?

    – Dimitriy V. Masterov
    Jan 9 at 5:15











  • I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

    – Odisseo
    Jan 9 at 5:23
















1















I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature.



I am not getting very good R2 at all. I am wondering if, aside from creating dummies, there are any special strategies in these situations related to having so many cat feature.



For examples, are any regressors perhaps more suited to dummy variables?










share|cite|improve this question























  • What do you want to accomplish with this regression?

    – Dimitriy V. Masterov
    Jan 9 at 5:04











  • @DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

    – Odisseo
    Jan 9 at 5:13











  • How much data do you have in terms of observations and non-excluded dummies?

    – Dimitriy V. Masterov
    Jan 9 at 5:15











  • I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

    – Odisseo
    Jan 9 at 5:23














1












1








1








I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature.



I am not getting very good R2 at all. I am wondering if, aside from creating dummies, there are any special strategies in these situations related to having so many cat feature.



For examples, are any regressors perhaps more suited to dummy variables?










share|cite|improve this question














I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature.



I am not getting very good R2 at all. I am wondering if, aside from creating dummies, there are any special strategies in these situations related to having so many cat feature.



For examples, are any regressors perhaps more suited to dummy variables?







regression categorical-data categorical-encoding






share|cite|improve this question













share|cite|improve this question











share|cite|improve this question




share|cite|improve this question










asked Jan 9 at 4:47









OdisseoOdisseo

175




175













  • What do you want to accomplish with this regression?

    – Dimitriy V. Masterov
    Jan 9 at 5:04











  • @DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

    – Odisseo
    Jan 9 at 5:13











  • How much data do you have in terms of observations and non-excluded dummies?

    – Dimitriy V. Masterov
    Jan 9 at 5:15











  • I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

    – Odisseo
    Jan 9 at 5:23



















  • What do you want to accomplish with this regression?

    – Dimitriy V. Masterov
    Jan 9 at 5:04











  • @DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

    – Odisseo
    Jan 9 at 5:13











  • How much data do you have in terms of observations and non-excluded dummies?

    – Dimitriy V. Masterov
    Jan 9 at 5:15











  • I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

    – Odisseo
    Jan 9 at 5:23

















What do you want to accomplish with this regression?

– Dimitriy V. Masterov
Jan 9 at 5:04





What do you want to accomplish with this regression?

– Dimitriy V. Masterov
Jan 9 at 5:04













@DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

– Odisseo
Jan 9 at 5:13





@DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!

– Odisseo
Jan 9 at 5:13













How much data do you have in terms of observations and non-excluded dummies?

– Dimitriy V. Masterov
Jan 9 at 5:15





How much data do you have in terms of observations and non-excluded dummies?

– Dimitriy V. Masterov
Jan 9 at 5:15













I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

– Odisseo
Jan 9 at 5:23





I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.

– Odisseo
Jan 9 at 5:23










4 Answers
4






active

oldest

votes


















5














$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you do in a day. Consider how many distinct categories of behaviors that you do. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing an $R^2$ of 0.11 may be huge. You have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.



A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two, how is that impacting your parameter estimates?



Consider gender as a category with a variable being a count of children they gave birth to. What would the male parameter related to that imply as the regression would calculate one? What if there was an encoding error and some men gave birth?



Consider three possible solutions, either to use an information criterion, though you shouldn't use AIC or BIC you should look for one appropriate to your problem, or consider using Bayes factors or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.



If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons to not use step-wise regression, but a good reason is that it is too large.



Finally, just create a multi-dimensional list of combinations of variables. Look at them. Are any of the counts so small that it is probably creating a poor model? Leave out one variable at a time, is there any set that improves the worst case count variables markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.



If the set is too large to visually construct the categories, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You are wanting as much variability in your model as possible so you are wanting to toss categorical or ordinal variables that are strongly associated.



Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on how the specifics of what you are doing.






share|cite|improve this answer
























  • Thank you, great answer!!

    – Odisseo
    Jan 9 at 6:03



















0














I will give you a stats answer to an ML question.



First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).



** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.categorical_var notation)






share|cite|improve this answer
























  • Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

    – Odisseo
    Jan 9 at 5:20











  • After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

    – Odisseo
    Jan 9 at 5:21



















0














How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.






share|cite|improve this answer
























  • What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

    – Odisseo
    Jan 9 at 5:24



















0














Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.



If you want to stick with regression, I would try to include some interactions between your variables chosen based on the domain expertise, while ensuring that the cells don't get too small.






share|cite|improve this answer


























  • Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

    – Odisseo
    Jan 9 at 5:31











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f386247%2fregression-with-lots-of-categorical-variables%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























4 Answers
4






active

oldest

votes








4 Answers
4






active

oldest

votes









active

oldest

votes






active

oldest

votes









5














$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you do in a day. Consider how many distinct categories of behaviors that you do. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing an $R^2$ of 0.11 may be huge. You have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.



A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two, how is that impacting your parameter estimates?



Consider gender as a category with a variable being a count of children they gave birth to. What would the male parameter related to that imply as the regression would calculate one? What if there was an encoding error and some men gave birth?



Consider three possible solutions, either to use an information criterion, though you shouldn't use AIC or BIC you should look for one appropriate to your problem, or consider using Bayes factors or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.



If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons to not use step-wise regression, but a good reason is that it is too large.



Finally, just create a multi-dimensional list of combinations of variables. Look at them. Are any of the counts so small that it is probably creating a poor model? Leave out one variable at a time, is there any set that improves the worst case count variables markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.



If the set is too large to visually construct the categories, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You are wanting as much variability in your model as possible so you are wanting to toss categorical or ordinal variables that are strongly associated.



Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on how the specifics of what you are doing.






share|cite|improve this answer
























  • Thank you, great answer!!

    – Odisseo
    Jan 9 at 6:03
















5














$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you do in a day. Consider how many distinct categories of behaviors that you do. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing an $R^2$ of 0.11 may be huge. You have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.



A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two, how is that impacting your parameter estimates?



Consider gender as a category with a variable being a count of children they gave birth to. What would the male parameter related to that imply as the regression would calculate one? What if there was an encoding error and some men gave birth?



Consider three possible solutions, either to use an information criterion, though you shouldn't use AIC or BIC you should look for one appropriate to your problem, or consider using Bayes factors or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.



If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons to not use step-wise regression, but a good reason is that it is too large.



Finally, just create a multi-dimensional list of combinations of variables. Look at them. Are any of the counts so small that it is probably creating a poor model? Leave out one variable at a time, is there any set that improves the worst case count variables markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.



If the set is too large to visually construct the categories, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You are wanting as much variability in your model as possible so you are wanting to toss categorical or ordinal variables that are strongly associated.



Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on how the specifics of what you are doing.






share|cite|improve this answer
























  • Thank you, great answer!!

    – Odisseo
    Jan 9 at 6:03














5












5








5







$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you do in a day. Consider how many distinct categories of behaviors that you do. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing an $R^2$ of 0.11 may be huge. You have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.



A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two, how is that impacting your parameter estimates?



Consider gender as a category with a variable being a count of children they gave birth to. What would the male parameter related to that imply as the regression would calculate one? What if there was an encoding error and some men gave birth?



Consider three possible solutions, either to use an information criterion, though you shouldn't use AIC or BIC you should look for one appropriate to your problem, or consider using Bayes factors or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.



If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons to not use step-wise regression, but a good reason is that it is too large.



Finally, just create a multi-dimensional list of combinations of variables. Look at them. Are any of the counts so small that it is probably creating a poor model? Leave out one variable at a time, is there any set that improves the worst case count variables markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.



If the set is too large to visually construct the categories, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You are wanting as much variability in your model as possible so you are wanting to toss categorical or ordinal variables that are strongly associated.



Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on how the specifics of what you are doing.






share|cite|improve this answer













$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you do in a day. Consider how many distinct categories of behaviors that you do. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing an $R^2$ of 0.11 may be huge. You have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.



A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two, how is that impacting your parameter estimates?



Consider gender as a category with a variable being a count of children they gave birth to. What would the male parameter related to that imply as the regression would calculate one? What if there was an encoding error and some men gave birth?



Consider three possible solutions, either to use an information criterion, though you shouldn't use AIC or BIC you should look for one appropriate to your problem, or consider using Bayes factors or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.



If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons to not use step-wise regression, but a good reason is that it is too large.



Finally, just create a multi-dimensional list of combinations of variables. Look at them. Are any of the counts so small that it is probably creating a poor model? Leave out one variable at a time, is there any set that improves the worst case count variables markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.



If the set is too large to visually construct the categories, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You are wanting as much variability in your model as possible so you are wanting to toss categorical or ordinal variables that are strongly associated.



Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on how the specifics of what you are doing.







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered Jan 9 at 5:47









Dave HarrisDave Harris

3,514515




3,514515













  • Thank you, great answer!!

    – Odisseo
    Jan 9 at 6:03



















  • Thank you, great answer!!

    – Odisseo
    Jan 9 at 6:03

















Thank you, great answer!!

– Odisseo
Jan 9 at 6:03





Thank you, great answer!!

– Odisseo
Jan 9 at 6:03













0














I will give you a stats answer to an ML question.



First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).



** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.categorical_var notation)






share|cite|improve this answer
























  • Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

    – Odisseo
    Jan 9 at 5:20











  • After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

    – Odisseo
    Jan 9 at 5:21
















0














I will give you a stats answer to an ML question.



First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).



** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.categorical_var notation)






share|cite|improve this answer
























  • Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

    – Odisseo
    Jan 9 at 5:20











  • After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

    – Odisseo
    Jan 9 at 5:21














0












0








0







I will give you a stats answer to an ML question.



First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).



** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.categorical_var notation)






share|cite|improve this answer













I will give you a stats answer to an ML question.



First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).



** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.categorical_var notation)







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered Jan 9 at 5:12









Yuval SpieglerYuval Spiegler

1,4141827




1,4141827













  • Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

    – Odisseo
    Jan 9 at 5:20











  • After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

    – Odisseo
    Jan 9 at 5:21



















  • Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

    – Odisseo
    Jan 9 at 5:20











  • After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

    – Odisseo
    Jan 9 at 5:21

















Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

– Odisseo
Jan 9 at 5:20





Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.

– Odisseo
Jan 9 at 5:20













After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

– Odisseo
Jan 9 at 5:21





After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up

– Odisseo
Jan 9 at 5:21











0














How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.






share|cite|improve this answer
























  • What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

    – Odisseo
    Jan 9 at 5:24
















0














How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.






share|cite|improve this answer
























  • What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

    – Odisseo
    Jan 9 at 5:24














0












0








0







How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.






share|cite|improve this answer













How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered Jan 9 at 5:14









AnaAna

12




12













  • What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

    – Odisseo
    Jan 9 at 5:24



















  • What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

    – Odisseo
    Jan 9 at 5:24

















What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

– Odisseo
Jan 9 at 5:24





What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have 50k instances ca. 25 of these features are dummies.

– Odisseo
Jan 9 at 5:24











0














Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.



If you want to stick with regression, I would try to include some interactions between your variables chosen based on the domain expertise, while ensuring that the cells don't get too small.






share|cite|improve this answer


























  • Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

    – Odisseo
    Jan 9 at 5:31
















0














Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.



If you want to stick with regression, I would try to include some interactions between your variables chosen based on the domain expertise, while ensuring that the cells don't get too small.






share|cite|improve this answer


























  • Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

    – Odisseo
    Jan 9 at 5:31














0












0








0







Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.



If you want to stick with regression, I would try to include some interactions between your variables chosen based on the domain expertise, while ensuring that the cells don't get too small.






share|cite|improve this answer















Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.



If you want to stick with regression, I would try to include some interactions between your variables chosen based on the domain expertise, while ensuring that the cells don't get too small.







share|cite|improve this answer














share|cite|improve this answer



share|cite|improve this answer








edited Jan 9 at 5:32

























answered Jan 9 at 5:30









Dimitriy V. MasterovDimitriy V. Masterov

20.5k14092




20.5k14092













  • Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

    – Odisseo
    Jan 9 at 5:31



















  • Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

    – Odisseo
    Jan 9 at 5:31

















Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

– Odisseo
Jan 9 at 5:31





Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well

– Odisseo
Jan 9 at 5:31


















draft saved

draft discarded




















































Thanks for contributing an answer to Cross Validated!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f386247%2fregression-with-lots-of-categorical-variables%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

"Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

Alcedinidae

Origin of the phrase “under your belt”?