Regression with Lots of Categorical Variables
I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature.
I am not getting a very good R2 at all. I am wondering whether, aside from creating dummies, there are any special strategies for situations with so many categorical features.
For example, are any regressors better suited to dummy variables?
regression categorical-data categorical-encoding
asked Jan 9 at 4:47 by Odisseo
What do you want to accomplish with this regression?
– Dimitriy V. Masterov
Jan 9 at 5:04
@DimitriyV.Masterov I would like to predict a continuous numeric variable. Thanks!
– Odisseo
Jan 9 at 5:13
How much data do you have in terms of observations and non-excluded dummies?
– Dimitriy V. Masterov
Jan 9 at 5:15
I have about 50k instances with 15 original continuous features to which I added some 25 dummy variables which originated from 10 categorical features.
– Odisseo
Jan 9 at 5:23
4 Answers
$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you engage in each day. Consider how many distinct categories of behavior that covers. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and languages. Consider the various incentives in place. Now think about the population size. Depending on what you are doing, an $R^2$ of 0.11 may be huge: you have accounted for 11% of the variation in observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.
A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Now imagine a joint category with only one or two observations: how is that affecting your parameter estimates?
Consider gender as a category alongside a variable that counts the children a person has given birth to. The regression would still estimate a parameter for males; what would that parameter imply? What if there were an encoding error and some men were recorded as having given birth?
Consider three possible solutions: an information criterion (though you shouldn't use AIC or BIC; you should look for one appropriate to your problem), Bayes factors, or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian models do not appear to be what you are doing, because Bayesian hypotheses are combinatoric. They are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.
If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons not to use step-wise regression, but a good reason to use it is that the set is too large to search any other way.
Finally, just create a multi-dimensional table of counts for the combinations of your categorical variables and look at it. Are any of the counts so small that they are probably creating a poor model? Leave out one variable at a time: does any reduced set markedly improve the worst-case counts? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.
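A minimal sketch of that cell-count check in Python with pandas (assumed here because pandas' get_dummies comes up in the comments). The column names `region` and `segment`, the toy data, and the `min_count` threshold are all made up for illustration; substitute your own categorical columns.

```python
import pandas as pd

# Hypothetical toy data; replace with your own categorical columns.
df = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south", "west"],
    "segment": ["retail", "retail", "retail", "b2b", "b2b", "b2b"],
})

# Joint counts for every combination of the two categories
# (combinations that never occur show up as 0).
counts = pd.crosstab(df["region"], df["segment"])

# Flag cells with too few observations to estimate anything reliably.
min_count = 2  # arbitrary threshold chosen for this example
sparse_cells = [
    (row, col, int(counts.loc[row, col]))
    for row in counts.index
    for col in counts.columns
    if counts.loc[row, col] < min_count
]

print(counts)
print(sparse_cells)
```

With more than two categorical features, the same idea can be applied pairwise, or by passing a list of columns to `pd.crosstab`, though the table quickly becomes hard to read by eye.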
If the set is too large to construct and inspect the categories visually, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You want as much variability in your model as possible, so you want to drop categorical or ordinal variables that are strongly associated with one another.
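The answer does not name a specific measure of association, so as an assumption this sketch uses Cramér's V, a common choice for pairs of categorical variables (0 means no association, 1 means one variable determines the other). The function name `cramers_v` and the toy lists are hypothetical, and the version shown applies no small-sample bias correction.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical sequences (no bias correction)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # assumes each variable has at least two levels
    return float(np.sqrt(chi2 / (n * k)))

a = ["x", "x", "y", "y"] * 25
b = ["p", "p", "q", "q"] * 25  # carries exactly the same information as a
c = ["u", "v", "u", "v"] * 25  # unrelated to a

v_ab = cramers_v(a, b)  # strongly associated pair: candidate for dropping one
v_ac = cramers_v(a, c)  # unassociated pair: both add variability
print(v_ab, v_ac)
```

Pairs with a value close to 1 are the ones where keeping both variables adds little.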
Unfortunately, there isn't an automatic clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended because overlaps in categorical variables can have the effect of forcing orthogonality, depending on the specifics of what you are doing.
Thank you, great answer!!
– Odisseo
Jan 9 at 6:03
I will give you a stats answer to an ML question.
First, a good $R^2$ need not be high (whatever that means); it really depends on your initial hypothesis, and sometimes there is only so much you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much; how much depends greatly on your $N$. One indication is to compare the $R^2$ to the adjusted $R^2$, which is adjusted downwards to account for the number of explanatory variables. If the two are similar, the number of variables is small relative to $N$ and is unlikely to be the problem. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories can be too much. There is something to be said for theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left alone? e.g., multilevel models, clustering). Finally, grouping the categories of a categorical variable can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).
** As a side note, you don't have to manually encode categorical variables as dummies. Most statistical software has a way to handle this (e.g., in R you can use factors; in Stata, the i.categorical_var notation).
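For the comparison between $R^2$ and adjusted $R^2$ mentioned above, here is a small sketch using the standard adjustment formula. The inputs are roughly the figures from the comments (about 50k rows, about 40 features, an $R^2$ of 0.32), treated as if they came from a single linear model, which is an assumption; the 100-row case is invented purely for contrast.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

r2 = 0.32
adj_large_n = adjusted_r2(r2, n_obs=50_000, n_features=40)  # about 0.3195
adj_small_n = adjusted_r2(r2, n_obs=100, n_features=40)     # negative

print(adj_large_n, adj_small_n)
```

With 50k rows the penalty for 40 predictors is negligible, so the gap between the two values is tiny; with only 100 rows the same 40 predictors would push the adjusted value below zero.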
Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor, at 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.
– Odisseo
Jan 9 at 5:20
After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up
– Odisseo
Jan 9 at 5:21
How about doing a PCA (Principal Component Analysis)? It allows us to summarize and visualize the information in a dataset by extracting important information from a multivariate data table by giving a few new variables. It is a dimension reduction technique.
What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and have only about 50k instances; about 25 of these features are dummies.
– Odisseo
Jan 9 at 5:24
Regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try using a random forest on your problem. This method will handle automatic interaction/non-linearity detection for you and is not too difficult to use off the shelf.
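As a rough illustration of the point about automatic interaction detection, the sketch below compares a plain linear model with a random forest on synthetic dummy data whose signal is a pure interaction. scikit-learn is assumed (the asker's GradientBoosting Regressor suggests it is already in use), and the data-generating process, seeds, and hyperparameters are invented for the example rather than tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
X = rng.integers(0, 2, size=(n, 3)).astype(float)  # three synthetic dummy columns
# The target depends only on whether the first two dummies differ,
# so neither column has a visible effect on its own.
y = 3.0 * (X[:, 0] != X[:, 1]) + rng.normal(0.0, 0.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Out-of-sample R^2 for each model.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
print(linear_r2, forest_r2)
```

On data like this the forest recovers the interaction without being told about it, while the additive linear model cannot. Real data will rarely be this clear-cut, so treat the gap as an illustration only.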
If you want to stick with regression, I would try to include some interactions between your variables, chosen based on domain expertise, while ensuring that the cells don't get too small.
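A minimal sketch of adding an interaction term to an ordinary least-squares fit, using NumPy only. The dummies `a` and `b`, the helper `ols_r2`, and the synthetic target are hypothetical; the data are constructed so that the effect of one dummy flips sign with the other, which is the kind of hidden association an interaction term can expose.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
a = rng.integers(0, 2, n)  # dummy for hypothetical category A
b = rng.integers(0, 2, n)  # dummy for hypothetical category B
# The effect of a flips sign depending on b, so main effects alone see almost nothing.
y = (2 * a - 1) * (2 * b - 1) + rng.normal(0.0, 0.5, n)

def ols_r2(columns, y):
    """In-sample R^2 of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()

r2_main = ols_r2([a, b], y)                # main effects only
r2_interaction = ols_r2([a, b, a * b], y)  # main effects plus a*b

# Check that no cell of the interaction is too small to estimate.
cell_sizes = {
    (i, j): int(np.sum((a == i) & (b == j))) for i in (0, 1) for j in (0, 1)
}
print(r2_main, r2_interaction, cell_sizes)
```

The `cell_sizes` check mirrors the caveat about small cells: an interaction between two many-level categories can easily produce combinations with only a handful of rows.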
Thank you, I will try it again with a grid search and see how it does. I had tried it with no grid search and it didn’t do too well
– Odisseo
Jan 9 at 5:31
answered Jan 9 at 5:47 by Dave Harris (the answer beginning "$R^2$ isn't a good measure of model quality")
Thank you, great answer!!
– Odisseo
Jan 9 at 6:03
add a comment |
Thank you, great answer!!
– Odisseo
Jan 9 at 6:03
Thank you, great answer!!
– Odisseo
Jan 9 at 6:03
Thank you, great answer!!
– Odisseo
Jan 9 at 6:03
add a comment |
I will give you a stats answer to an ML question.
First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).
** just as a side note, you don't have to manually encode categorical variables to dummies. In all languages there are ways to handle this (e.g., in R you can use factors, in Stata the i.
categorical_var notation)
Thank you, yeah I definitely used the pandas get_dummies method. In any case, the highest R2 I got was with polynomial features and a GradientBoosting Regressor ad 0.32... all my other models are very low. I am doing some grid search cross validation with different alphas but I think it will be hard to move beyond this limit.
– Odisseo
Jan 9 at 5:20
After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think it’s that much. In any case, I also tried calculating a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable so I might give up
– Odisseo
Jan 9 at 5:21
add a comment |
I will give you a stats answer to an ML question.
First, a good $R^2$ can be not high (whatever that means), it really depends on your initial hypothesis, and sometimes there is just so much that you can do with the data you have (i.e., you have poor explanatory variables). Second, too many categories (that is, explanatory variables or features) can constrain the model too much, this is greatly dependent on your $N$. An indication to this can be to compare the $R^2$ to the $adjusted$ $R^2$ - the latter downwards adjusted to account for too many explanatory variables. If they are similar, your variables are not improving prediction by a lot. If there is a large difference, you might want to cut back (see next). Third, from a theoretical point of view, too many categories (can) be too much. There is something to be said of theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left along? e.g., multilevel models, clustering..). Finally, grouping the categorical variable categories can also help with managing interactions. Often the main effect can be small, but an interaction will show an important association that was hidden (also improving the $R^2$ in the process).
I will give you a stats answer to an ML question.
First, a good $R^2$ need not be high (whatever "high" means). It really depends on your initial hypothesis, and sometimes there is only so much you can do with the data you have (i.e., you have poor explanatory variables).
Second, too many categories (that is, explanatory variables or features) can constrain the model too much; how much depends greatly on your $N$. One indication of this is to compare the $R^2$ to the adjusted $R^2$, which is adjusted downwards to account for the number of explanatory variables. If the two are similar, the number of variables is costing you little relative to your $N$. If there is a large difference, you might want to cut back (see next).
Third, from a theoretical point of view, too many categories can simply be too much. There is something to be said for theoretically backed model parsimony. That is, try logically grouping categories (is there a specific reason that they should be left alone? e.g., multilevel models, clustering).
Finally, grouping the categories of a categorical variable can also help with managing interactions. Often the main effect is small, but an interaction reveals an important association that was hidden (also improving the $R^2$ in the process).
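To put numbers on the $R^2$ versus adjusted $R^2$ comparison, here is a minimal sketch in plain Python. The function name `adjusted_r2` is only illustrative, and the inputs (an $R^2$ of 0.32, 50,000 rows, 40 predictors) are rough figures taken from the comments in this thread, not the output of a fitted model.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


r2 = 0.32  # assumed value, taken from the thread for illustration

# Same R^2 and the same 40 predictors, but very different sample sizes.
large_n = adjusted_r2(r2, n_obs=50_000, n_predictors=40)
small_n = adjusted_r2(r2, n_obs=100, n_predictors=40)

print(f"n=50,000: adjusted R^2 = {large_n:.4f}")
print(f"n=100:    adjusted R^2 = {small_n:.4f}")
```

With tens of thousands of rows the adjusted value stays within about 0.001 of the raw $R^2$, whereas with only 100 rows the same 40 predictors push the adjusted value below zero.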
** As a side note, you don't have to encode categorical variables as dummies by hand. Most statistical software can handle this for you (e.g., in R you can use factors; in Stata, the i.categorical_var notation).
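In Python, pandas offers the same convenience through `get_dummies`. A minimal sketch, assuming a toy DataFrame whose column names (`size`, `color`) are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical feature.
df = pd.DataFrame({
    "size": [1.0, 2.5, 3.0, 4.2],
    "color": ["red", "green", "blue", "green"],
})

# drop_first=True drops the first level of each encoded column, which
# avoids perfect collinearity with the intercept in a linear model.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)

print(list(encoded.columns))
```

Here the levels are sorted alphabetically, so `blue` becomes the reference category and the remaining columns are `color_green` and `color_red`.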
answered Jan 9 at 5:12
Yuval Spiegler
Thank you, yeah, I used the pandas get_dummies method. In any case, the highest R2 I got was 0.32, with polynomial features and a GradientBoostingRegressor; all my other models score much lower. I am doing some grid search cross-validation with different alphas, but I think it will be hard to move beyond this limit.
– Odisseo
Jan 9 at 5:20
After using get_dummies I end up with about 25 dummy features and 15 original numeric features. I don’t think that’s too many. In any case, I also calculated a Pearson correlation matrix earlier and none of the original features had any correlation with the target variable, so I might give up.
– Odisseo
Jan 9 at 5:21
How about doing a PCA (Principal Component Analysis)? It is a dimension reduction technique: it summarizes the information in a multivariate data table with a few new variables (the principal components), which also makes the data easier to visualize.
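A minimal sketch of what this looks like in practice, assuming scikit-learn is available. The data are synthetic (five correlated columns built from two hidden factors), and all variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two hidden factors drive five observed, correlated features.
latent = rng.normal(size=(n, 2))
loadings = np.array([
    [1.0, 0.5, 0.0, 1.0, -1.0],
    [0.0, 1.0, 1.0, -0.5, 0.5],
])
X = latent @ loadings + 0.05 * rng.normal(size=(n, 5))

# Standardize first so no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

print(components.shape)
print(f"variance explained by 2 components: {explained.sum():.3f}")
```

Note that PCA is unsupervised: the components are chosen without looking at the target, so fewer dimensions do not by themselves imply a better $R^2$.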
answered Jan 9 at 5:14
Ana
What would be the goal of using PCA? I don’t have that many features (roughly 40 after the augmentation) and only have about 50k instances. About 25 of these features are dummies.
– Odisseo
Jan 9 at 5:24
Linear regression is not the best tool if you want good out-of-sample predictive performance on stable data. I would try a random forest on your problem. It detects interactions and non-linearities automatically and is not too difficult to use off the shelf.
If you want to stick with regression, I would try including some interactions between your variables, chosen based on domain expertise, while making sure that the resulting cells don't get too small.
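Both suggestions can be illustrated with a small sketch, assuming scikit-learn and pandas. The data are synthetic and deliberately built so that the effect of `x` flips sign between two groups; the column names (`group`, `x`, `x_if_a`) and the helper `fit_and_score` are hypothetical, and real data will rarely be this clean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n),
    "x": rng.normal(size=n),
})
# Pure interaction: slope of x is +1 in group "a" and -1 in group "b".
sign = np.where(df["group"] == "a", 1.0, -1.0)
y = sign * df["x"] + rng.normal(scale=0.1, size=n)

# Hand-made interaction term for the linear model.
df["x_if_a"] = df["x"] * (df["group"] == "a")

train, test, y_train, y_test = train_test_split(df, y, random_state=0)


def fit_and_score(estimator, cols):
    """One-hot encode `group`, pass the other columns through, return test R^2."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["group"])],
        remainder="passthrough",
    )
    model = make_pipeline(encode, estimator)
    model.fit(train[cols], y_train)
    return model.score(test[cols], y_test)


linear_main_r2 = fit_and_score(LinearRegression(), ["group", "x"])
forest_r2 = fit_and_score(
    RandomForestRegressor(n_estimators=100, random_state=0), ["group", "x"]
)
linear_inter_r2 = fit_and_score(LinearRegression(), ["group", "x", "x_if_a"])

print(f"linear, main effects only: {linear_main_r2:.3f}")
print(f"random forest:             {forest_r2:.3f}")
print(f"linear with interaction:   {linear_inter_r2:.3f}")
```

On this toy data the main-effects linear model explains almost nothing, while the forest finds the interaction on its own and the linear model recovers it once the interaction term is supplied explicitly.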
edited Jan 9 at 5:32
answered Jan 9 at 5:30
Dimitriy V. Masterov
Thank you, I will try it again with a grid search and see how it does. I had tried it without a grid search and it didn’t do too well.
– Odisseo
Jan 9 at 5:31