Training Error is Lower than Testing error in a Random Forest Model



























I have been working on a machine learning model and I am unsure which model to choose, or whether there is another technique I should try. I am using a Random Forest to predict the propensity to convert on a highly imbalanced data set. The class balance for the target variable is given below.



   label      count
0    0.0  1,021,095
1    1.0      4,459


I trained two models, one using upsampling and one using undersampling. Below is the code I am using for each.



# Split into training and test sets; the test set keeps the original class balance
train_initial, test = new_data.randomSplit([0.7, 0.3], seed=2018)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()

# Sampling techniques: apply only one of the two below
# Upsampling: replicate the minority class about 100x by sampling with replacement
df_class_0 = train_initial[train_initial['label'] == 0]
df_class_1 = train_initial[train_initial['label'] == 1]
df_class_1_over = df_class_1.sample(True, 100.0, seed=99)
train_up = df_class_0.union(df_class_1_over)
train_up.groupby('label').count().toPandas()

# Downsampling: keep every positive and a 3091/714840 fraction of the negatives
stratified_train = train_initial.sampleBy('label', fractions={0: 3091./714840, 1: 1.0}).cache()
stratified_train.groupby('label').count().toPandas()
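
The downsampling fractions above are hard-coded from the class counts. A minimal sketch of deriving them from the counts instead, written in plain Python so it runs without Spark. `downsample_fractions` is a hypothetical helper name, and the counts passed in are simply the ones implied by the fraction `3091./714840` in the snippet above.

```python
def downsample_fractions(class_counts, ratio=1.0):
    """Return per-class sampling fractions for a sampleBy-style call.

    class_counts: dict mapping label -> number of rows in the training split.
    ratio: desired (majority rows kept) / (minority rows); 1.0 means balanced.
    The minority class is kept in full; every other class is sampled down.
    """
    if not class_counts:
        raise ValueError("class_counts must not be empty")
    minority_count = min(class_counts.values())
    fractions = {}
    for label, count in class_counts.items():
        target = minority_count * ratio
        # Never ask for more than 100% of a class
        fractions[label] = min(1.0, target / count)
    return fractions


# Counts implied by the hard-coded fraction in the question
fractions = downsample_fractions({0: 714840, 1: 3091})
print(fractions)
```

The resulting dict could be passed as `fractions=` to `sampleBy`, with the counts collected from `train_initial.groupby('label').count()`. Since `sampleBy` samples each row independently, the resulting class sizes are approximate rather than exact.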


Below is how I am training my model



from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Index the label column (fitted on the full data set)
labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel').fit(new_data)

# Treat features with at most 2 distinct values as categorical
featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='indexedFeatures',
                               maxCategories=2).fit(new_data)

rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Map predicted indices back to the original labels
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain the indexers and the forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])

# Grid over numTrees, impurity and maxDepth
paramGrid = (ParamGridBuilder()
             .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000])
             .addGrid(rf_model.impurity, ['entropy', 'gini'])
             .addGrid(rf_model.maxDepth, [2, 3, 4, 5])
             .build())

# Set up 5-fold cross-validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)

# Fit on train_up for the upsampling run, or on stratified_train for the undersampling run
train_model = crossval.fit(train_up)


Below are the results from both the methods



#UpSampling - Training                                 
Train Error = 0.184633
precision: 0.8565508112679312
recall: 0.6597217024736883
auroc: 0.9062348758176568
f1 : 0.7453609484359377

#Upsampling - Test
Test Error = 0.0781619
precision: 0.054455645977569946
recall: 0.6503868471953579
auroc: 0.8982212236597943
f1 : 0.10049688048716704

#UnderSampling - Training
Train Error = 0.179293
precision: 0.8468290542023261
recall: 0.781807131280389
f1 : 0.8130201200884863
auroc: 0.9129391668636556

#UnderSampling - Test
Test Error = 0.147874
precision: 0.034453223699706645
recall: 0.778046421663443
f1 : 0.06598453935901905
auroc: 0.8989720777537427


From various posts on StackOverflow, I understand that a test error lower than the training error usually points to a mistake in the implementation. However, I cannot see where I am going wrong in training my models. Also, which sampling method is better for such a highly imbalanced class? With undersampling, I am worried about losing information.



I was hoping someone could help me with this model and clear up my doubts.



Thanks a lot in advance!










      machine-learning random-forest sampling
















      asked Nov 21 '18 at 21:27









      Tushar Mehta

























          1 Answer














          A test error lower than the training error does not necessarily mean there is an error in the implementation. If you train the model for more iterations, then depending on your dataset the training error may drop below the test error, although you then risk overfitting. The goal should therefore also be to check other performance metrics on the test set, such as accuracy, precision and recall.
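
One way such a pattern can arise without any bug is worth illustrating. This is an assumption, not something confirmed in this thread: if the training metrics are computed on the resampled data and the test metrics on the original imbalanced data, the same classifier yields a different error rate and precision simply because the class shares differ. The sketch below holds recall and false positive rate fixed and varies only the share of positives. The two rates and the function name `metrics_at_prevalence` are illustrative choices, not values taken from the question.

```python
def metrics_at_prevalence(recall, false_positive_rate, prevalence):
    """Error rate and precision of a fixed classifier at a given positive-class share."""
    true_pos = recall * prevalence
    false_neg = (1.0 - recall) * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    error = false_neg + false_pos
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos > 0 else float("nan")
    return error, precision


# Hypothetical classifier: these two rates are illustrative, not taken from the question
RECALL = 0.65
FALSE_POSITIVE_RATE = 0.05

# Roughly balanced data (as after resampling) versus heavily imbalanced data
for prevalence in (0.5, 0.005):
    error, precision = metrics_at_prevalence(RECALL, FALSE_POSITIVE_RATE, prevalence)
    print(f"prevalence={prevalence:.3f}  error={error:.4f}  precision={precision:.4f}")
```

With these made-up rates, both the error rate and the precision come out lower on the imbalanced data, so a lower test error does not by itself indicate a better fit. Comparing metrics computed on splits with the same class balance, or a prevalence-independent metric such as AUROC, avoids this effect.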



          Oversampling and undersampling are opposite but roughly equivalent techniques. If you have lots of data points, it is better to undersample; otherwise go for oversampling. SMOTE is a good oversampling technique: it creates synthetic data points instead of repeating the same data points multiple times.



          https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
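
To make the idea of synthetic points concrete, here is a simplified sketch of SMOTE-style interpolation in plain Python. It is not the imbalanced-learn implementation: the function name `smote_like_oversample`, the brute-force neighbour search and the toy points are all assumptions made for illustration.

```python
import math
import random


def smote_like_oversample(minority_points, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating between near neighbours.

    Simplified sketch of the SMOTE idea, not the imbalanced-learn implementation:
    pick a minority point, pick one of its k nearest minority neighbours, and
    return a random point on the line segment between the two.
    """
    if len(minority_points) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority_points) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority_points))
        base = minority_points[base_idx]
        # Brute-force nearest neighbours among the other minority points
        others = [p for i, p in enumerate(minority_points) if i != base_idx]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class; the values are made up for illustration
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_synthetic=6, k=2, seed=42)
print(len(new_points), new_points[0])
```

Each synthetic point lies between two existing minority points, which is what distinguishes this from plain replication. Synthetic points should be generated from the training split only, so that the test set still reflects the original data.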



          Another tip: shuffle the data with different seeds and see whether the training error stays greater than the test error. I suspect the variance in your data is high. Read about the bias-variance trade-off.
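
A minimal sketch of that seed experiment, using made-up 1-D toy data and a trivial threshold "model" in plain Python rather than the Random Forest pipeline from the question. All names and numbers here are assumptions for illustration; the point is only the procedure of repeating the split and looking at the spread of the train/test gap.

```python
import random
import statistics


def make_toy_data(n, positive_share, rng):
    """1-D toy data: positives centred at 1.0, negatives at 0.0 (made-up numbers)."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < positive_share else 0
        rows.append((rng.gauss(float(label), 0.8), label))
    return rows


def error_rate(rows, threshold):
    """Share of rows misclassified by the rule: predict 1 if x > threshold."""
    wrong = sum(1 for x, label in rows if (x > threshold) != (label == 1))
    return wrong / len(rows)


def train_test_gap(rows, seed, train_share=0.7):
    """Shuffle with `seed`, split, fit a midpoint threshold, return train - test error."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    train, test = shuffled[:cut], shuffled[cut:]
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    # "Model": threshold halfway between the two class means of the training split
    threshold = (statistics.mean(pos) + statistics.mean(neg)) / 2.0
    return error_rate(train, threshold) - error_rate(test, threshold)


data = make_toy_data(n=2000, positive_share=0.3, rng=random.Random(0))
gaps = [train_test_gap(data, seed) for seed in range(20)]
print("mean gap:", statistics.mean(gaps), "stdev:", statistics.stdev(gaps))
print("splits with train error > test error:", sum(g > 0 for g in gaps), "of", len(gaps))
```

If the sign of the gap flips from seed to seed, the difference between train and test error is within split-to-split noise. If it keeps the same sign across seeds, something systematic, such as a difference in class balance between the two sets, is more likely.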



          Judging by the results, it seems you have built a pretty decent model. Try using XGBoost as well and compare the result with Random Forest.






























          • Thanks a lot for responding. As suggested, I shuffled the data with different seed values and performed downsampling using the same code as in the question. Almost every time, my test error rate was greater than the train error by a margin of 0.05%. What are your thoughts on this change in error rates? I used the same trained model and shuffled the data around.

            – Tushar Mehta
            Nov 22 '18 at 0:49











          answered Nov 21 '18 at 22:30









          sjishan
