R - Block sampling: generate new unique IDs after sampling?












I have data that are grouped into blocks, or clusters. I would like to generate a number of bootstrap samples from these data for model evaluation, sampling the blocks/clusters with replacement. However, this creates a dilemma at the analysis stage, because the resampled data contain repeated block/cluster identifiers.



For example, say my data looks like this:



set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))


In practice I will be performing a number of bootstrap samples, but for didactic purposes let's say I only want a single new dataset, where I have randomly selected IDs with replacement from the original dataset, above, as follows:



library(data.table)

test <- as.data.table(test)
setkey(test, 'block')
# Draw block IDs with replacement, then pull every row of each drawn block
random.block <- sample(unique(test$block), size=10, replace=TRUE)
random.sample <- test[J(random.block), allow.cartesian=TRUE]


This works as intended: it creates a new dataset of the same size as the original dataset, but where the blocks have been randomly sampled with replacement.



The problem is this: in the original dataset, each block has only 5 observations (for the record, in my real dataset the number of observations per block varies). In the new dataset, each sampled block still has only 5 observations, but because I sampled with replacement, several sampled blocks now share the same ID number.



In the new dataset, if I run any analysis that is stratified by or contingent upon the block identification number (e.g. something as simple as the average of the X variables per block, or something more complicated like a mixed model with a random effect on block), it treats the repetitions of a block ID as a single block. So instead of, say, 3 different blocks of size 5, it gives me one block of size 15. This can have profound effects on the analysis, not to mention the interpretation of any results.
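To make the collapse concrete, here is a minimal sketch in base R (chosen so it runs without extra packages; the thread itself uses data.table). It rebuilds the example data, resamples the blocks, and compares the number of blocks drawn with the number of groups an aggregation actually sees. The names `rows.per.id`, `block.means`, `n.drawn`, and `n.groups` are illustrative and not from the original post.

```r
# Rebuild the example data from the question.
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))

# Sample block IDs with replacement and stack the rows of each draw.
random.block <- sample(unique(test$block), size = 10, replace = TRUE)
random.sample <- do.call(rbind, lapply(random.block, function(b) test[test$block == b, ]))

# Rows per block ID: an ID drawn k times contributes 5 * k rows to one group.
rows.per.id <- table(random.sample$block)

# Per-block means are computed once per distinct ID, not once per draw.
block.means <- aggregate(X1 ~ block, data = random.sample, FUN = mean)

n.drawn  <- length(random.block)   # blocks actually sampled (10)
n.groups <- nrow(block.means)      # groups the analysis sees
```

Whenever any ID is drawn more than once, `n.groups` is smaller than `n.drawn`, which is exactly the merging of repeated blocks described above.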



The question I have: how could I assign a new unique block ID in my randomly sampled dataset, such that after sampling with replacement each draw of a block has its own identifier and is treated as a separate block in the final analysis, rather than being pooled into a single larger block? I can think of ad hoc ways of doing this (e.g. if each block has the same number of observations), but nothing simple or generalizable.
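For reference, the ad hoc approach for equal-sized blocks might look like the sketch below, written in base R. It assumes every block has exactly 5 rows and that the resampled rows are stacked in draw order; the names `block.size` and `new.block` are hypothetical. It stops working as soon as block sizes differ, which is why it does not generalize.

```r
# Rebuild the example data and one resample, stacking rows in draw order.
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
random.block <- sample(unique(test$block), size = 10, replace = TRUE)
random.sample <- do.call(rbind, lapply(random.block, function(b) test[test$block == b, ]))

# Ad hoc relabelling: valid only because every block has exactly 5 rows
# and the rows of each draw sit next to each other in draw order.
block.size <- 5
random.sample$new.block <- rep(seq_along(random.block), each = block.size)
```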










asked Nov 20 '18 at 19:24 by Ryan Simmons

          1 Answer
I think the best way would be to create a data.table that pairs each sampled block with a unique index, and then join it to the data on the key. First, the setup from your question:



library(data.table)

set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
test <- as.data.table(test)
setkey(test, 'block')
random.block <- sample(unique(test$block), size=10, replace=TRUE)
# Original approach, kept for comparison
random.sample.orig <- test[J(random.block), allow.cartesian=TRUE]


So instead of joining on the bare vector, you create a table that also carries an index id, one per draw:



rand.tab <- data.table(block=random.block, id=seq_along(random.block))


Then join it with test and, if you need to, overwrite block with the id:



random.sample <- test[J(rand.tab), allow.cartesian=TRUE]
random.sample[, block := id]   # each draw now has its own block ID
random.sample[, id := NULL]
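The same lookup-table idea can be sketched without data.table, using base R's `merge()`. This is an illustrative equivalent rather than the code from this answer; the column name `orig.block` is an assumption added here to keep track of which source block each draw came from.

```r
# Rebuild the example data and draw block IDs with replacement.
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
random.block <- sample(unique(test$block), size = 10, replace = TRUE)

# Lookup table: one row per draw, so a block drawn twice gets two ids.
rand.tab <- data.frame(block = random.block, id = seq_along(random.block))

# merge() returns every (draw, observation) pair that shares a block value.
random.sample <- merge(rand.tab, test, by = "block")

# Keep the source block for reference and use the draw index as the block ID.
names(random.sample)[names(random.sample) == "block"] <- "orig.block"
names(random.sample)[names(random.sample) == "id"] <- "block"
random.sample <- random.sample[order(random.sample$block), ]
```

Because the join is driven by the lookup table rather than by block size, this form should work unchanged when blocks have different numbers of observations.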


To check that the rows are the same as in your original version:



all(random.sample$X1 == random.sample.orig$X1 &
    random.sample$X2 == random.sample.orig$X2 &
    random.sample$X3 == random.sample.orig$X3)
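Since the question mentions unequal block sizes and repeated bootstrap samples, here is a rough base R sketch of how the relabelling could be wrapped in a helper and called in a loop. The function `resample_blocks`, the column `orig.block`, the toy data, and the choice of statistic (mean of per-block means) are all hypothetical illustrations, not part of the original thread.

```r
# Hypothetical helper: resample whole blocks with replacement and relabel
# each draw with its draw index, keeping the source ID in orig.block.
resample_blocks <- function(data, block.col = "block") {
  ids <- unique(data[[block.col]])
  # Sample positions rather than values, so a single numeric ID is not
  # misread by sample() as "draw from 1:x".
  drawn <- sample.int(length(ids), size = length(ids), replace = TRUE)
  pieces <- lapply(seq_along(drawn), function(i) {
    piece <- data[data[[block.col]] == ids[drawn[i]], , drop = FALSE]
    piece$orig.block <- piece[[block.col]]
    piece[[block.col]] <- rep(i, nrow(piece))
    piece
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy data with unequal block sizes (2, 3, and 4 rows).
set.seed(42)
toy <- data.frame(block = rep(c("a", "b", "c"), times = c(2, 3, 4)),
                  x = rnorm(9))

# One resample: three relabelled blocks, each a copy of one source block.
boot <- resample_blocks(toy)

# Repeated resampling: bootstrap distribution of the mean of per-block means.
boot.stat <- replicate(200, {
  b <- resample_blocks(toy)
  mean(tapply(b$x, b$block, mean))
})
boot.se <- sd(boot.stat)
```

Stacking pieces with `rbind` is simple but slow for large data; for many replicates on a big table, the keyed data.table join is likely the better fit.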





• Nice solution! Simple and elegant. I had figured the solution would look something like that, but hadn't progressed beyond a rather messy/crude do loop. Nice that the solution is in data.table. Thanks! – Ryan Simmons, Nov 21 '18 at 1:01

• My first selected answer :) Yes, the data.table package is excellent. – rookie, Nov 21 '18 at 10:23











answered Nov 20 '18 at 20:14 by rookie












