R - Block sampling: generate new unique IDs after sampling?
I have data that are grouped into blocks, or clusters. I would like to generate a number of bootstrap samples for model evaluation with this data, where the blocks/clusters are sampled with replacement. However, this puts me in a bit of a dilemma when it comes to the analysis portion, because I have repeats of the block/cluster identifier.
For example, say my data looks like this:
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
In practice I will be performing a number of bootstrap samples, but for didactic purposes let's say I only want a single new dataset, where I have randomly selected IDs with replacement from the original dataset, above, as follows:
library(data.table)
test <- as.data.table(test)
setkey(test, 'block')
random.block <- sample(unique(test$block), size=10, replace=TRUE)
random.sample <- test[J(random.block), allow.cartesian=TRUE]
This works as intended: it creates a new dataset of the same size as the original dataset, but where the blocks have been randomly sampled with replacement.
The problem is this: in the original dataset, each block has only 5 observations (for the record, in my real dataset the number of observations per block varies). In the new dataset, each sampled block still has 5 observations, but because I sampled with replacement, several of the sampled blocks now share the same ID number.
In the new dataset, if I try to run any sort of analysis that is stratified by or contingent on the block identification number (e.g. something as simple as the average of the X variables per block, or more complicated analyses like a mixed model with a random effect on block), it treats the repetitions of a block ID as a single block. So instead of, say, 3 different blocks of size 5, it gives me one block of size 15. This can have profound effects on the analysis, not to mention the interpretation of any results.
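A minimal sketch of the effect described above, reusing the example data from this question. The names counts, draws and check are introduced here only for illustration. Counting rows per block ID shows that a block drawn k times appears as one group of 5 * k rows:

```r
library(data.table)

set.seed(1)
test <- as.data.table(
  data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
)
setkey(test, block)

random.block <- sample(unique(test$block), size = 10, replace = TRUE)
random.sample <- test[J(random.block), allow.cartesian = TRUE]

# Rows per block ID in the resampled data
counts <- random.sample[, .N, by = block]

# How many times each block was actually drawn
draws <- data.table(block = random.block)[, .(times.drawn = .N), by = block]

# Line the two up: N equals 5 * times.drawn, so repeated draws are pooled
check <- counts[draws, on = "block"]
print(check)
```

Any grouped summary or random effect keyed on block would see these pooled groups rather than the individual draws.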
My question: how can I assign a new unique block ID in my randomly sampled dataset, so that after sampling with replacement each draw of a block has its own identifier and is treated as a separate block in my final analysis, rather than being merged into a single larger block? I can think of ad hoc ways of doing this (e.g. if each block has the same number of observations), but nothing simple or generalizable.
r
asked Nov 20 '18 at 19:24
Ryan Simmons
1 Answer
I think the best way would be to create a data.table with an index based on the key. You can then merge based on the key:
library(data.table)
set.seed(1)
test <- data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
test
test <- as.data.table(test)
setkey(test, 'block')
random.block <- sample(unique(test$block), size=10, replace=TRUE)
random.sample.orig <- test[J(random.block), allow.cartesian=TRUE]
So instead of joining on the bare vector, create a table that pairs each draw with an index id:
rand.tab <- data.table(block = random.block, id = seq_along(random.block))
Then join it with test and, if you need to, overwrite block with id:
random.sample <- test[J(rand.tab), allow.cartesian=TRUE]
random.sample[,block := id]
random.sample[,id := NULL]
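An equivalent way to write the same join, as a sketch that assumes you would rather keep the original ID than overwrite it. It uses the on= argument instead of a key, and the column names new.block and orig.block are made up for this example:

```r
library(data.table)

set.seed(1)
test <- as.data.table(
  data.frame(block = rep(1:10, each = 5), matrix(rnorm(150), ncol = 3))
)

random.block <- sample(unique(test$block), size = 10, replace = TRUE)

# One row per draw: 'block' is the original ID, 'new.block' is unique per draw
rand.tab <- data.table(block = random.block,
                       new.block = seq_along(random.block))

# For each row of rand.tab, pull the matching rows of test
random.sample <- test[rand.tab, on = "block", allow.cartesian = TRUE]

# Keep the original ID under a different name for traceability
setnames(random.sample, "block", "orig.block")

# Every draw is now its own group of 5 rows
print(random.sample[, .N, by = .(new.block, orig.block)])
```

Keeping orig.block around makes it easy to see which source block each resampled block came from, while new.block is what you would pass to the grouped analysis.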
To check that the rows are the same as in your original version:
all(random.sample$X1 == random.sample.orig$X1 &
random.sample$X2 == random.sample.orig$X2 &
random.sample$X3 == random.sample.orig$X3)
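Since the question mentions unequal block sizes and many bootstrap samples, here is a rough sketch of how the same idea could be wrapped in a function. block_bootstrap and boot.block are hypothetical names, and the toy data with block sizes 2, 5, 3, 7 and 4 is invented for the example:

```r
library(data.table)

# Hypothetical helper: one block-bootstrap replicate with fresh block IDs
block_bootstrap <- function(dt, block.col = "block") {
  ids <- unique(dt[[block.col]])
  # Sample positions rather than values, so a single numeric ID is never
  # interpreted by sample() as the range 1:ID
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  lookup <- data.table(drawn, seq_along(drawn))
  setnames(lookup, c(block.col, "boot.block"))
  # allow.cartesian is needed because repeated draws can make the result
  # larger than the input
  dt[lookup, on = block.col, allow.cartesian = TRUE]
}

# Toy data with unequal block sizes (assumption, not from the question)
set.seed(42)
sizes <- c(2, 5, 3, 7, 4)
dat <- data.table(block = rep(seq_along(sizes), times = sizes),
                  x = rnorm(sum(sizes)))

# Three bootstrap replicates
boots <- lapply(1:3, function(b) block_bootstrap(dat))

# Each new block should have exactly as many rows as the block it was drawn from
orig.sizes <- dat[, .(orig.n = .N), by = block]
new.sizes <- boots[[1]][, .(n = .N), by = .(boot.block, block)]
check <- merge(new.sizes, orig.sizes, by = "block")
print(check)
```

In each replicate, boot.block is the identifier to use for grouping or as the random effect, and block still records where the rows came from.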
Nice solution! Simple and elegant. I had figured the solution would look something like that, but hadn't progressed beyond a rather messy/crude do loop. Nice that the solution is in data.table. Thanks!
– Ryan Simmons
Nov 21 '18 at 1:01
My first selected answer :) Yes the data.table package is excellent.
– rookie
Nov 21 '18 at 10:23
answered Nov 20 '18 at 20:14
rookie