Drawing equally-sized samples from differently-sized substrata of a dataframe in R [duplicate]
This question already has an answer here:
Sample n random rows per group in a dataframe
5 answers
Stratified random sampling from data frame
4 answers
I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there are more rows than for others. Here's a mock example:
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)
head(df)
  word position
1    K        1
2    R        5
3    J        2
4    Y        5
5    Z        5
6    U        4
Obviously, the tranches of 'position' are differently sized:
table(df$position)
 1  2  3  4  5
15 15 17 28 25
To make the different tranches more easily comparable I'd like to draw equally sized samples on the variable 'position' within one dataframe. This can theoretically be done in steps, such as these:
df_pos1 <- df[df$position==1,]
df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]
df_pos2 <- df[df$position==2,]
df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]
df_pos3 <- df[df$position==3,]
df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]
df_pos4 <- df[df$position==4,]
df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]
df_pos5 <- df[df$position==5,]
df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]
and so on, to finally combine the individual samples in a single dataframe:
df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)
but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried the code below, but instead of a combination of the individual samples for each position value, it returns a single sample drawn from all values of 'position':
df_samples <- c()
for (i in unique(df$position)) {
  df_samples <- rbind(df[sample(1:nrow(df[df$position==i,]), 3),])
}
df_samples
   word position
13    D        2
2     R        5
12    G        3
4     Y        5
16    Z        3
11    S        3
6     U        4
14    J        3
9     O        5
1     K        1
What's wrong with this code and how can it be improved?
r for-loop sample
asked 11 hours ago
Chris Ruehlemann
marked as duplicate by Henrik
11 hours ago
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
3 Answers
Accepted answer (score 2)
Consider by() to split the data frame by position and sample within each subset. Then rbind all the resulting data frames together in a single call with do.call(), outside any loop.
df_list <- by(df, df$position, function(sub) sub[sample(seq_len(nrow(sub)), 3), ])
final_df <- do.call(rbind, df_list)
Currently you index the entire (not subsetted) data frame in each iteration: sample(1:nrow(df[df$position==i,]), 3) returns numbers between 1 and the size of the subset, and df[...] then looks those numbers up in the full data frame rather than in the subset. Also, you are using rbind inside a for loop, which is memory-intensive and not advised.
Specifically, by is an object-oriented wrapper for tapply applied to data frames: it splits a data frame into subsets by one or more factors and passes each subset to the supplied function. Here sub is just the name of the function argument that receives each subset (it can be named anything). The result is an object of class "by", which can be treated as a list of data frames.
do.call builds and runs a single function call from a list of arguments, so rbind(df1, df2, df3) is equivalent to do.call(rbind, list(df1, df2, df3)). The key point is that rbind is not called inside a loop (avoiding the danger of growing a complex object like a data frame on every iteration) but once, outside the loop.
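The same split, sample, bind-once pattern can also be spelled out with split() and lapply(), which makes each intermediate step visible. This is only a sketch: the seed value and the names groups and samples are assumptions for illustration, and it presumes every position value has at least 3 rows.

```r
set.seed(42)  # assumed seed, only so the example is reproducible

# mock data as in the question
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)

# split() returns a plain named list with one data frame per position value
groups <- split(df, df$position)

# draw 3 rows from each subset, indexing the subset itself (not df)
samples <- lapply(groups, function(sub) sub[sample(seq_len(nrow(sub)), 3), ])

# bind once, after all samples exist
final_df <- do.call(rbind, samples)

# every position value should now appear exactly 3 times
table(final_df$position)
```

Because rbind receives all subsets in one call, nothing is grown incrementally, which is the point made about do.call above.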
edited 7 hours ago
answered 11 hours ago
Parfait
Could you maybe comment on the key elements of the code such as 'by', 'sub', and 'do-call'? Much appreciated!
– Chris Ruehlemann
9 hours ago
Answer (score 0)
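Each time you run the loop you are overwriting the last entry. Start from an empty data frame and append to it in each iteration. The sampled indices must also be the row numbers of the rows for position i, not numbers between 1 and the size of the subset (this assumes every position has at least 3 rows). Try:
df_samples <- data.frame()
for (i in unique(df$position)) {
  df_samples <- rbind(df_samples, df[sample(which(df$position == i), 3), ])
}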
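A small check can make the indexing issue in the question's loop concrete. The sketch below assumes the mock df from the question; the seed, the choice of i, and the names n_i, wrong, and right are illustrative only.

```r
set.seed(1)  # assumed seed, only so the example is reproducible

df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)

i <- 5
n_i <- nrow(df[df$position == i, ])  # how many rows have position 5

# indexing as in the question: numbers between 1 and n_i,
# looked up in the FULL data frame
wrong <- df[sample(1:n_i, 3), ]

# indexing by the actual row numbers of the position-5 rows
right <- df[sample(which(df$position == i), 3), ]

# the first variant can only ever reach the first n_i rows of df
all(as.integer(rownames(wrong)) <= n_i)

# the second variant returns only rows with the requested position
all(right$position == i)
```

Both checks return TRUE, which shows that the first variant samples from the top of df regardless of position, while the second stays inside the requested group.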
answered 11 hours ago
xsabatox
Answer (score 0)
We can use data.table: sample the row index .I within each group and use the result to subset the dataset. This should be very efficient. Note that setDT() converts df to a data.table by reference.
library(data.table)
i1 <- setDT(df)[, sample(.I, 3), by = position]$V1
df[i1]
Or use sample_n() from dplyr (part of the tidyverse):
library(tidyverse)
df %>%
  group_by(position) %>%
  sample_n(3)
Or wrap the data.table approach in a function:
f1 <- function(data) {
  data <- as.data.table(data)
  i1 <- data[, sample(.I, 3), by = position]$V1
  data[i1]
}
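For comparison, the idea of sampling row indices per group and then subsetting once can be sketched in base R without extra packages. The helper sample_per_group and its arguments are hypothetical names introduced here, not part of data.table or dplyr, and the seed is an assumption for reproducibility.

```r
# hypothetical helper: draw n rows per value of the column named in `group`
sample_per_group <- function(data, group, n) {
  # row numbers of `data`, split into one integer vector per group value
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  # fail early if any group is too small to supply n rows
  stopifnot(all(lengths(rows_by_group) >= n))
  # pick n positions within each vector, then map them back to row numbers
  picked <- lapply(rows_by_group, function(rows) rows[sample.int(length(rows), n)])
  # subset the original data once with all chosen row numbers
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(123)  # assumed seed
df <- data.frame(
  word = sample(LETTERS, 100, replace = TRUE),
  position = sample(1:5, 100, replace = TRUE)
)

balanced <- sample_per_group(df, "position", 3)
table(balanced$position)
```

Sampling positions with sample.int() rather than calling sample() on the row numbers directly sidesteps the case where sample() treats a single number as the range 1:x.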
edited 11 hours ago
answered 11 hours ago
akrun