Drawing equally-sized samples from differently-sized substrata of a dataframe in R [duplicate]

This question already has an answer here:

Sample n random rows per group in a dataframe

5 answers

Stratified random sampling from data frame

4 answers

I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there are more rows than for others. Here's a mock example:

df <- data.frame(

  word = sample(LETTERS, 100, replace = T),

  position = sample(1:5, 100, replace = T)

)

head(df)

  word position

1    K        1

2    R        5

3    J        2

4    Y        5

5    Z        5

6    U        4

Obviously, the tranches of 'position' are differently sized:

table(df$position)

 1  2  3  4  5 

15 15 17 28 25

To make the different tranches more easily comparable I'd like to draw equally sized samples on the variable 'position' within one dataframe. This can theoretically be done in steps, such as these:

df_pos1 <- df[df$position==1,]

df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]



df_pos2 <- df[df$position==2,]

df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]



df_pos3 <- df[df$position==3,]

df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]



df_pos4 <- df[df$position==4,]

df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]



df_pos5 <- df[df$position==5,]

df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]

and so on, to finally combine the individual samples in a single dataframe:

df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)

but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried the code below so far; however, it returns not a combination of the individual samples for each position value but a single sample drawn from all values of 'position':

df_samples <-c()

for(i in unique(df$position)){

   df_samples <- rbind(df[sample(1:nrow(df[df$position==i,]), 3),])

}

df_samples

   word position

13    D        2

2     R        5

12    G        3

4     Y        5

16    Z        3

11    S        3

6     U        4

14    J        3

9     O        5

1     K        1

What's wrong with this code and how can it be improved?

asked 11 hours ago

Chris Ruehlemann

1288

marked as duplicate by Henrik r
11 hours ago

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

r for-loop sample


3 Answers

2 votes, accepted

Consider using by() to split the data frame by position and sample within each subset. Then combine the resulting data frames with a single rbind call via do.call().

df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])



final_df <- do.call(rbind, df_list)

Your loop currently indexes the entire (unsubsetted) data frame in each iteration: the row numbers are drawn from 1:nrow() of the subset but are then applied to df. Also, calling rbind inside a for loop is memory-intensive and not advised.

Specifically,

by is an object-oriented wrapper for tapply applied to data frames: it splits a data frame into subsets by one or more factors and passes each subset to the supplied function. Here sub is just the name of the function argument that receives each subset (it can be named anything). The result is a list of data frames.

do.call builds and runs a single function call from a list of arguments, so rbind(df1, df2, df3) is equivalent to do.call(rbind, list(df1, df2, df3)). The key point is that rbind is called once, outside any loop, rather than repeatedly inside one (which avoids growing a complex object like a data frame on every iteration).
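To make the two steps concrete, here is a minimal self-contained sketch of the by() / do.call() approach. The seed, the fixed group sizes (taken from the table in the question) and the name n_per_group are assumptions added for illustration; they are not part of the original answer.

```r
# Mock data with fixed, unequal group sizes (15, 15, 17, 28, 25 rows per position)
set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  word     = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

n_per_group <- 3  # hypothetical name for the per-group sample size

# by() passes each position subset to the function and returns a list
df_list <- by(df, df$position, function(sub) sub[sample(nrow(sub), n_per_group), ])

# do.call() expands the list into one call: rbind(df_list[[1]], df_list[[2]], ...)
final_df <- do.call(rbind, df_list)

# Each position should now appear n_per_group times
table(final_df$position)
```

Note that sample() without replacement will fail if a group has fewer than n_per_group rows, so real data may need a check on the smallest group size first.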

edited 7 hours ago

answered 11 hours ago

Parfait

47.9k84066

Could you maybe comment on the key elements of the code such as 'by', 'sub', and 'do-call'? Much appreciated!
– Chris Ruehlemann
9 hours ago

0 votes

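Each time the loop runs, you overwrite the previous result instead of appending to it. Initialize an empty data frame before the loop and append to it inside the loop (sampling from the row numbers of the current position, not from the whole data frame):

df_samples <- data.frame()

df_samples <- rbind(df_samples, df[sample(which(df$position == i), 3), ])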
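A complete version of the loop might look like the sketch below. It appends on each iteration as suggested above, and it also takes the subset before sampling, so that the sampled row numbers refer to rows of that position rather than to the full data frame. The seed and the fixed group sizes are assumptions for illustration, and the variable name sub is hypothetical.

```r
# Mock data with fixed, unequal group sizes (assumed, mirroring the question's table)
set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  word     = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

df_samples <- data.frame()  # start empty, before the loop

for (i in unique(df$position)) {
  sub <- df[df$position == i, ]                  # subset first ...
  picked <- sub[sample(nrow(sub), 3), ]          # ... then sample rows within the subset
  df_samples <- rbind(df_samples, picked)        # append instead of overwriting
}

table(df_samples$position)
```

Growing a data frame with rbind inside a loop is fine for five groups, but it copies the accumulated result on every iteration, so it scales poorly with many groups.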

answered 11 hours ago

xsabatox

New contributor

0 votes

We can use data.table: sample the row index .I within each group and use the result to subset the dataset. This should be very efficient.

library(data.table)

i1 <- setDT(df)[, sample(.I, 3), by = position]$V1

df[i1]

Or use sample_n from dplyr (part of the tidyverse)

library(tidyverse)

df %>% 

   group_by(position) %>% 

   sample_n(3)

Or as a function

f1 <- function(data) {

     data <- as.data.table(data)

     i1 <- data[, sample(.I, 3), by = position]$V1

     data[i1]

    }
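The idea behind the .I approach (sample row numbers within each group, then subset once) can also be sketched without any packages. The following is a base-R analogue using tapply, not data.table itself; the seed, the fixed group sizes and the names idx and df_sub are assumptions for illustration.

```r
# Mock data with fixed, unequal group sizes (assumed, mirroring the question's table)
set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  word     = sample(LETTERS, 100, replace = TRUE),
  position = rep(1:5, times = c(15, 15, 17, 28, 25))
)

# Group the row numbers 1..nrow(df) by position and sample 3 of them per group
idx <- unlist(tapply(seq_len(nrow(df)), df$position, sample, size = 3))

# Subset the original data frame once, using the sampled row numbers
df_sub <- df[idx, ]

table(df_sub$position)
```

One caveat: sample() treats a single number x as 1:x, so this sketch assumes every group has more than one row.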

edited 11 hours ago

answered 11 hours ago

akrun

388k13177250

