How to fill in NAs of various columns grouped by duplicated IDs in R
I have a table with columns id
, colA
, and colB
. The data contains duplicated id columns where for some rows, colA
or colB
is null, but its duplicated id
has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse
statement to replace missing values, but I am confused on how to use this in the context of having duplicated id
s.
r merge na
add a comment |
I have a table with columns id
, colA
, and colB
. The data contains duplicated id columns where for some rows, colA
or colB
is null, but its duplicated id
has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse
statement to replace missing values, but I am confused on how to use this in the context of having duplicated id
s.
r merge na
In the case where only one value perid
isNA
, is it always the first row with thatid
, as in your example data?
– neilfws
Nov 20 at 0:40
add a comment |
I have a table with columns id
, colA
, and colB
. The data contains duplicated id columns where for some rows, colA
or colB
is null, but its duplicated id
has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse
statement to replace missing values, but I am confused on how to use this in the context of having duplicated id
s.
r merge na
I have a table with columns id
, colA
, and colB
. The data contains duplicated id columns where for some rows, colA
or colB
is null, but its duplicated id
has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse
statement to replace missing values, but I am confused on how to use this in the context of having duplicated id
s.
r merge na
r merge na
asked Nov 19 at 23:56
Kevin Sun
11818
11818
In the case where only one value perid
isNA
, is it always the first row with thatid
, as in your example data?
– neilfws
Nov 20 at 0:40
add a comment |
In the case where only one value perid
isNA
, is it always the first row with thatid
, as in your example data?
– neilfws
Nov 20 at 0:40
In the case where only one value per
id
is NA
, is it always the first row with that id
, as in your example data?– neilfws
Nov 20 at 0:40
In the case where only one value per
id
is NA
, is it always the first row with that id
, as in your example data?– neilfws
Nov 20 at 0:40
add a comment |
3 Answers
3
active
oldest
votes
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
add a comment |
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
add a comment |
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384336%2fhow-to-fill-in-nas-of-various-columns-grouped-by-duplicated-ids-in-r%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
add a comment |
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
add a comment |
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
answered Nov 20 at 0:47
neilfws
17.5k53648
17.5k53648
add a comment |
add a comment |
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
add a comment |
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
add a comment |
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
edited Nov 20 at 14:23
answered Nov 20 at 0:44
Shree
3,2861323
3,2861323
add a comment |
add a comment |
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
add a comment |
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
add a comment |
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
answered Nov 20 at 0:32
André.B
528
528
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384336%2fhow-to-fill-in-nas-of-various-columns-grouped-by-duplicated-ids-in-r%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
In the case where only one value per
id
isNA
, is it always the first row with thatid
, as in your example data?– neilfws
Nov 20 at 0:40