ggplot2 and CSV “inventing” data that isn't in my input


I'm attempting to produce an attractive graph of bandwidth data across a number of machines and tests. My attempts seem to work for small manually entered amounts of data, but when I feed the "full" 1773 entries, I get results in my graph that don't seem to exist in the input data.

I believe this is likely because the different tests are each of different duration, but I can't seem to prove this. If I use the following input data as csv (sorry, off-site because of size) I end up with a strange upwards-curve on my geom_smooth line, and additional data points that I can't actually see in my .csv input data. (I have much more data in real life, this is a subset that produces the strange behaviour)

I would expect the first four tries (try01-try04) to flat-line at zero, and try05 to carry on at around 1GBit/sec. Here's my code:

library("ggplot2")
library("RColorBrewer")

speed = read.csv(file="data.csv")

svg("all_results.svg",width=24)
ggplot(speed,
    aes(x = Second, y = Bandwidth, group=Test, colour=Test)) +
    scale_fill_brewer(palette="Paired") +
    geom_point() +
    geom_smooth()
dev.off()

Here's the image produced

@Gregor seems to be exactly right in that the seconds are interpreted as text, when they should represent the number of the seconds since the start of that test.
Here's some example input data - please note the times are not always on a .00 second boundary due to the output of iperf.

structure(list(Machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "valhalla", class = "factor"),
    User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "alice", class = "factor"),
    Test = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "try01", class = "factor"),
    Second = structure(c(1L, 2L, 13L, 14L, 15L, 16L, 17L, 18L,
    19L, 20L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("0.00-1.00",
    "1.00-2.00", "10.00-11.00", "11.00-12.00", "12.00-13.00",
    "13.00-14.00", "14.00-15.00", "15.00-16.00", "16.00-17.00",
    "17.00-18.00", "18.00-19.00", "19.00-20.00", "2.00-3.00",
    "3.00-4.00", "4.00-5.00", "5.00-6.00", "6.00-7.00", "7.00-8.00",
    "8.00-9.00", "9.00-10.00"), class = "factor"), Bandwidth = c(937,
    943, 944, 943, 943, 943, 943, 944, 658, 943, 944, 943, 944,
    644, 943, 943, 943, 944, 943, 943)), row.names = c(NA, 20L
), class = "data.frame")

I'll try casting (or whatever R calls it) those to a float now.

edited Nov 19 at 16:46

asked Nov 19 at 13:45

user9793038

104

Your x-axis looks categorical, which means it is probably a factor and is ordered alphabetically. You don't share any data, and we can't read any values off your chart, but I would guess that your Second column should be treated as numeric and you should convert it. If you share some sample data we can help with that.
– Gregor
Nov 19 at 16:04

(And by "don't share any data", I mean in the question itself.) dput(droplevels(head(speed, 20))) would be a great way to share the top 20 rows of your data, in a copy/pasteable way that shows the object structure and classes. And it doesn't require asking people to download and import some large data.
– Gregor
Nov 19 at 16:06

Ah, you're exactly correct @Gregor - my time is being treated as text. It's of the form "9.01-10.00" and "12.00-13.00" (i.e. approximately one second per sample). I'll update my question to include the dput as it's too large for the comment
– user9793038
Nov 19 at 16:41
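
To see why a text or factor x-axis scrambles the time order, here is a minimal base-R sketch. The five labels are a hand-picked subset in the same format as the Second column, chosen only for illustration. Before R 4.0, read.csv() converted text columns to factors by default; in later versions the column arrives as character, and ggplot2 still treats it as a discrete axis sorted as text.

```r
# Interval labels as iperf prints them, in true time order.
labels <- c("0.00-1.00", "1.00-2.00", "2.00-3.00", "9.00-10.00", "10.00-11.00")

# Roughly what read.csv() produced for this column before R 4.0.
second <- factor(labels)

# Factor levels are sorted as text, so "10.00-11.00" lands before "2.00-3.00".
levels(second)

# Taking the number before the dash gives a value that sorts in true time order.
sec_start <- as.numeric(sub("-.*$", "", labels))
labels[order(sec_start)]
```

The last line returns the labels in their original time order, which is the order a numeric x-axis would use.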

r csv ggplot2


1 Answer

accepted

Points have a single x value, not a range of x-values, so we'll separate your Second column into the beginning and end of the interval and plot the points at the beginning. Calling your data dd:

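library(tidyr)
library(dplyr)

dd = dd %>%
  # split "0.00-1.00" into "0.00" and "1.00"; remove = FALSE keeps the original column
  separate(Second, into = c("sec_start", "sec_end"), sep = "-", remove = FALSE) %>%
  # separate() returns text, so convert both pieces to numbers
  mutate(sec_start = as.numeric(sec_start),
         sec_end = as.numeric(sec_end))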

After that the plotting should go just fine if you put sec_start or sec_end on the x-axis. (Or calculate the middle, whatever you want...)
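
As a rough sketch, the plot from the question with a numeric x-axis might look like the following. The small data frame is made up for illustration and only mimics the shape of the converted data. Using scale_colour_brewer in place of the question's scale_fill_brewer is an assumption that the palette was meant for the colour aesthetic, and method = "loess" is spelled out only to avoid the default-method message.

```r
library(ggplot2)

# Made-up stand-in for the converted data: two tests, ten one-second samples each.
dd <- data.frame(
  Test      = rep(c("try01", "try02"), each = 10),
  sec_start = rep(0:9, times = 2),
  Bandwidth = c(937, 943, 944, 943, 943, 943, 943, 944, 658, 943,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Same layers as in the question, but with the numeric sec_start on the x-axis.
p <- ggplot(dd, aes(x = sec_start, y = Bandwidth, group = Test, colour = Test)) +
  scale_colour_brewer(palette = "Paired") +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)
```

Calling print(p) between svg(...) and dev.off() would write the file as in the question.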

If you want to visualize the durations, you could use geom_segment and aes(x = sec_start, xend = sec_end, y = Bandwidth, yend = Bandwidth), but since everything is just about the same duration, it doesn't seem like this would add much value.
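
A minimal sketch of that geom_segment idea, assuming a data frame with the sec_start and sec_end columns described above (the four rows here are made up for illustration):

```r
library(ggplot2)

# Made-up intervals in the shape produced by the conversion above.
dd <- data.frame(
  sec_start = c(0, 1, 2, 3),
  sec_end   = c(1, 2, 3, 4),
  Bandwidth = c(937, 943, 944, 658)
)

# One horizontal segment per sample, spanning the interval it was measured over.
p <- ggplot(dd) +
  geom_segment(aes(x = sec_start, xend = sec_end,
                   y = Bandwidth, yend = Bandwidth))
```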

answered Nov 19 at 17:52

Gregor

62.3k988163

Thanks Gregor, your answer worked more or less verbatim. I now need to read up on tidyr and dplyr, and where that "%>%" syntax comes from. I'm not yet deep enough into R for the landscape to make complete sense, and I'm still stuck in my more perl-aware thinking...
– user9793038
Nov 20 at 9:04

Glad it helped. I'd strongly recommend the package vignette An Introduction to dplyr. Covers all the dplyr basics, including %>%.
– Gregor
Nov 20 at 14:10
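
For readers new to the pipe, a small sketch of what %>% does: it passes the value on its left as the first argument of the call on its right. The operator comes from the magrittr package, and dplyr re-exports it. The vector x is made up for illustration.

```r
library(dplyr)  # attaching dplyr makes %>% available

x <- c(4, 9, 16)

# x %>% f(y) is another way of writing f(x, y).
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))

identical(piped, nested)
```

Both forms compute the same value; the piped version reads left to right in the order the steps are applied.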
