How to detect lines that are unique in a large file using Reactive Extensions
I have to process large CSV files (up to tens of GB) that look like this:
Key,CompletedA,CompletedB
1,true,NULL
2,true,NULL
3,false,NULL
1,NULL,true
2,NULL,true
I have a parser that yields parsed lines as IEnumerable<Record>, so that it reads only one line at a time into memory.
Now I have to group the records by Key and check whether the columns CompletedA and CompletedB both have a value within the group. As output I need the records that do not have both CompletedA and CompletedB within their group.
In this case it is the record with key 3.
However, there are many similar processing steps running over the same dataset, and I don't want to iterate over it multiple times.
I think I can convert the IEnumerable into an IObservable and use Reactive Extensions to find the records.
Is it possible to do this in a memory-efficient way with a simple LINQ expression over the IObservable collection?
c# system.reactive yield file-processing
asked Nov 22 '18 at 8:13
Liero
Sure, you could also use a pipeline processor like Dataflow, or Reactive Extensions; however, this is all overkill. You can do it efficiently in a foreach loop, and you would be doing yourself a favor to try that first.
– Michael Randall
Nov 22 '18 at 8:17
records.CountBy(z => new { Key = z.Key, Value = z.CompletedA ?? z.CompletedB}).Where(z => z.Value == 1).Select(z => z.Key)
might get you started. You'll need nuget.org/packages/morelinq for this.
– mjwills
Nov 22 '18 at 8:44
How many distinct Keys do you have?
– Dmitry Bychenko
Nov 22 '18 at 8:56
@TheGeneral: This is just one of many such analytics, and I would have to do all of them in a single foreach. There are also other reasons why foreach is not suitable.
– Liero
Nov 22 '18 at 9:18
@DmitryBychenko: "How many distinct Keys do you have?": half the number of records or more. Not sure how many lines there will be in production, but given the file size, a lot.
– Liero
Nov 22 '18 at 9:22
2 Answers
Provided that Key is an integer, we can try using a Dictionary and one scan:
    // value: 0b00 - neither A nor B
    //        0b01 - A only
    //        0b10 - B only
    //        0b11 - both A and B
    Dictionary<int, byte> Status = new Dictionary<int, byte>();

    var query = File
        .ReadLines(@"c:\MyFile.csv")
        .Where(line => !string.IsNullOrWhiteSpace(line))
        .Skip(1) // skip header
        .Select(line => YourParserHere(line));

    foreach (var record in query) {
        int mask = (record.CompletedA != null ? 1 : 0) |
                   (record.CompletedB != null ? 2 : 0);

        if (Status.TryGetValue(record.Key, out var value))
            Status[record.Key] = (byte) (value | mask);
        else
            Status.Add(record.Key, (byte) mask);
    }

    // All keys that don't have the value 3 == 0b11 (both A and B)
    var notBothAandB = Status
        .Where(pair => pair.Value != 3)
        .Select(pair => pair.Key);
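For anyone who wants to try the dictionary scan without the real file, here is a minimal self-contained sketch. The `Record` class, `ParseRecord` (a stand-in for the `YourParserHere` placeholder) and the `IncompleteKeys` wrapper are all assumptions made for illustration; the question does not show the actual types or parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the question does not show the real one.
public sealed class Record
{
    public int Key { get; set; }
    public bool? CompletedA { get; set; }
    public bool? CompletedB { get; set; }
}

public static class IncompleteKeys
{
    // Hypothetical stand-in for YourParserHere: parses "Key,CompletedA,CompletedB".
    public static Record ParseRecord(string line)
    {
        string[] parts = line.Split(',');
        return new Record
        {
            Key = int.Parse(parts[0]),
            CompletedA = ParseFlag(parts[1]),
            CompletedB = ParseFlag(parts[2]),
        };
    }

    // Maps the literal text NULL to null, anything else to a bool.
    private static bool? ParseFlag(string text)
    {
        text = text.Trim();
        return text == "NULL" ? (bool?)null : bool.Parse(text);
    }

    // One pass over the records; memory grows with the number of distinct keys,
    // not with the number of lines.
    public static List<int> Find(IEnumerable<Record> records)
    {
        var status = new Dictionary<int, byte>();
        foreach (var record in records)
        {
            int mask = (record.CompletedA != null ? 1 : 0)
                     | (record.CompletedB != null ? 2 : 0);
            status.TryGetValue(record.Key, out byte seen); // seen stays 0 for a new key
            status[record.Key] = (byte)(seen | mask);
        }
        // Keep the keys whose bits are not 0b11 (A and B were not both seen).
        return status.Where(p => p.Value != 3).Select(p => p.Key).ToList();
    }
}
```

Feeding the five sample lines from the question through `IncompleteKeys.Find(lines.Select(IncompleteKeys.ParseRecord))` should leave only key 3. Note that this treats `false` as "has a value", as the answer's mask does.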
edited Nov 22 '18 at 8:59
answered Nov 22 '18 at 8:46
Dmitry Bychenko
The reason I asked about an Rx solution is that there would be too much stuff in a single foreach, so I need to split it somehow without enumerating the records multiple times. I thought a push collection with multiple subscribers would work. Moreover, I have different scenarios with different sets of "analytics". An Rx solution would make each "analytic" a nice reusable piece.
– Liero
Nov 22 '18 at 9:27
@Liero - You would need to ensure that the IEnumerable<Record> is lazy to make Rx efficient, but if it is then a simple loop will be too.
– Enigmativity
Nov 22 '18 at 11:09
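The two comments above can be reconciled with a small sketch: each "analytic" is written as an `IObserver<Record>` (an interface from the base class library, no Rx package needed), and one plain loop pushes every record to all of them, so the lazy source is enumerated only once. The `Record` type, both analytic classes and `SinglePass.Run` are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the question does not show the real one.
public sealed class Record
{
    public int Key { get; set; }
    public bool? CompletedA { get; set; }
    public bool? CompletedB { get; set; }
}

// One reusable "analytic": keys that never saw a value in both columns.
public sealed class IncompleteKeysAnalytic : IObserver<Record>
{
    private readonly Dictionary<int, byte> _status = new Dictionary<int, byte>();

    public List<int> Result { get; private set; } = new List<int>();

    public void OnNext(Record record)
    {
        int mask = (record.CompletedA != null ? 1 : 0)
                 | (record.CompletedB != null ? 2 : 0);
        _status.TryGetValue(record.Key, out byte seen); // seen stays 0 for a new key
        _status[record.Key] = (byte)(seen | mask);
    }

    public void OnError(Exception error) { }

    // The result is only known once the whole input has been seen.
    public void OnCompleted() =>
        Result = _status.Where(p => p.Value != 3).Select(p => p.Key).ToList();
}

// A second, unrelated analytic to show the fan-out: counts records.
public sealed class RecordCountAnalytic : IObserver<Record>
{
    public long Count { get; private set; }
    public void OnNext(Record record) => Count++;
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}

public static class SinglePass
{
    // Enumerates the source exactly once and pushes each record to every observer.
    public static void Run(IEnumerable<Record> source, params IObserver<Record>[] observers)
    {
        foreach (var record in source)
            foreach (var observer in observers)
                observer.OnNext(record);

        foreach (var observer in observers)
            observer.OnCompleted();
    }
}
```

Because the analytics implement the standard observer interface, the same classes could presumably be subscribed to a published Rx observable (for example via `ToObservable().Publish()` followed by `Connect()` from System.Reactive) if the push-based composition is still preferred; the memory cost per analytic stays the same either way.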
I think this will do what you need:
    var result =
        source
            .GroupBy(x => x.Key)
            .SelectMany(xs =>
                (xs.Select(x => x.CompletedA).Any(x => x != null && x == true) && xs.Select(x => x.CompletedB).Any(x => x != null && x == true))
                    ? new List<Record>()
                    : xs.ToList());
Using Rx doesn't help here.
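As a rough check of the grouping approach, the sketch below runs an equivalent query over the sample rows from the question. The `Record` class and the `GroupByDemo` wrapper are assumptions for illustration. Two caveats worth keeping in mind: LINQ to Objects `GroupBy` buffers the whole source before it yields the first group, so on a file of tens of GB it holds every record in memory; and this query requires the flags to be `true`, whereas the dictionary answer only requires them to be non-null. On the sample data both readings agree.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the question does not show the real one.
public sealed class Record
{
    public int Key { get; set; }
    public bool? CompletedA { get; set; }
    public bool? CompletedB { get; set; }
}

public static class GroupByDemo
{
    // Returns the records whose group lacks a true CompletedA or a true CompletedB.
    public static List<Record> Incomplete(IEnumerable<Record> source) =>
        source
            .GroupBy(x => x.Key)
            // A null flag compares unequal to true, so no separate null check is needed.
            .Where(g => !(g.Any(x => x.CompletedA == true) && g.Any(x => x.CompletedB == true)))
            .SelectMany(g => g)
            .ToList();
}
```

For the five sample rows, `GroupByDemo.Incomplete` should return the single record with key 3.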
answered Nov 22 '18 at 11:24
Enigmativity