Version of C# StringBuilder to allow for strings larger than 2 billion characters
$begingroup$
In C# on 64-bit Windows with .NET 4.5 (or later), enabling gcAllowVeryLargeObjects in the App.config file allows for objects larger than two gigabytes. That's cool, but unfortunately, the maximum number of elements that C# allows in an array is still limited to about 2^31 ≈ 2.15 billion. Testing confirmed this.
To overcome this, Microsoft recommends in Option B creating the arrays natively. The problem is that we need to use unsafe code, and as far as I know, Unicode won't be supported, at least not easily.
So I ended up creating my own BigStringBuilder class. It's a list where each list element (or page) is a char array (type List<char[]>).
Provided you're using 64-bit Windows, you can now easily surpass the 2 billion character element limit. I managed to test creating a giant string around 32 gigabytes large (I needed to increase virtual memory in the OS first; otherwise I could only get around 7GB on my 8GB RAM PC). I'm sure it handles more than 32GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars, or one quintillion characters, which should be enough for anyone!
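As a rough illustration of the paging idea described above (not the class itself), the sketch below splits a 64-bit character index into a page number and an offset within that page using a shift and a mask. The names PageMath and Locate are hypothetical, and the sketch assumes the page size is a power of two.

```csharp
using System;

public static class PageMath
{
    // Splits a 64-bit index into (page, offset) for pages of 2^pageDepth chars.
    // Hypothetical helper for illustration only.
    public static void Locate(long index, int pageDepth, out int page, out int offset)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));
        long mask = (1L << pageDepth) - 1;   // the low pageDepth bits set
        page = (int)(index >> pageDepth);    // same as index / pageSize
        offset = (int)(index & mask);        // same as index % pageSize
    }
}
```

With a page depth of 12 (pages of 4096 chars), index 4097 lands on page 1 at offset 1.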
I also added some common functions such as fileSave(), length(), substring(), replace(), etc. As with StringBuilder, in-place character writing (mutability) and instant truncation are possible.
Speed-wise, some quick tests show that it's not significantly slower than a StringBuilder when appending (it was 33% slower in one test). I got similar performance with a 2D jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.
I'm looking for advice to potentially speed up performance, particularly for the append function, and to access or write faster via the indexer (public char this[long n] {...}).
// A simplified version specially for StackOverflow / Codereview
public class BigStringBuilder
{
    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (since must be a power of 2 for a fast indexer)
        this.pagedepth = pagedepth;
        pagesize = (long)Math.Pow(2, pagedepth);
        mpagesize = pagesize - 1;
        c.Add(new char[pagesize]);
    }

    // Indexer for this class, so you can use convenient square bracket indexing to address char elements within the array!!
    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }

    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }

    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);
        currentPage = 0;
        currentPosInPage = 0;
    }

    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }

    public void fileOpen(string path) {
        clear();
        StreamReader sw = new StreamReader(path);
        int len = 0;
        while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
            if (!sw.EndOfStream) {
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;
            }
        }
        sw.Close();
    }

    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }

    public string ToString(long max = 2000000000) {
        if (length() < max) return substring(0, length());
        else return substring(0, max);
    }

    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
        return sb.ToString();
    }

    public bool match(string find, long start = 0) {
        //if (s.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
        return true;
    }

    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }

    // Simple implementation of an append() function. Testing shows this to be about
    // as fast or faster than the more sophisticated Append2() function further below
    // despite its simplicity:
    public void Append(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            c[currentPage][currentPosInPage] = s[i];
            currentPosInPage++;
            if (currentPosInPage == pagesize)
            {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }

    // This method is a more sophisticated version of the Append() function above.
    // Surprisingly, in real-world testing, it doesn't seem to be any faster.
    public void Append2(string s)
    {
        if (currentPosInPage + s.Length <= pagesize)
        {
            // append s entirely to current page
            for (int i = 0; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
        else
        {
            int stringpos;
            int topup = (int)pagesize - currentPosInPage;
            // Finish off current page with substring of s
            for (int i = 0; i < topup; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
            currentPage++;
            currentPosInPage = 0;
            stringpos = topup;
            int remainingPagesToFill = (s.Length - topup) >> pagedepth; // We want the floor here
            // fill up complete pages if necessary:
            if (remainingPagesToFill > 0)
            {
                for (int i = 0; i < remainingPagesToFill; i++)
                {
                    if (currentPage == c.Count) c.Add(new char[pagesize]);
                    for (int j = 0; j < pagesize; j++)
                    {
                        c[currentPage][j] = s[stringpos];
                        stringpos++;
                    }
                    currentPage++;
                }
            }
            // finish off remainder of string s on new page:
            if (currentPage == c.Count) c.Add(new char[pagesize]);
            for (int i = stringpos; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
    }
}
c# performance strings pagination
New contributor
$endgroup$
4
$begingroup$
So what kind of crazy stuff does one need such large data types for?
$endgroup$
– t3chb0t
11 hours ago
3
$begingroup$
ok... but isn't streaming it easier and faster than loading the entire file into memory? It screams: the XY Problem. Your users are not responsible for you wasting RAM :-P
$endgroup$
– t3chb0t
11 hours ago
1
$begingroup$
The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM.
$endgroup$
– t3chb0t
10 hours ago
1
$begingroup$
oh boy... this sounds like you're pushing json over csv... this is even more scary than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-]
$endgroup$
– t3chb0t
10 hours ago
1
$begingroup$
@DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff?
$endgroup$
– Pieter Witvoet
10 hours ago
edited 6 hours ago by Simon Forsberg♦
asked 13 hours ago by Dan W (new contributor)
1 Answer
$begingroup$
List<char[]> c = new List<char[]>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;
Some of these names are rather cryptic. I'm not sure why c isn't explicitly marked private like the other fields. And surely some of the fields should be readonly?
pagesize = (long)Math.Pow(2, pagedepth);
IMO it's better style to use 1L << pagedepth.
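A minimal sketch of the two points above, assuming the sizing fields never change after construction: the fields become readonly and the page size is computed with a shift instead of Math.Pow. The class name PagedBufferLayout and the upper bound of 30 on the depth are assumptions made for the example, not part of the reviewed code.

```csharp
using System;

public sealed class PagedBufferLayout
{
    // Set once in the constructor, so they can be readonly.
    private readonly int pageDepth;
    private readonly long pageSize;
    private readonly long pageMask;

    public PagedBufferLayout(int pageDepth = 12)
    {
        // Assumed limit: 2^30 chars keeps a single page below the array length cap.
        if (pageDepth < 0 || pageDepth > 30)
            throw new ArgumentOutOfRangeException(nameof(pageDepth));

        this.pageDepth = pageDepth;
        pageSize = 1L << pageDepth;   // exact integer power of two, no floating point
        pageMask = pageSize - 1;      // used as "index & pageMask" for the offset
    }

    public int PageDepth { get { return pageDepth; } }
    public long PageSize { get { return pageSize; } }
    public long PageMask { get { return pageMask; } }
}
```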
public char this[long n] {
    get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
    set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}
Shouldn't this have bounds checks?
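One possible shape for a bounds-checked indexer is sketched below. It is a stripped-down stand-in rather than the reviewed class: CheckedPagedChars, its fixed page depth of 12, and the single-character Append are all assumptions made to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;

public sealed class CheckedPagedChars
{
    private const int PageDepth = 12;
    private const int PageSize = 1 << PageDepth;
    private const int PageMask = PageSize - 1;

    private readonly List<char[]> pages = new List<char[]>();
    private long length;

    public long Length { get { return length; } }

    public void Append(char ch)
    {
        int page = (int)(length >> PageDepth);
        if (page == pages.Count) pages.Add(new char[PageSize]);  // grow by one page on demand
        pages[page][(int)(length & PageMask)] = ch;
        length++;
    }

    public char this[long index]
    {
        get
        {
            CheckIndex(index);
            return pages[(int)(index >> PageDepth)][(int)(index & PageMask)];
        }
        set
        {
            CheckIndex(index);
            pages[(int)(index >> PageDepth)][(int)(index & PageMask)] = value;
        }
    }

    // Rejects negative indices and anything at or beyond the logical length,
    // even if the underlying page happens to have spare capacity.
    private void CheckIndex(long index)
    {
        if (index < 0 || index >= length)
            throw new ArgumentOutOfRangeException(nameof(index));
    }
}
```

The check compares against the logical length rather than the allocated capacity, so reading the unused tail of the last page fails instead of returning '\0'.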
public string[] returnPagesForTestingPurposes() {
    string[] s = new string[currentPage + 1];
    for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
    return s;
}
There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")] (note that Conditional can only be applied to methods returning void, so for this method as written an #if DEBUG block would be needed instead).
public void clear() {
    c = new List<char[]>();
    c.Add(new char[pagesize]);
In C# it's conventional for method names to start with an upper case letter.
There's no need to throw quite as much to the garbage collector. Consider as an alternative:
var page0 = c[0];
c.Clear();
c.Add(page0);
// See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
Why? I don't think it sheds any light on the following method.
public void fileSave(string path) {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}
Missing some disposal: I'd use a using statement.
new string(char[]) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a method Write(char[], int, int).
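Both points could be combined roughly as follows. This is only a sketch: the static helper PagedFileWriter and its parameter list are hypothetical, with pages, currentPage and currentPosInPage standing in for the fields in the question.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; in the real class this would be an instance method using its own fields.
public static class PagedFileWriter
{
    // Writes every full page, then only the used part of the last page.
    public static void Save(string path, List<char[]> pages, int currentPage, int currentPosInPage)
    {
        // 'using' disposes (and so flushes and closes) the writer even if a Write call throws.
        using (StreamWriter writer = File.CreateText(path))
        {
            for (int i = 0; i < currentPage; i++)
                writer.Write(pages[i], 0, pages[i].Length);   // no intermediate string copy
            writer.Write(pages[currentPage], 0, currentPosInPage);
        }
    }
}
```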
public void fileOpen(string path) {
clear();
Yikes! That should be mentioned in the method documentation.
StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;
I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] is full.
This method is also missing disposal.
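One way to keep that invariant is to decide on the number of characters read rather than on EndOfStream. The sketch below assumes a hypothetical static helper PagedFileReader that takes any TextReader and reports the final position as a tuple; none of those names come from the question.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; names and signature are assumptions for illustration.
public static class PagedFileReader
{
    // Fills 'pages' from 'reader' and returns (currentPage, currentPosInPage).
    // Invariant kept: the returned position is always < pageSize, so a page that
    // was filled exactly is followed by a fresh, empty current page.
    public static (int page, int pos) Load(TextReader reader, List<char[]> pages, int pageSize)
    {
        pages.Clear();
        pages.Add(new char[pageSize]);
        int page = 0;
        while (true)
        {
            // ReadBlock keeps reading until the requested count is reached or the input ends.
            int read = reader.ReadBlock(pages[page], 0, pageSize);
            if (read < pageSize)
                return (page, read);        // partial (or empty) final page
            pages.Add(new char[pageSize]);  // page filled exactly: move on to a new one
            page++;
        }
    }
}
```

A caller would wrap the reader in a using statement, for example using (var reader = new StreamReader(path)) { ... }, which also takes care of the disposal.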
public long length() {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}
Why is this a method rather than a property? Why use multiplication rather than <<?
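A property using the shift might look like this. PageCursor is a hypothetical holder for the three relevant fields, introduced only so that the snippet compiles on its own.

```csharp
// Hypothetical minimal holder for the fields that the length depends on.
public class PageCursor
{
    private readonly int pageDepth;        // log2 of the page size
    private readonly int currentPage;
    private readonly int currentPosInPage;

    public PageCursor(int pageDepth, int currentPage, int currentPosInPage)
    {
        this.pageDepth = pageDepth;
        this.currentPage = currentPage;
        this.currentPosInPage = currentPosInPage;
    }

    // '+' binds tighter than '<<' in C#, so the parentheses around the shift are required.
    // The cast to long happens before shifting, so the result cannot overflow an int.
    public long Length => ((long)currentPage << pageDepth) + currentPosInPage;
}
```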
public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char[], int, int).
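A chunked version could be sketched like this. The static helper PagedSubstring and its parameters are assumptions; it treats the range as half-open, as the loop in the question does, and assumes 0 <= start <= end with a result short enough to fit in a single string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; in the real class the pages and page depth would be fields.
public static class PagedSubstring
{
    // Copies the half-open range [start, end) out of the pages, one chunk per page touched.
    public static string Substring(List<char[]> pages, int pageDepth, long start, long end)
    {
        long pageSize = 1L << pageDepth;
        long pageMask = pageSize - 1;
        var sb = new StringBuilder((int)(end - start));   // pre-size to avoid regrowth
        long pos = start;
        while (pos < end)
        {
            int page = (int)(pos >> pageDepth);
            int offset = (int)(pos & pageMask);
            // Take either the rest of this page or the rest of the requested range.
            int count = (int)Math.Min(pageSize - offset, end - pos);
            sb.Append(pages[page], offset, count);
            pos += count;
        }
        return sb.ToString();
    }
}
```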
public bool match(string find, long start = 0) {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}
What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).
public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}
Bounds checks?
// This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.
I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan.CopyTo.
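To illustrate the block-copy idea, here is a minimal sketch of an append built on string.CopyTo. The class ChunkedAppender, its constructor and its read-only indexer are hypothetical and much simpler than the class in the question; no claim is made here about how much faster it is in practice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simplified builder, only to show block-wise appending.
public class ChunkedAppender
{
    private readonly int pageSize;
    private readonly List<char[]> pages = new List<char[]>();
    private int currentPage;
    private int currentPosInPage;

    public ChunkedAppender(int pageSize)
    {
        this.pageSize = pageSize;
        pages.Add(new char[pageSize]);
    }

    public long Length => (long)currentPage * pageSize + currentPosInPage;

    public void Append(string s)
    {
        int copied = 0;
        while (copied < s.Length)
        {
            // Copy as much as fits into the current page in one block.
            int count = Math.Min(pageSize - currentPosInPage, s.Length - copied);
            s.CopyTo(copied, pages[currentPage], currentPosInPage, count);
            copied += count;
            currentPosInPage += count;
            if (currentPosInPage == pageSize)
            {
                // Page is full: advance, allocating the next page if necessary.
                currentPage++;
                currentPosInPage = 0;
                if (currentPage == pages.Count) pages.Add(new char[pageSize]);
            }
        }
    }

    // Read-only indexer without bounds checks, kept minimal for the demonstration.
    public char this[long n] => pages[(int)(n / pageSize)][n % pageSize];
}
```

Each iteration copies at most one page's worth, so the number of copy calls depends on how many pages the string spans, not on its length in characters.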
$endgroup$
edited 10 hours ago
answered 11 hours ago
Peter Taylor
$begingroup$
Thanks! Added responses to my main post if you want to look.
$endgroup$
– Dan W
10 hours ago
1
$begingroup$
Regarding the last point, there's also string.CopyTo.
$endgroup$
– Pieter Witvoet
10 hours ago
$begingroup$
C# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
$endgroup$
– BurnsBA
7 hours ago
4
$begingroup$
So what kind of crazy stuff does one need such large data types for?
$endgroup$
– t3chb0t
11 hours ago
3
$begingroup$
ok... but isn't streaming it easier and faster than loading the entire file into memory? It screams: the XY Problem. Your users are not responsible for you wasting RAM :-P
$endgroup$
– t3chb0t
11 hours ago
1
$begingroup$
The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM.
$endgroup$
– t3chb0t
10 hours ago
1
$begingroup$
oh boy... this sounds like you're pushing json over csv... this is even more scary than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-]
$endgroup$
– t3chb0t
10 hours ago