Version of C# StringBuilder to allow for strings larger than 2 billion characters












$begingroup$


On 64-bit Windows with .NET 4.5 or later, enabling gcAllowVeryLargeObjects in the App.config file allows objects larger than two gigabytes. That's cool, but unfortunately the maximum number of elements that C# allows in an array is still limited to about 2^31 ≈ 2.15 billion. Testing confirmed this.



To overcome this, Microsoft recommends in Option B creating the arrays natively. The problem is that this needs unsafe code, and as far as I know, Unicode won't be supported, at least not easily.



So in the end I created my own BigStringBuilder class. It's a list where each list element (or page) is a char array (type List<char[]>).
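As a rough illustration of the paging idea, a 64-bit character index can be split into a page number and an offset within that page using a shift and a mask, provided the page size is a power of two. The helper below is only a sketch; `PageAddress`, `PageOf` and `OffsetOf` are hypothetical names used for this example and are not part of the class shown later.

```csharp
using System;

// Illustrative helper; PageAddress is a hypothetical name used only here.
public static class PageAddress
{
    // Which page holds global index n, for pages of 2^pageDepth chars.
    public static int PageOf(long n, int pageDepth)
    {
        return (int)(n >> pageDepth);
    }

    // Position of global index n inside its page.
    public static int OffsetOf(long n, int pageDepth)
    {
        long mask = (1L << pageDepth) - 1; // low pageDepth bits set
        return (int)(n & mask);
    }
}
```

With a page depth of 12 (4096 chars per page), index 10000 lands on page 2 at offset 1808, since 2 * 4096 + 1808 = 10000.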



Provided you're using 64-bit Windows, you can now easily surpass the 2 billion character limit. I managed to create a giant string around 32 gigabytes large (I needed to increase virtual memory in the OS first; otherwise I could only get around 7 GB on my 8 GB RAM PC). I'm sure it handles more than 32 GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars, or one quintillion characters, which should be enough for anyone!
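The theoretical ceiling depends on the page size, so it may help to work the arithmetic out explicitly. The sketch below assumes the list can hold roughly 2^31 pages (an assumption; the real List<T> limit is somewhat lower) and that each page holds 2^pagedepth chars. `CapacityEstimate` and `MaxChars` are hypothetical names for this example only.

```csharp
using System;

// Illustrative helper, not part of BigStringBuilder: estimates how many
// chars fit in a list of pages under assumed limits.
public static class CapacityEstimate
{
    // maxPages is an assumption about how many pages the list can hold.
    public static long MaxChars(int pageDepth, long maxPages)
    {
        long pageSize = 1L << pageDepth;     // chars per page, 2^pageDepth
        return checked(maxPages * pageSize); // throws on 64-bit overflow
    }
}
```

Under these assumptions, a page depth of 12 gives 2^43 (about 8.8 trillion) chars, while a page depth of 30 gives 2^61 (about 2.3 quintillion) chars, so the quintillion figure is only approached with much larger pages than the default. Each char also occupies two bytes, so available memory is the practical limit long before either number.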



I also added some common functions such as fileSave(), length(), substring(), replace(), etc. As with StringBuilder, in-place character writing (mutability) and instant truncation are possible.



Speed-wise, some quick tests show that it's not significantly slower than a StringBuilder when appending (it was 33% slower in one test). I got similar performance with a 2D jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.



I'm looking for advice to potentially speed up performance, particularly for the append function, and to read or write faster via the indexer (public char this[long n] {...}).



// A simplified version specially for StackOverflow / Codereview
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class BigStringBuilder
{
    List<char[]> c = new List<char[]>(); // the pages; each page is a char[] of length pagesize
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // pagesize - 1, used as a bit mask: https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (since must be a power of 2 for a fast indexer)
        this.pagedepth = pagedepth;
        pagesize = (long)Math.Pow(2, pagedepth);
        mpagesize = pagesize - 1;
        c.Add(new char[pagesize]);
    }

    // Indexer for this class, so you can use convenient square bracket indexing to address char elements within the array!!
    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }

    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }

    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);
        currentPage = 0;
        currentPosInPage = 0;
    }

    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }

    public void fileOpen(string path) {
        clear();
        StreamReader sw = new StreamReader(path);
        int len = 0;
        while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
            if (!sw.EndOfStream) {
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;
            }
        }
        sw.Close();
    }

    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }

    public string ToString(long max = 2000000000) {
        if (length() < max) return substring(0, length());
        else return substring(0, max);
    }

    // Returns the characters in [x, y)
    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
        return sb.ToString();
    }

    // True if 'find' occurs at position 'start'
    public bool match(string find, long start = 0) {
        //if (find.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (start + i >= length() || this[start + i] != find[i]) return false;
        return true;
    }

    // Overwrites characters in place, starting at pos
    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }

    // Simple implementation of an append() function. Testing shows this to be about
    // as fast or faster than the more sophisticated Append2() function further below
    // despite its simplicity:
    public void Append(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            c[currentPage][currentPosInPage] = s[i];
            currentPosInPage++;
            if (currentPosInPage == pagesize)
            {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }

    // This method is a more sophisticated version of the Append() function above.
    // Surprisingly, in real-world testing, it doesn't seem to be any faster.
    public void Append2(string s)
    {
        if (currentPosInPage + s.Length <= pagesize)
        {
            // append s entirely to current page
            for (int i = 0; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
        else
        {
            int stringpos;
            int topup = (int)pagesize - currentPosInPage;
            // Finish off current page with substring of s
            for (int i = 0; i < topup; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
            currentPage++;
            currentPosInPage = 0;
            stringpos = topup;
            int remainingPagesToFill = (s.Length - topup) >> pagedepth; // We want the floor here
            // fill up complete pages if necessary:
            if (remainingPagesToFill > 0)
            {
                for (int i = 0; i < remainingPagesToFill; i++)
                {
                    if (currentPage == c.Count) c.Add(new char[pagesize]);
                    for (int j = 0; j < pagesize; j++)
                    {
                        c[currentPage][j] = s[stringpos];
                        stringpos++;
                    }
                    currentPage++;
                }
            }
            // finish off remainder of string s on new page:
            if (currentPage == c.Count) c.Add(new char[pagesize]);
            for (int i = stringpos; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
    }
}
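For comparison, one variant of the append path that might be worth timing copies whole runs of characters per page with String.CopyTo instead of writing one char at a time. The sketch below is a stand-alone illustration, not the class above, and it has not been measured against the timings quoted earlier. `PagedCharBuffer` and its members are hypothetical names, and the sketch assumes a page depth below 31 so that the page size fits in an int.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone sketch; PagedCharBuffer and its members are
// illustrative names, not part of BigStringBuilder.
public class PagedCharBuffer
{
    private readonly List<char[]> pages = new List<char[]>();
    private readonly int pageSize;
    private int currentPage = 0;
    private int posInPage = 0;

    // Assumes 0 <= pageDepth < 31 so that the page size fits in an int.
    public PagedCharBuffer(int pageDepth = 12)
    {
        pageSize = 1 << pageDepth;
        pages.Add(new char[pageSize]);
    }

    public long Length
    {
        get { return (long)currentPage * pageSize + posInPage; }
    }

    public char this[long n]
    {
        get { return pages[(int)(n / pageSize)][(int)(n % pageSize)]; }
    }

    public void Append(string s)
    {
        int copied = 0;
        while (copied < s.Length)
        {
            // Move to the next page first if the current one is full.
            if (posInPage == pageSize)
            {
                currentPage++;
                posInPage = 0;
                if (currentPage == pages.Count) pages.Add(new char[pageSize]);
            }
            // Copy as much of s as fits into the rest of the current page.
            int run = Math.Min(pageSize - posInPage, s.Length - copied);
            s.CopyTo(copied, pages[currentPage], posInPage, run);
            copied += run;
            posInPage += run;
        }
    }
}
```

Whether this helps in practice presumably depends on how long the appended strings are: for very short strings the per-call overhead may dominate, which could be one reason Append2() showed no gain.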
$endgroup$








  • 4
    $begingroup$
    So what kind of crazy stuff does one need such large data types for?
    $endgroup$
    – t3chb0t
    11 hours ago

  • 3
    $begingroup$
    OK... but isn't streaming it easier and faster than loading the entire file into memory? It screams: the XY Problem. Your users are not responsible for you wasting RAM :-P
    $endgroup$
    – t3chb0t
    11 hours ago

  • 1
    $begingroup$
    The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM.
    $endgroup$
    – t3chb0t
    10 hours ago

  • 1
    $begingroup$
    Oh boy... this sounds like you're pushing JSON over CSV... this is even more scary than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-]
    $endgroup$
    – t3chb0t
    10 hours ago

  • 1
    $begingroup$
    @DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff?
    $endgroup$
    – Pieter Witvoet
    10 hours ago
c# performance strings pagination






edited 6 hours ago









Simon Forsberg

asked 13 hours ago









Dan W

$begingroup$
@DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff?
$endgroup$
– Pieter Witvoet
10 hours ago










1 Answer


















5












$begingroup$


    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;



Some of these names are rather cryptic. I'm not sure why c isn't private. And surely some of the fields should be readonly?






        pagesize = (long)Math.Pow(2, pagedepth);



IMO it's better style to use 1L << pagedepth.
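A minimal sketch of the shift-based version, assuming mpagesize is meant to be pagesize - 1 (which is what the n & mpagesize indexing suggests). The class and method names here are hypothetical, not from the question. Note the L suffix: with a plain int on the left, C# only uses the low five bits of the shift count, so 1 << 40 would silently evaluate to 256.

```csharp
using System;

static class PageMath
{
    // pageDepth is log2 of the page size (hypothetical name mirroring pagedepth).
    // 1L forces a 64-bit shift; an int shift would wrap for depths of 32 or more.
    public static long PageSize(int pageDepth) => 1L << pageDepth;

    // Mask for the offset within a page: the low pageDepth bits are all set.
    // Assumption: this is what mpagesize holds in the question's code.
    public static long PageMask(int pageDepth) => (1L << pageDepth) - 1;
}
```

This also avoids the round trip through double that Math.Pow implies.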






    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }



Shouldn't this have bounds checks?
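Something along these lines is what I have in mind. It's only a sketch: PagedChars, Append(char) and CheckIndex are hypothetical names, and I'm assuming the class tracks its logical length so that the indexer has something to check against.

```csharp
using System;
using System.Collections.Generic;

class PagedChars
{
    private readonly List<char[]> pages = new List<char[]>();
    private readonly int pageDepth;
    private readonly long pageMask;
    private long length;   // number of characters appended so far

    public PagedChars(int pageDepth)
    {
        this.pageDepth = pageDepth;
        pageMask = (1L << pageDepth) - 1;
        pages.Add(new char[1L << pageDepth]);
    }

    public long Length => length;

    public void Append(char ch)
    {
        int page = (int)(length >> pageDepth);
        if (page == pages.Count) pages.Add(new char[1L << pageDepth]);
        pages[page][length & pageMask] = ch;
        length++;
    }

    public char this[long n]
    {
        get { CheckIndex(n); return pages[(int)(n >> pageDepth)][n & pageMask]; }
        set { CheckIndex(n); pages[(int)(n >> pageDepth)][n & pageMask] = value; }
    }

    private void CheckIndex(long n)
    {
        // A single unsigned comparison rejects negative and too-large indices.
        if ((ulong)n >= (ulong)length)
            throw new ArgumentOutOfRangeException(nameof(n));
    }
}
```

Without the check, an index past the logical end but inside the last allocated page quietly returns stale data instead of failing. Whether the extra comparison is affordable on your hot path is something to measure.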






    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }



There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")].






    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);



In C# it's conventional for method names to start with an upper case letter.



There's no need to throw quite as much to the garbage collector. Consider as an alternative:



    var page0 = c[0];
    c.Clear();
    c.Add(page0);





    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372



Why? I don't think it sheds any light on the following method.




    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }



Missing some disposal: I'd use a using statement.



new string(char[]) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a method Write(char[], int, int).
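Putting both points together, a sketch of the save path could look like this. PageSaver and its parameters are hypothetical stand-ins for the question's fields; the split into a TextWriter overload is my own choice so the logic can be exercised without touching the disk.

```csharp
using System.Collections.Generic;
using System.IO;

static class PageSaver
{
    // pages, currentPage and currentPosInPage mirror the fields in the question.
    public static void Save(TextWriter writer, List<char[]> pages,
                            int currentPage, int currentPosInPage)
    {
        // Full pages are written straight from the array, with no intermediate string.
        for (int i = 0; i < currentPage; i++)
            writer.Write(pages[i], 0, pages[i].Length);

        // The last page is only written up to the current position.
        writer.Write(pages[currentPage], 0, currentPosInPage);
    }

    public static void SaveToFile(string path, List<char[]> pages,
                                  int currentPage, int currentPosInPage)
    {
        // using disposes (and flushes) the writer even if a write throws.
        using (StreamWriter writer = File.CreateText(path))
        {
            Save(writer, pages, currentPage, currentPosInPage);
        }
    }
}
```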






    public void fileOpen(string path) {
        clear();



Yikes! That should be mentioned in the method documentation.




        StreamReader sw = new StreamReader(path);
        int len = 0;
        while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
            if (!sw.EndOfStream) {
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;



I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] is full.



This method is also missing disposal.
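One way to keep the invariant, sketched under the assumption that the rest of the class wants currentPosInPage to stay strictly below the page size. PageLoader and its signature are hypothetical; the list is expected to already contain one page, as it does after clear().

```csharp
using System.Collections.Generic;
using System.IO;

static class PageLoader
{
    // Fills pages from reader and reports where the write position ends up.
    public static void Load(TextReader reader, List<char[]> pages, int pageSize,
                            out int currentPage, out int currentPosInPage)
    {
        currentPage = 0;
        while (true)
        {
            // ReadBlock only returns fewer than pageSize chars at end of input.
            int len = reader.ReadBlock(pages[currentPage], 0, pageSize);
            if (len < pageSize)
            {
                currentPosInPage = len;
                return;
            }

            // The page is full, so always advance: the position never equals pageSize.
            currentPage++;
            if (currentPage == pages.Count) pages.Add(new char[pageSize]);
        }
    }

    public static void LoadFromFile(string path, List<char[]> pages, int pageSize,
                                    out int currentPage, out int currentPosInPage)
    {
        using (var reader = new StreamReader(path))
        {
            Load(reader, pages, pageSize, out currentPage, out currentPosInPage);
        }
    }
}
```

An input whose length is an exact multiple of the page size ends with an empty trailing page and a position of 0, which matches what the other methods appear to expect.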






    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }



Why is this a method rather than a property? Why use multiplication rather than <<?






    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s



What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char[], int, int).
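A rough sketch of the page-at-a-time version. PageSlicer is a hypothetical helper; I'm keeping the half-open [start, end) semantics of the original loop and assuming 0 <= start <= end <= length and a result that fits in an ordinary string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PageSlicer
{
    public static string Substring(List<char[]> pages, int pageDepth, long start, long end)
    {
        long pageMask = (1L << pageDepth) - 1;

        // Pre-size the builder; checked() fails loudly if the range exceeds int.
        var sb = new StringBuilder(checked((int)(end - start)));

        long n = start;
        while (n < end)
        {
            char[] page = pages[(int)(n >> pageDepth)];
            int offset = (int)(n & pageMask);

            // Copy up to the end of this page or the end of the range, whichever is first.
            int count = (int)Math.Min(page.Length - offset, end - n);
            sb.Append(page, offset, count);
            n += count;
        }
        return sb.ToString();
    }
}
```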






    public bool match(string find, long start = 0) {
        //if (s.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
        return true;
    }



What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).






    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }



Bounds checks?






    // This method is a more sophisticated version of the Append() function above.
    // Surprisingly, in real-world testing, it doesn't seem to be any faster.



I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan.CopyTo.
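For illustration, a block-copying append built on string.CopyTo might look like the following. ChunkedAppender and PageAsString are hypothetical names, the page depth is assumed to be below 31 so a page fits an int-sized array, and I haven't benchmarked it against your version.

```csharp
using System;
using System.Collections.Generic;

class ChunkedAppender
{
    private readonly List<char[]> pages = new List<char[]>();
    private readonly int pageSize;
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public ChunkedAppender(int pageDepth)
    {
        pageSize = 1 << pageDepth;   // assumes pageDepth < 31
        pages.Add(new char[pageSize]);
    }

    public long Length => (long)currentPage * pageSize + currentPosInPage;

    public void Append(string s)
    {
        int copied = 0;
        while (copied < s.Length)
        {
            // Copy as much as fits in the current page in one block.
            int count = Math.Min(pageSize - currentPosInPage, s.Length - copied);
            s.CopyTo(copied, pages[currentPage], currentPosInPage, count);
            copied += count;
            currentPosInPage += count;

            // Roll over as soon as a page fills, so the position stays below pageSize.
            if (currentPosInPage == pageSize)
            {
                currentPage++;
                currentPosInPage = 0;
                if (currentPage == pages.Count) pages.Add(new char[pageSize]);
            }
        }
    }

    // Test helper only: exposes the first count chars of a page.
    public string PageAsString(int page, int count) => new string(pages[page], 0, count);
}
```

The loop body runs once per page touched rather than once per character, which is where I'd expect any gain to come from.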

















$endgroup$

















































edited 10 hours ago

answered 11 hours ago

Peter Taylor












  • $begingroup$
    Thanks! Added responses to my main post if you want to look.
    $endgroup$
    – Dan W
    10 hours ago






  • 1




    $begingroup$
    Regarding the last point, there's also string.CopyTo.
    $endgroup$
    – Pieter Witvoet
    10 hours ago










  • $begingroup$
    c# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
    $endgroup$
    – BurnsBA
    7 hours ago



























