Confusion about different running times of two algorithms in C [duplicate]












16
















This question already has an answer here:




  • Accessing elements of a matrix row-wise versus column-wise

    3 answers




I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:



long sum1(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (i=0; i < ROWS; i++) {
for (j=0; j < COLS; j++) {
sum += m[i][j];
}
}
return sum;
}

long sum2(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (j=0; j < COLS; j++) {
for (i=0; i < ROWS; i++) {
sum += m[i][j];
}
}

return sum;
}


When I execute the two functions with the given array, I get running times:



sum1: 0.19s



sum2: 1.25s



Can anyone explain why there is this huge difference?










share|improve this question















marked as duplicate by Alexei Levenkov, OmG, jxh c
Users with the  c badge can single-handedly close c questions as duplicates and reopen them as needed.

StackExchange.ready(function() {
if (StackExchange.options.isMobile) return;

$('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
var $hover = $(this).addClass('hover-bound'),
$msg = $hover.siblings('.dupe-hammer-message');

$hover.hover(
function() {
$hover.showInfoMessage('', {
messageElement: $msg.clone().show(),
transient: false,
position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
dismissable: false,
relativeToBody: true
});
},
function() {
StackExchange.helpers.removeMessages();
}
);
});
});
3 hours ago


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.



















  • You compiled with all retail optimizations on, right?

    – selbie
    7 hours ago











  • @selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

    – Boris Grunwald
    7 hours ago






  • 2





    As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

    – Christian Gibbons
    7 hours ago











  • If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

    – jxh
    7 hours ago











  • I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

    – Marc.2377
    6 hours ago
















16
















This question already has an answer here:




  • Accessing elements of a matrix row-wise versus column-wise

    3 answers




I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:



long sum1(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (i=0; i < ROWS; i++) {
for (j=0; j < COLS; j++) {
sum += m[i][j];
}
}
return sum;
}

long sum2(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (j=0; j < COLS; j++) {
for (i=0; i < ROWS; i++) {
sum += m[i][j];
}
}

return sum;
}


When I execute the two functions with the given array, I get running times:



sum1: 0.19s



sum2: 1.25s



Can anyone explain why there is this huge difference?










share|improve this question















marked as duplicate by Alexei Levenkov, OmG, jxh c
Users with the  c badge can single-handedly close c questions as duplicates and reopen them as needed.

StackExchange.ready(function() {
if (StackExchange.options.isMobile) return;

$('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
var $hover = $(this).addClass('hover-bound'),
$msg = $hover.siblings('.dupe-hammer-message');

$hover.hover(
function() {
$hover.showInfoMessage('', {
messageElement: $msg.clone().show(),
transient: false,
position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
dismissable: false,
relativeToBody: true
});
},
function() {
StackExchange.helpers.removeMessages();
}
);
});
});
3 hours ago


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.



















  • You compiled with all retail optimizations on, right?

    – selbie
    7 hours ago











  • @selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

    – Boris Grunwald
    7 hours ago






  • 2





    As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

    – Christian Gibbons
    7 hours ago











  • If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

    – jxh
    7 hours ago











  • I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

    – Marc.2377
    6 hours ago














16












16








16


6







This question already has an answer here:




  • Accessing elements of a matrix row-wise versus column-wise

    3 answers




I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:



long sum1(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (i=0; i < ROWS; i++) {
for (j=0; j < COLS; j++) {
sum += m[i][j];
}
}
return sum;
}

long sum2(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (j=0; j < COLS; j++) {
for (i=0; i < ROWS; i++) {
sum += m[i][j];
}
}

return sum;
}


When I execute the two functions with the given array, I get running times:



sum1: 0.19s



sum2: 1.25s



Can anyone explain why there is this huge difference?










share|improve this question

















This question already has an answer here:




  • Accessing elements of a matrix row-wise versus column-wise

    3 answers




I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:



long sum1(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (i=0; i < ROWS; i++) {
for (j=0; j < COLS; j++) {
sum += m[i][j];
}
}
return sum;
}

long sum2(long m[ROWS][COLS]) {
long register sum = 0;
int i,j;

for (j=0; j < COLS; j++) {
for (i=0; i < ROWS; i++) {
sum += m[i][j];
}
}

return sum;
}


When I execute the two functions with the given array, I get running times:



sum1: 0.19s



sum2: 1.25s



Can anyone explain why there is this huge difference?





This question already has an answer here:




  • Accessing elements of a matrix row-wise versus column-wise

    3 answers








c






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 6 hours ago









Govind Parmar

11.3k53361




11.3k53361










asked 7 hours ago









Boris GrunwaldBoris Grunwald

331210




331210




marked as duplicate by Alexei Levenkov, OmG, jxh c
Users with the  c badge can single-handedly close c questions as duplicates and reopen them as needed.

StackExchange.ready(function() {
if (StackExchange.options.isMobile) return;

$('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
var $hover = $(this).addClass('hover-bound'),
$msg = $hover.siblings('.dupe-hammer-message');

$hover.hover(
function() {
$hover.showInfoMessage('', {
messageElement: $msg.clone().show(),
transient: false,
position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
dismissable: false,
relativeToBody: true
});
},
function() {
StackExchange.helpers.removeMessages();
}
);
});
});
3 hours ago


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.









marked as duplicate by Alexei Levenkov, OmG, jxh c
Users with the  c badge can single-handedly close c questions as duplicates and reopen them as needed.

StackExchange.ready(function() {
if (StackExchange.options.isMobile) return;

$('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
var $hover = $(this).addClass('hover-bound'),
$msg = $hover.siblings('.dupe-hammer-message');

$hover.hover(
function() {
$hover.showInfoMessage('', {
messageElement: $msg.clone().show(),
transient: false,
position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
dismissable: false,
relativeToBody: true
});
},
function() {
StackExchange.helpers.removeMessages();
}
);
});
});
3 hours ago


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.















  • You compiled with all retail optimizations on, right?

    – selbie
    7 hours ago











  • @selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

    – Boris Grunwald
    7 hours ago






  • 2





    As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

    – Christian Gibbons
    7 hours ago











  • If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

    – jxh
    7 hours ago











  • I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

    – Marc.2377
    6 hours ago



















  • You compiled with all retail optimizations on, right?

    – selbie
    7 hours ago











  • @selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

    – Boris Grunwald
    7 hours ago






  • 2





    As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

    – Christian Gibbons
    7 hours ago











  • If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

    – jxh
    7 hours ago











  • I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

    – Marc.2377
    6 hours ago

















You compiled with all retail optimizations on, right?

– selbie
7 hours ago





You compiled with all retail optimizations on, right?

– selbie
7 hours ago













@selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

– Boris Grunwald
7 hours ago





@selbie I used gcc -0O -lrt matrix_sum.c -o matrix_sum

– Boris Grunwald
7 hours ago




2




2





As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

– Christian Gibbons
7 hours ago





As an aside, the register keyword doesn't do much of anything these days. Compilers pretty much ignore it aside from the fact that a register-qualified variable is not addressable.

– Christian Gibbons
7 hours ago













If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

– jxh
7 hours ago





If your timing program executes the same function over and over again, then you are timing how efficiently the hardware moves the matrix into your cache. If your cache is large enough, the whole matrix will fit into the cache after the first pass, and the timing differences will be minimal.

– jxh
7 hours ago













I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

– Marc.2377
6 hours ago





I wonder if loop interchange optimization would render the two implementations equally fast, as seen in Why is it faster to process a sorted array than an unsorted array?.

– Marc.2377
6 hours ago












6 Answers
6






active

oldest

votes


















16














Computers generally use cache to help speed up access to main memory.



The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.



Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.



The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.



So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.



Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.



Related Resource: Memory layout of multi-dimensional arrays






share|improve this answer


























  • Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

    – jamesqf
    3 hours ago



















22














C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:




Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).




Emphasis mine.



Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:



row and column major ordering



The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.






share|improve this answer





















  • 3





    Upvoted for actually citing the standard.

    – Robert Harvey
    7 hours ago





















2














On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).



By "skipping" data (reading far from the previous read), the CPU has to read the memory again.



That's why your first snippet is faster.



For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.



To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.



for (i=0; i < ROWS; i++) {
const long *row = m[i];
for (j=0; j < COLS; j++) {
sum += row[j];
}
}


If you cannot do this, that means that your data is wrongly oriented.






share|improve this answer


























  • Or you could just use the 2d notation in the more favorable order.

    – Robert Harvey
    7 hours ago











  • yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

    – Jean-François Fabre
    7 hours ago





















1














This is an issue with the cache.



The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.






share|improve this answer































    1














    A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.



    When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.



    By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)






    share|improve this answer































      1














      To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.



      http://valgrind.org/docs/manual/cg-manual.html






      share|improve this answer






























        6 Answers
        6






        active

        oldest

        votes








        6 Answers
        6






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        16














        Computers generally use cache to help speed up access to main memory.



        The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.



        Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.



        The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.



        So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.



        Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.



        Related Resource: Memory layout of multi-dimensional arrays






        share|improve this answer


























        • Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

          – jamesqf
          3 hours ago
















        16














        Computers generally use cache to help speed up access to main memory.



        The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.



        Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.



        The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.



        So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.



        Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.



        Related Resource: Memory layout of multi-dimensional arrays






        share|improve this answer


























        • Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

          – jamesqf
          3 hours ago














        16












        16








        16







        Computers generally use cache to help speed up access to main memory.



        The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.



        Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.



        The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.



        So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.



        Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.



        Related Resource: Memory layout of multi-dimensional arrays






        share|improve this answer















        Computers generally use cache to help speed up access to main memory.



        The hardware usually used for main memory is relatively slow—it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount very fast but expensive memory called cache. Computers may have several levels of cache, some of it is built into the processor or the processor chip itself and some of it is located outside the processor chip.



        Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.



        The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.



        So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.



        Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.



        Related Resource: Memory layout of multi-dimensional arrays







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 4 hours ago









        Kunal

        313




        313










        answered 7 hours ago









        Eric PostpischilEric Postpischil

        76.8k880163




        76.8k880163













        • Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

          – jamesqf
          3 hours ago



















        • Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

          – jamesqf
          3 hours ago

















        Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

        – jamesqf
        3 hours ago





        Further, the MMU will fetch lines of data - everything in a range of consecutive memory addresses - into the cache in one operation. So if you're operating on data that's consecutive in memory - like a 1D array or 1D slice of a 2D array - then when array [i] is first accessed, [i+1], [i+2]...[i+n] are automatically prefetched into cache.

        – jamesqf
        3 hours ago













        22














        C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:




        Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).




        Emphasis mine.



        Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:



        row and column major ordering



        The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.






        share|improve this answer





















        • 3





          Upvoted for actually citing the standard.

          – Robert Harvey
          7 hours ago


















        22














        C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:




        Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).




        Emphasis mine.



        Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:



        row and column major ordering



        The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.






        share|improve this answer





















        • 3





          Upvoted for actually citing the standard.

          – Robert Harvey
          7 hours ago
















        22












        22








        22







        C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:




        Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).




        Emphasis mine.



        Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:



        row and column major ordering



        The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.






        share|improve this answer















        C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:




        Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).




        Emphasis mine.



        Here's an image from Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering:



        row and column major ordering



        The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 6 hours ago

























        answered 7 hours ago









        Govind ParmarGovind Parmar

        11.3k53361




        11.3k53361








        • 3





          Upvoted for actually citing the standard.

          – Robert Harvey
          7 hours ago
















        • 3





          Upvoted for actually citing the standard.

          – Robert Harvey
          7 hours ago










        3




        3





        Upvoted for actually citing the standard.

        – Robert Harvey
        7 hours ago







        Upvoted for actually citing the standard.

        – Robert Harvey
        7 hours ago













        2














        On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).



        By "skipping" data (reading far from the previous read), the CPU has to read the memory again.



        That's why your first snippet is faster.



        For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.



        To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.



        for (i=0; i < ROWS; i++) {
        const long *row = m[i];
        for (j=0; j < COLS; j++) {
        sum += row[j];
        }
        }


        If you cannot do this, that means that your data is wrongly oriented.






        share|improve this answer


























        • Or you could just use the 2d notation in the more favorable order.

          – Robert Harvey
          7 hours ago











        • yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

          – Jean-François Fabre
          7 hours ago


















        2














        On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).



        By "skipping" data (reading far from the previous read), the CPU has to read the memory again.



        That's why your first snippet is faster.



        For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.



        To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.



        for (i=0; i < ROWS; i++) {
        const long *row = m[i];
        for (j=0; j < COLS; j++) {
        sum += row[j];
        }
        }


        If you cannot do this, that means that your data is wrongly oriented.






        share|improve this answer


























        • Or you could just use the 2d notation in the more favorable order.

          – Robert Harvey
          7 hours ago











        • yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

          – Jean-François Fabre
          7 hours ago
















        2












        2








        2







        On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).



        By "skipping" data (reading far from the previous read), the CPU has to read the memory again.



        That's why your first snippet is faster.



        For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.



        To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.



        for (i=0; i < ROWS; i++) {
        const long *row = m[i];
        for (j=0; j < COLS; j++) {
        sum += row[j];
        }
        }


        If you cannot do this, that means that your data is wrongly oriented.






        share|improve this answer















        On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).



        By "skipping" data (reading far from the previous read), the CPU has to read the memory again.



        That's why your first snippet is faster.



        For more complex operations (fast fourier transform for instance), where data is read more than once (unlike your example) a lot of libraries (FFTW for instance) propose to use a stride to accomodate your data organization (in rows/in columns). Never use it, always transpose your data first and use a stride of 1, it will be faster than trying to do it without transposition.



        To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.



        for (i=0; i < ROWS; i++) {
        const long *row = m[i];
        for (j=0; j < COLS; j++) {
        sum += row[j];
        }
        }


        If you cannot do this, that means that your data is wrongly oriented.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 7 hours ago

























        answered 7 hours ago









        Jean-François FabreJean-François Fabre

        104k955112




        104k955112













        • Or you could just use the 2d notation in the more favorable order.

          – Robert Harvey
          7 hours ago











        • yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

          – Jean-François Fabre
          7 hours ago





















        • Or you could just use the 2d notation in the more favorable order.

          – Robert Harvey
          7 hours ago











        • yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

          – Jean-François Fabre
          7 hours ago



















        Or you could just use the 2d notation in the more favorable order.

        – Robert Harvey
        7 hours ago





        Or you could just use the 2d notation in the more favorable order.

        – Robert Harvey
        7 hours ago













        yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

        – Jean-François Fabre
        7 hours ago







        yes, but I like the concept of context: select row, work on the row. And the inner loop could be moved to another 1D-only sum/product/whatever function which uses SSE or whatever. Also, avoids a non-optimizing compiler to compute the indexes each time.

        – Jean-François Fabre
        7 hours ago













        1














        This is an issue with the cache.



        The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.






        share|improve this answer




























          1














          This is an issue with the cache.



          The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.






          share|improve this answer


























            1












            1








            1







            This is an issue with the cache.



            The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.






            share|improve this answer













            This is an issue with the cache.



            The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered 7 hours ago









            BromanBroman

            6,679112342




            6,679112342























                1














                A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.



                When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.



                By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)






                share|improve this answer




























                  1














                  A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.



                  When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.



                  By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)






                  share|improve this answer


























                    1












                    1








                    1







                    A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.



                    When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.



                    By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)






                    share|improve this answer













                    A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.



                    When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.



                    By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered 7 hours ago









                    Christian GibbonsChristian Gibbons

                    1,885614




                    1,885614























                        1














                        To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.



                        http://valgrind.org/docs/manual/cg-manual.html






                        share|improve this answer




























                          1














                          To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.



                          http://valgrind.org/docs/manual/cg-manual.html






                          share|improve this answer


























                            1












                            1








                            1







                            To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.



                            http://valgrind.org/docs/manual/cg-manual.html






                            share|improve this answer













                            To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.



                            http://valgrind.org/docs/manual/cg-manual.html







                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered 4 hours ago









                            CSMCSM

                            1,080179




                            1,080179