Is Higham's computation of the mean worth the price?
In Accuracy and Stability of Numerical Algorithms, equation 1.6a, Higham gives the following update formula for the mean:
$$
M_{1} := x_1, \quad
M_{k+1} := M_{k} + \frac{x_{k+1} - M_{k}}{k+1}
$$
Ok, one benefit of this update is that it doesn't overflow, which is a potential problem when we naively compute $\sum_{i=1}^{k} x_i$ and only then divide by $k$.
But the naive sum requires $N-1$ additions and one division, and potentially vectorizes very nicely, whereas the update proposed by Higham requires $2N-2$ additions and $N-1$ divisions, and I'm unaware of any assembly instructions that would vectorize all of this.
So is Higham's update formula worth using? Are there benefits I'm not seeing?
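For concreteness, here is roughly what I mean by the two variants. This is just a quick C++ sketch (the names are mine, and it assumes $n \ge 1$), not anyone's reference implementation:

#include <cstddef>

// Naive: accumulate the sum, divide once at the end. N-1 additions, 1 division.
double naive_mean(const double* x, std::size_t n) {
    double sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum / n;
}

// Running mean in the style of 1.6a: 2(N-1) additions/subtractions, N-1 divisions.
double running_mean(const double* x, std::size_t n) {
    double M = x[0];                   // M_1 = x_1
    for (std::size_t k = 1; k < n; ++k)
        M += (x[k] - M) / (k + 1);     // M_{k+1} = M_k + (x_{k+1} - M_k)/(k+1)
    return M;
}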
Note: Higham gives the following generic advice:
"It is advantageous to express update formulae as newvalue = oldvalue +
smallcorrection if the small correction can be computed with many correct significant figures."
Update 1.6a does take this form, but it's not clear to me that the small correction can be computed to many correct significant figures.
Edit: I found an empirical study of the performance of various methods of computing means, and 1.6a came highly recommended; see
Robert F. Ling (1974) Comparison of Several Algorithms for Computing Sample Means and Variances, Journal of the American Statistical Association, 69:348, 859-866, DOI: 10.1080/01621459.1974.10480219
However, even after reading that paper it still wasn't clear to me that the update is worth the price; and in any case I was hoping for worst- and average-case bounds on the accumulated rounding error.
2 Answers
Suppose you have $m$ numbers and the vector length $l$ divides $m$, i.e., $m = ln$. Then you can apply Higham's scheme to each group of $n$ numbers. This vectorizes nicely, especially if the numbers are interleaved, i.e., all the first elements are contiguous in memory, followed by all the second elements, etc. When you have computed the average of each group, you have $l$ numbers left. You apply a sequential implementation of Higham's scheme to these and then you are done. If the vector length does not divide $m$, then pad the data with zeros, split the data into $l$ groups with $n$ numbers each and proceed as before. When you have the average $\mu_j$ of each group, you compute $(n/m)\mu_j$ and simply add these numbers using the naive summation algorithm.
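As a rough scalar sketch of this grouping (assuming $l$ divides $m$ and the interleaved layout described above; the names are placeholders, and the inner loop over lanes is what a compiler or hand-written intrinsics would map onto SIMD registers):

#include <cstddef>
#include <vector>

// Mean of m numbers via L independent Higham streams.
// Interleaved layout assumed: element k of stream j is x[k*L + j].
double grouped_mean(const std::vector<double>& x, std::size_t L) {
    const std::size_t m = x.size();
    const std::size_t n = m / L;                       // elements per stream; assumes L divides m
    std::vector<double> M(x.begin(), x.begin() + L);   // M_1 for each stream
    for (std::size_t k = 1; k < n; ++k)                // Higham update, all streams in lockstep
        for (std::size_t j = 0; j < L; ++j)
            M[j] += (x[k*L + j] - M[j]) / double(k + 1);
    // Each stream averaged n numbers, so each partial mean carries weight n/m = 1/L.
    double mean = 0;
    for (std::size_t j = 0; j < L; ++j)
        mean += M[j] / double(L);
    return mean;
}

The last loop uses the $(n/m)\mu_j$ weighting; for equal group sizes one could equally well run Higham's scheme sequentially over the $l$ partial means.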
Whether Higham's formula is worth using depends on the situation. When overflow is an issue, the naive formula is useless. Both formulas have low arithmetic intensity, and I expect Higham's formula to run very slowly, simply because division has very high latency. Personally, I find that speed is very much secondary to reliability and accuracy.
Higham's primary goal in this very first chapter is simply to alert the reader to the fact that computing with floating-point numbers is profoundly different from real arithmetic. His advice about corrections is quoted correctly, but I would have written it a bit differently. Specifically, it does not matter if the correction is computed with a large relative error, as long as the correction is small relative to the original value: if, say, $M_k \approx 1$ and the correction is of order $10^{-8}$, then even a $1\%$ relative error in the correction perturbs the updated mean by only about $10^{-10}$.
– Carl Christian
I'm not sure that we're using the word "vectorize" in the same way. I meant vectorize to mean "compiles down to AVX-512 instructions"; it seems like you are talking about parallelization?
– user14717
@user14717: If you pad your array so that it's a multiple of the length of your AVX512 vector and then treat each part of the AVX512 vector as a separate accumulator, then you can do this in a vectorized fashion. I doubt that a compiler would do this automatically, though.
– Richard
Higham's algorithm seems very useful for online algorithms, or for settings with limited storage capacity such as edge processing. In these situations it is probably always worth the cost.
But to address your question as posed, I implemented an SSE2 version which I think captures it:
#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <xmmintrin.h>

int main(){
  const int num_count = 100000;
  alignas(16) float data[num_count];
  for(int i=0;i<num_count;i++)
    data[i] = rand()%100000+rand()/(double)RAND_MAX;

  {
    // Naive mean: accumulate the sum, then divide once.
    const auto t1 = std::chrono::high_resolution_clock::now();
    float sum=0;
    for(int i=0;i<num_count;i++){
      sum += data[i];
    }
    const float mean=sum/num_count;
    const auto t2 = std::chrono::high_resolution_clock::now();
    const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
    std::cout << "Simple Mean = "<<std::setprecision(20)<<mean<<std::endl;
  }

  {
    // Higham's update, run as four independent streams, one per SSE lane.
    const auto t1 = std::chrono::high_resolution_clock::now();
    __m128 mean = _mm_load_ps(&data[0]);            // M_1 for each of the four lanes
    for (int i=4;i<num_count;i+=4){
      const __m128 x    = _mm_load_ps(&data[i]);
      const __m128 diff = _mm_sub_ps(x,mean);
      const __m128 k    = _mm_set1_ps(i/4 + 1);     // data[i] is the (i/4 + 1)-th element of its lane
      const __m128 div  = _mm_div_ps(diff, k);
      mean = _mm_add_ps(mean, div);
    }
    float result[4];
    _mm_store_ps(result, mean);
    // Each lane holds the mean of num_count/4 elements, so their average is the overall mean.
    const float tmean = (result[0] + result[1] + result[2] + result[3])/4; //I'm suspicious about this step: probably throws away precision
    const auto t2 = std::chrono::high_resolution_clock::now();
    const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
    std::cout << "Higham Mean = "<<std::setprecision(20)<<tmean<<std::endl;
  }
}
and observe that
Exec time: 0.000225851 s
Simple Mean = 49891.23046875
Exec time: 0.0003759360000000000002 s
Higham Mean = 49890.26171875
Higham's mean takes a little longer to compute, and the two values differ by what might be a non-negligible amount; how much accuracy you need really depends on your application.
– Richard
Ok, this really changes the calculus. The clang-generated code for 1.6a takes 14x as long as the naive version, but yours is only 1.5x.
– user14717