Is Higham's computation of the mean worth the price?

In Accuracy and Stability of Numerical Algorithms, equation (1.6a), Higham gives the following update formula for the mean:
$$
M_{1} := x_1, \quad
M_{k+1} := M_{k} + \frac{x_{k+1} - M_k}{k+1}
$$

One benefit of this update is that it doesn't overflow, which is a potential problem when we naively compute $\sum_{i=1}^{N} x_i$ and then divide by $N$ at the end.
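
For concreteness, here is a scalar version of both computations (the function names are mine, purely for illustration):

    #include <cstddef>

    // Naive mean: accumulate the sum, divide once at the end.
    // The running sum can overflow (or lose accuracy) even when the mean itself is representable.
    double naive_mean(const double* x, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += x[i];
        return sum / static_cast<double>(n);
    }

    // Higham (1.6a): keep a running mean and apply a small correction per element,
    // so the accumulator always stays on the scale of the data.
    double running_mean(const double* x, std::size_t n) {
        double m = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            m += (x[i] - m) / static_cast<double>(i + 1);   // M_{k+1} = M_k + (x_{k+1} - M_k)/(k+1)
        return m;
    }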



But the naive sum requires $N-1$ additions and one division, and potentially vectorizes very nicely, whereas the update proposed by Higham requires $2N-2$ additions and $N-1$ divisions, and I'm unaware of any assembly instructions that would vectorize all of this.



So is Higham's update formula worth using? Are there benefits I'm not seeing?



Note: Higham gives the following generic advice: "It is advantageous to express update formulae as newvalue = oldvalue + smallcorrection if the small correction can be computed with many correct significant figures."



Update (1.6a) does take this form, but it's not clear to me that the small correction can be computed to many significant figures.



Edit: I found an empirical study of the performance of various methods of computing means, and (1.6a) came highly recommended; see



Robert F. Ling (1974), "Comparison of Several Algorithms for Computing Sample Means and Variances," Journal of the American Statistical Association, 69(348), 859–866. DOI: 10.1080/01621459.1974.10480219.



However, after reading that paper it still wasn't clear to me that the update is worth the price; and in any case I was hoping for worst-case and average-case bounds on the accumulated rounding error.

floating-point numerics

asked by user14717


2 Answers

Suppose you have $m$ numbers and the vector length $l$ divides $m$, i.e., $m=ln$. Then you can apply Higham's scheme to each group of $n$ numbers. This vectorizes nicely, especially if the numbers are interleaved, i.e., all the first elements are contiguous in memory, followed by all the second elements, etc. When you have computed the average of each group you have $l$ numbers left. You apply a sequential implementation of Higham's scheme to these and then you are done. If the vector length does not divide $m$, then pad the data with zeros, split the data into $l$ groups with $n$ numbers each and proceed as before. When you have the average $\mu_j$ of each group, you compute $(n/m)\mu_j$ and simply add these numbers using the naive summation algorithm.
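
A scalar sketch of the grouped scheme just described (the lane count and the helper name are mine, purely illustrative; a real SIMD version would keep the $l$ accumulators in a single vector register, as in the other answer):

    #include <cstddef>
    #include <vector>

    // Grouped running mean: L independent accumulators, each fed every L-th element.
    // Assumes for simplicity that L divides x.size(); otherwise pad with zeros and
    // reweight as described above.
    double grouped_mean(const std::vector<double>& x, std::size_t L) {
        const std::size_t n = x.size() / L;                 // elements per group
        std::vector<double> mu(L, 0.0);                     // per-group running means
        std::vector<std::size_t> count(L, 0);

        for (std::size_t i = 0; i < L * n; ++i) {
            const std::size_t g = i % L;                    // which accumulator this element feeds
            const double k = static_cast<double>(++count[g]);
            mu[g] += (x[i] - mu[g]) / k;                    // Higham's update within the group
        }

        // Combine the group means; each carries weight n/m = 1/L.
        double mean = 0.0;
        for (std::size_t g = 0; g < L; ++g)
            mean += mu[g] / static_cast<double>(L);
        return mean;
    }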




Whether Higham's formula is worth using depends on the situation. When overflow is an issue, the naive formula is useless. Both formulas have low arithmetic intensity, and I expect Higham's formula to run very slowly simply because division has very high latency. Personally, I find that speed is very much secondary to reliability and accuracy.


          Higham's primary goal in this very first chapter is simply to alert the reader to the fact that computing with floating point numbers is profoundly different from real arithmetic. His advice about corrections is quoted correctly, but I would have written it a bit differently. Specifically, it does not matter if the correction is computed with a large relative error as long as the correction is small relative to the original value.
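
(For scale: if the old value is of order $1$ and the correction is of order $10^{-14}$, then computing the correction with only two correct digits, i.e. a relative error of $10^{-2}$, perturbs the updated value by only about $10^{-16}$, which is at the level of double-precision roundoff in the old value itself.)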




– Carl Christian

• I'm not sure that we're using the word "vectorize" in the same way. I meant vectorize to mean "compiles down to AVX-512 instructions"; it seems like you are talking about parallelization? – user14717

• @user14717: If you pad your array so that it's a multiple of the length of your AVX512 vector and then treat each part of the AVX512 vector as a separate accumulator, then you can do this in a vectorized fashion. I doubt that a compiler would do this automatically, though. – Richard


          Higham's algorithm seems very useful for online algorithms or algorithms in which you have limited storage capacity such as edge processing. In these situations it is probably always worth the cost.



But, to address your question as posed, I implemented an SSE2 version which I think captures it:



    #include <chrono>
    #include <cstdlib>
    #include <iomanip>
    #include <iostream>
    #include <xmmintrin.h>

    int main(){
      const int num_count = 100000;

      // Random test data in roughly [0, 100000).
      alignas(16) float data[num_count];
      for(int i=0;i<num_count;i++)
        data[i] = rand()%100000+rand()/(double)RAND_MAX;

      // Naive mean: accumulate the sum, divide once at the end.
      {
        const auto t1 = std::chrono::high_resolution_clock::now();
        float sum=0;
        for(int i=0;i<num_count;i++){
          sum += data[i];
        }
        const float mean=sum/num_count;

        const auto t2 = std::chrono::high_resolution_clock::now();
        const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
        std::cout << "Exec time:\t" << time_span1.count() << " s\n";
        std::cout << "Mean = "<<std::setprecision(20)<<mean<<std::endl;
      }

      // Higham-style running mean, four lanes at a time via SSE.
      {
        const auto t1 = std::chrono::high_resolution_clock::now();
        __m128 mean = _mm_load_ps(&data[0]);        // initialize each lane with its first element
        for (int i=4;i<num_count;i+=4){
          const __m128 x = _mm_load_ps(&data[i]);
          const __m128 diff = _mm_sub_ps(x,mean);
          const __m128 k = _mm_set1_ps(i/4);        // number of 4-element blocks already folded in
                                                    // (note: Higham's (1.6a) would divide by i/4 + 1 here)
          const __m128 div = _mm_div_ps(diff, k);
          mean = _mm_add_ps(mean, div);
        }
        float result[4];
        _mm_store_ps(result, mean);
        const float tmean = (result[0] + result[1] + result[2] + result[3])/4; //I'm suspicious about this step: probably throws away precision
        const auto t2 = std::chrono::high_resolution_clock::now();
        const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
        std::cout << "Exec time:\t" << time_span1.count() << " s\n";
        std::cout << "Mean = "<<std::setprecision(20)<<tmean<<std::endl;
      }
    }


          and observe that



          Exec time:  0.000225851 s
          Simple Mean = 49891.23046875
          Exec time: 0.0003759360000000000002 s
          Higham Mean = 49890.26171875


          Higham's mean takes a little longer to calculate and the values differ by what might be a non-negligible amount, though the accuracy you need really depends on your implementation.






– Richard

• Ok, this really changes the calculus. The clang-generated code for (1.6a) takes 14x as long as the naive version, but yours is only 1.5x. – user14717