Why can't I get the right sum of a 1D array with Numba (CUDA Python)?

I am trying to use CUDA Python with Numba.
The code below is meant to calculate the sum of a 1D array, but I don't know how to get a single result value rather than three values.

Python 3.5 with Numba and CUDA 8.0

import os,sys,time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (1,1)
tpb = (1,3)

@cuda.jit
def calcu_sum(D,T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    L = len(D)
    su = 0
    while index_i<L:
        su +=D[index_i]
        index_i +=bh
    print('su:',su)
    T[0,0]=su
    print('T:',T[0,0])

D = np.array([ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,
               0.43940639,0.43864551,0.43447691,0.43120232], dtype=np.float32)
T = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT= cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD,dT)

The output is:

D:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
  0.43940639  0.43864551  0.43447691  0.43120232]
su:  1.733004
su:  1.289852
su:  1.291317
T: 1.733004
T: 1.289852
T: 1.291317

Why do I get "1.733004 1.289852 1.291317" instead of the single value "4.31417383"? Note that 1.733004 + 1.289852 + 1.291317 = 4.314173.

I'm new to Numba and have read the Numba documentation, but I don't know how to do this. Can someone give advice?

edited Mar 29 '17 at 9:23 by talonmies
asked Mar 29 '17 at 7:58 by glen

The edits I made to your question were grammatically correct English. Please don't change them back to something that makes no sense.

– talonmies
Mar 29 '17 at 9:24

python cuda numba numba-pro

1 Answer

The reason you don't get the sum you expect is that you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran or Python as your language) is that you write kernel code which is executed by each thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share and sum their individual partial sums into a final sum.
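To make this concrete, here is a CPU-only NumPy sketch (not GPU code) that mimics what each of the three threads in the question's kernel computes. The helper name `per_thread_partial_sums` is made up for illustration, and accumulating in float64 is an assumption chosen to keep the printed digits stable.

```python
import numpy as np

# Input array copied from the question.
D = np.array([0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
              0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232],
             dtype=np.float32)

def per_thread_partial_sums(data, num_threads):
    """Emulate the kernel's strided loop on the CPU (hypothetical helper).

    Thread t starts at index t and steps by num_threads, which is exactly
    the slice data[t::num_threads].
    """
    return [float(data[t::num_threads].sum(dtype=np.float64))
            for t in range(num_threads)]

partials = per_thread_partial_sums(D, 3)
for t, p in enumerate(partials):
    print("thread", t, "partial sum:", round(p, 6))

# Nothing in the question's kernel performs this last step.
print("combined:", round(sum(partials), 6))
```

The three partial sums should match the `su:` values printed in the question. The final line is the step the kernel is missing: each thread only ever writes its own partial sum to `T[0,0]`.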

There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm which uses newer features of CUDA (warp shuffle instructions and atomic transactions) here.
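As a rough illustration of the idea behind a parallel reduction, the following is a CPU emulation in NumPy, not CUDA code and not the exact algorithm from the referenced material. The function name `tree_reduce` and the zero-padding to a power of two are assumptions made to keep the sketch short.

```python
import numpy as np

def tree_reduce(values):
    """Sum `values` pairwise, halving the number of active slots each pass.

    Hypothetical CPU stand-in for what the threads of one block would do
    in shared memory.
    """
    n = len(values)
    size = 1
    while size < n:          # round up to a power of two
        size *= 2
    buf = np.zeros(size, dtype=np.float64)
    buf[:n] = values         # unused slots stay 0 and do not affect the sum

    stride = size // 2
    while stride > 0:
        # On a GPU, "thread" i (for i < stride) would execute
        # buf[i] += buf[i + stride], followed by a barrier between passes.
        buf[:stride] += buf[stride:2 * stride]
        stride //= 2
    return float(buf[0])

# Combine the three per-thread partial sums printed in the question.
total = tree_reduce([1.733004, 1.289852, 1.291317])
print(round(total, 6))
```

Each pass halves the number of active slots, so combining N partial sums takes about log2(N) passes instead of N sequential additions by a single thread.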

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

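from numba import cuda, float32

tpb = (1,3)

@cuda.jit
def calcu_sum(D,T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    # one shared-memory slot per thread in the block
    sbuf = cuda.shared.array(tpb, float32)

    # each thread sums its own strided subset of D
    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i +=bh

    print('su:',su)

    # publish the partial sum, then wait until every thread has written
    sbuf[0,ty] = su
    cuda.syncthreads()

    # thread 0 alone adds up the partial sums and stores the result
    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0,i]
        print('T:',T[0,0])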

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.

edited Mar 29 '17 at 16:08

community wiki

3 revs
talonmies

Thank you. I'll study it.

– glen
Mar 29 '17 at 12:19
