CUDA kernel fails to launch after making simple code changes inside the kernel

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

-1

I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.

I've inserted the full kernel code at the end, but I'll try to exemplify the above first.

First define the first 2 contributions:

//contribution 1

VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);

//contribution 2

VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);

Now set output:

Case 1 - kernel launches and sets expected values:

V_pri[cell1_idx] = value1;

Case 2 - kernel launches and sets expected values:

V_pri[cell1_idx] = value2;

Case 3 - kernel does not launch:

V_pri[cell1_idx] = value1 + value2;

I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?

I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.

Here is the full kernel code:

template <typename VType, typename Class_CMBND>

__global__ void set_cmbnd_values_kernel(

cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,

Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,

CMBNDInfoCUDA& contact)

{



 int box_idx = blockIdx.x * blockDim.x + threadIdx.x;



 cuINT3 box_sizes = contact.cells_box.size();



 if (box_idx < box_sizes.dim()) {



     int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;

     int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;

     int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;



     cuReal hL = contact.hshift_secondary.norm();

     cuReal hR = contact.hshift_primary.norm();

     cuReal hmax = (hL > hR ? hL : hR);



     int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;



     if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;



     int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;



     cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;



     cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);



     VType V_2 = V_pri[cell2_idx];

     VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);



     //a values

     VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);

     VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);



     //b values adjusted with weights

     cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;

     cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;



     //V'' values at cell positions -1 and 1

     VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);

     VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);



     //Formula for V1

     V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)

        - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR

        + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);

 }

}

It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

add a comment |

-1

I've inserted the full kernel code at the end, but I'll try to exemplify the above first.

First define the first 2 contributions:

//contribution 1

VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);

//contribution 2

VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);

Now set output:

Case 1 - kernel launches and sets expected values:

V_pri[cell1_idx] = value1;

Case 2 - kernel launches and sets expected values:

V_pri[cell1_idx] = value2;

Case 3 - kernel does not launch:

V_pri[cell1_idx] = value1 + value2;

I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?

I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.

Here is the full kernel code:

template <typename VType, typename Class_CMBND>

__global__ void set_cmbnd_values_kernel(

cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,

Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,

CMBNDInfoCUDA& contact)

{



 int box_idx = blockIdx.x * blockDim.x + threadIdx.x;



 cuINT3 box_sizes = contact.cells_box.size();



 if (box_idx < box_sizes.dim()) {



     int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;

     int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;

     int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;



     cuReal hL = contact.hshift_secondary.norm();

     cuReal hR = contact.hshift_primary.norm();

     cuReal hmax = (hL > hR ? hL : hR);



     int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;



     if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;



     int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;



     cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;



     cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);



     VType V_2 = V_pri[cell2_idx];

     VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);



     //a values

     VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);

     VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);



     //b values adjusted with weights

     cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;

     cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;



     //V'' values at cell positions -1 and 1

     VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);

     VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);



     //Formula for V1

     V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)

        - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR

        + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);

 }

}

It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

add a comment |

-1

I've inserted the full kernel code at the end, but I'll try to exemplify the above first.

First define the first 2 contributions:

//contribution 1

VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);

//contribution 2

VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);

Now set output:

Case 1 - kernel launches and sets expected values:

V_pri[cell1_idx] = value1;

Case 2 - kernel launches and sets expected values:

V_pri[cell1_idx] = value2;

Case 3 - kernel does not launch:

V_pri[cell1_idx] = value1 + value2;

I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?

I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.

Here is the full kernel code:

template <typename VType, typename Class_CMBND>

__global__ void set_cmbnd_values_kernel(

cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,

Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,

CMBNDInfoCUDA& contact)

{



 int box_idx = blockIdx.x * blockDim.x + threadIdx.x;



 cuINT3 box_sizes = contact.cells_box.size();



 if (box_idx < box_sizes.dim()) {



     int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;

     int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;

     int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;



     cuReal hL = contact.hshift_secondary.norm();

     cuReal hR = contact.hshift_primary.norm();

     cuReal hmax = (hL > hR ? hL : hR);



     int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;



     if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;



     int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;



     cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;



     cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);



     VType V_2 = V_pri[cell2_idx];

     VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);



     //a values

     VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);

     VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);



     //b values adjusted with weights

     cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;

     cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;



     //V'' values at cell positions -1 and 1

     VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);

     VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);



     //Formula for V1

     V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)

        - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR

        + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);

 }

}

It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

I've inserted the full kernel code at the end, but I'll try to exemplify the above first.

First define the first 2 contributions:

//contribution 1

VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);

//contribution 2

VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);

Now set output:

Case 1 - kernel launches and sets expected values:

V_pri[cell1_idx] = value1;

Case 2 - kernel launches and sets expected values:

V_pri[cell1_idx] = value2;

Case 3 - kernel does not launch:

V_pri[cell1_idx] = value1 + value2;

I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?

I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.

Here is the full kernel code:

template <typename VType, typename Class_CMBND>

__global__ void set_cmbnd_values_kernel(

cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,

Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,

CMBNDInfoCUDA& contact)

{



 int box_idx = blockIdx.x * blockDim.x + threadIdx.x;



 cuINT3 box_sizes = contact.cells_box.size();



 if (box_idx < box_sizes.dim()) {



     int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;

     int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;

     int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;



     cuReal hL = contact.hshift_secondary.norm();

     cuReal hR = contact.hshift_primary.norm();

     cuReal hmax = (hL > hR ? hL : hR);



     int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;



     if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;



     int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;



     cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;



     cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);



     VType V_2 = V_pri[cell2_idx];

     VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);



     //a values

     VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);

     VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);



     //b values adjusted with weights

     cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;

     cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;



     //V'' values at cell positions -1 and 1

     VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);

     VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);



     //Formula for V1

     V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)

        - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR

        + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);

 }

}

It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.

cuda scientific-computing

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

edited Nov 24 '18 at 8:47

asked Nov 23 '18 at 20:44

qshn

162

asked Nov 23 '18 at 20:44

qshn

162

asked Nov 23 '18 at 20:44

qshn

162

add a comment |

1 Answer
1

active

oldest

votes

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.

answered Nov 24 '18 at 8:50

qshn

162

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452753%2fcuda-kernel-fails-to-launch-after-making-simple-code-changes-inside-the-kernel%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.

answered Nov 24 '18 at 8:50

qshn

162

add a comment |

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.

answered Nov 24 '18 at 8:50

qshn

162

add a comment |

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.

answered Nov 24 '18 at 8:50

qshn

162

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.

answered Nov 24 '18 at 8:50

qshn

162

answered Nov 24 '18 at 8:50

qshn

162

answered Nov 24 '18 at 8:50

qshn

162

answered Nov 24 '18 at 8:50

qshn

162

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr