Sparse 3D binning/histogram function











up vote
8
down vote

favorite
1












This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.



The goal is to take a Python list of coordinates in 3D, optionally with associated weights, define a cubic grid over the coordinates which divides the points into a variable number of "bins" or cells, and then computes the total number of points in each bin (or the sum of the weights). These sums are then associated to either the coordinates or indices of the bin.



Obviously there are existing functions in Python libraries which do this, namely numpy.histogram, scipy.stats.histogram, and less obviously numpy.unique. The reason I didn't use those is that I had to process huge catalogs of ${sim}10^6$ galaxies or more, and I wanted to make fairly small bins. The histogram functions store empty bins in memory, so I would often run out of memory trying to store huge numpy arrays of mostly zeros. numpy.unique avoids this, but it cannot handle summing weights instead of just counts.



So I created this function, abusing using the defaultdict subclass of the native Python dictionary to gain the summing feature. I found it to be both fast enough and that it solved my memory problems, but I am interested in improving it.



from collections import defaultdict

"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],
optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse
histogram (i.e. no empty bins) with bins of resolution (spacing) given by
res.
The weights option allows you to chose to histogram over counts
instead of weights (equivalent to all weights being 1).
The bin_index option lets you return the points with their bin indices
(the integers representing how many bins in each direction to walk to
find the specified bin) rather than centerpoint coordinates."""
def sparse_hist(points, res, weights=True, bin_index=False):
def _binindex(point):
point = point[:3]
bi = [int(x//res) for x in point]
bi = tuple(bi)
return bi

def _bincenter(point):
point = point[:3]
bc = [(x//res+0.5)*res for x in point]
bc = tuple(bc)
return bc

if not bin_index:
if weights:
pointlist = [(_bincenter(x), x[3]) for x in points]
else:
pointlist = [(_bincenter(x), 1) for x in points]
else:
if weights:
pointlist = [(_binindex(x), x[3]) for x in points]
else:
pointlist = [(_binindex(x), 1) for x in points]

pointdict = defaultdict(list)
for k,v in pointlist:
pointdict[k].append(v)

for key,val in pointdict.items():
val = sum(val)
pointdict.update({key:val})

return pointdict









share|improve this question









New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.




















  • Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
    – leftaroundabout
    Dec 3 at 16:50










  • @leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
    – Davis
    2 days ago

















up vote
8
down vote

favorite
1












This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.



The goal is to take a Python list of coordinates in 3D, optionally with associated weights, define a cubic grid over the coordinates which divides the points into a variable number of "bins" or cells, and then computes the total number of points in each bin (or the sum of the weights). These sums are then associated to either the coordinates or indices of the bin.



Obviously there are existing functions in Python libraries which do this, namely numpy.histogram, scipy.stats.histogram, and less obviously numpy.unique. The reason I didn't use those is that I had to process huge catalogs of ${sim}10^6$ galaxies or more, and I wanted to make fairly small bins. The histogram functions store empty bins in memory, so I would often run out of memory trying to store huge numpy arrays of mostly zeros. numpy.unique avoids this, but it cannot handle summing weights instead of just counts.



So I created this function, abusing using the defaultdict subclass of the native Python dictionary to gain the summing feature. I found it to be both fast enough and that it solved my memory problems, but I am interested in improving it.



from collections import defaultdict

"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],
optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse
histogram (i.e. no empty bins) with bins of resolution (spacing) given by
res.
The weights option allows you to chose to histogram over counts
instead of weights (equivalent to all weights being 1).
The bin_index option lets you return the points with their bin indices
(the integers representing how many bins in each direction to walk to
find the specified bin) rather than centerpoint coordinates."""
def sparse_hist(points, res, weights=True, bin_index=False):
def _binindex(point):
point = point[:3]
bi = [int(x//res) for x in point]
bi = tuple(bi)
return bi

def _bincenter(point):
point = point[:3]
bc = [(x//res+0.5)*res for x in point]
bc = tuple(bc)
return bc

if not bin_index:
if weights:
pointlist = [(_bincenter(x), x[3]) for x in points]
else:
pointlist = [(_bincenter(x), 1) for x in points]
else:
if weights:
pointlist = [(_binindex(x), x[3]) for x in points]
else:
pointlist = [(_binindex(x), 1) for x in points]

pointdict = defaultdict(list)
for k,v in pointlist:
pointdict[k].append(v)

for key,val in pointdict.items():
val = sum(val)
pointdict.update({key:val})

return pointdict









share|improve this question









New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.




















  • Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
    – leftaroundabout
    Dec 3 at 16:50










  • @leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
    – Davis
    2 days ago















up vote
8
down vote

favorite
1









up vote
8
down vote

favorite
1






1





This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.



The goal is to take a Python list of coordinates in 3D, optionally with associated weights, define a cubic grid over the coordinates which divides the points into a variable number of "bins" or cells, and then computes the total number of points in each bin (or the sum of the weights). These sums are then associated to either the coordinates or indices of the bin.



Obviously there are existing functions in Python libraries which do this, namely numpy.histogram, scipy.stats.histogram, and less obviously numpy.unique. The reason I didn't use those is that I had to process huge catalogs of ${sim}10^6$ galaxies or more, and I wanted to make fairly small bins. The histogram functions store empty bins in memory, so I would often run out of memory trying to store huge numpy arrays of mostly zeros. numpy.unique avoids this, but it cannot handle summing weights instead of just counts.



So I created this function, abusing using the defaultdict subclass of the native Python dictionary to gain the summing feature. I found it to be both fast enough and that it solved my memory problems, but I am interested in improving it.



from collections import defaultdict

"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],
optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse
histogram (i.e. no empty bins) with bins of resolution (spacing) given by
res.
The weights option allows you to chose to histogram over counts
instead of weights (equivalent to all weights being 1).
The bin_index option lets you return the points with their bin indices
(the integers representing how many bins in each direction to walk to
find the specified bin) rather than centerpoint coordinates."""
def sparse_hist(points, res, weights=True, bin_index=False):
def _binindex(point):
point = point[:3]
bi = [int(x//res) for x in point]
bi = tuple(bi)
return bi

def _bincenter(point):
point = point[:3]
bc = [(x//res+0.5)*res for x in point]
bc = tuple(bc)
return bc

if not bin_index:
if weights:
pointlist = [(_bincenter(x), x[3]) for x in points]
else:
pointlist = [(_bincenter(x), 1) for x in points]
else:
if weights:
pointlist = [(_binindex(x), x[3]) for x in points]
else:
pointlist = [(_binindex(x), 1) for x in points]

pointdict = defaultdict(list)
for k,v in pointlist:
pointdict[k].append(v)

for key,val in pointdict.items():
val = sum(val)
pointdict.update({key:val})

return pointdict









share|improve this question









New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.



The goal is to take a Python list of coordinates in 3D, optionally with associated weights, define a cubic grid over the coordinates which divides the points into a variable number of "bins" or cells, and then computes the total number of points in each bin (or the sum of the weights). These sums are then associated to either the coordinates or indices of the bin.



Obviously there are existing functions in Python libraries which do this, namely numpy.histogram, scipy.stats.histogram, and less obviously numpy.unique. The reason I didn't use those is that I had to process huge catalogs of ${sim}10^6$ galaxies or more, and I wanted to make fairly small bins. The histogram functions store empty bins in memory, so I would often run out of memory trying to store huge numpy arrays of mostly zeros. numpy.unique avoids this, but it cannot handle summing weights instead of just counts.



So I created this function, abusing using the defaultdict subclass of the native Python dictionary to gain the summing feature. I found it to be both fast enough and that it solved my memory problems, but I am interested in improving it.



from collections import defaultdict

"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],
optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse
histogram (i.e. no empty bins) with bins of resolution (spacing) given by
res.
The weights option allows you to chose to histogram over counts
instead of weights (equivalent to all weights being 1).
The bin_index option lets you return the points with their bin indices
(the integers representing how many bins in each direction to walk to
find the specified bin) rather than centerpoint coordinates."""
def sparse_hist(points, res, weights=True, bin_index=False):
def _binindex(point):
point = point[:3]
bi = [int(x//res) for x in point]
bi = tuple(bi)
return bi

def _bincenter(point):
point = point[:3]
bc = [(x//res+0.5)*res for x in point]
bc = tuple(bc)
return bc

if not bin_index:
if weights:
pointlist = [(_bincenter(x), x[3]) for x in points]
else:
pointlist = [(_bincenter(x), 1) for x in points]
else:
if weights:
pointlist = [(_binindex(x), x[3]) for x in points]
else:
pointlist = [(_binindex(x), 1) for x in points]

pointdict = defaultdict(list)
for k,v in pointlist:
pointdict[k].append(v)

for key,val in pointdict.items():
val = sum(val)
pointdict.update({key:val})

return pointdict






python python-3.x statistics memory-optimization






share|improve this question









New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited Dec 3 at 10:25





















New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked Dec 3 at 0:30









Davis

434




434




New contributor




Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
    – leftaroundabout
    Dec 3 at 16:50










  • @leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
    – Davis
    2 days ago




















  • Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
    – leftaroundabout
    Dec 3 at 16:50










  • @leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
    – Davis
    2 days ago


















Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50




Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50












@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago






@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago












2 Answers
2






active

oldest

votes

















up vote
5
down vote



accepted










Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:



pointdict = defaultdict(list)
for k,v in pointlist:
pointdict[k].append(v)
for key,val in pointdict.items():
val = sum(val)
pointdict.update({key:val})


And instead sum directly:



histogram = defaultdict(int)
for i, weight in pointlist:
histogram[i] += weight


I also renamed the variables so it is a bit clearer what they represent.





Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.



points = ((_bincenter(x), x[3]) for x in points)





share|improve this answer























  • I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
    – Davis
    yesterday


















up vote
7
down vote













Some suggestions:




  • Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

  • Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

  • Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.


  • Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:



    if bin_index:
    binning_function = _bincenter
    else:


    if weights:



  • Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

  • Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

  • I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.


Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.






share|improve this answer





















    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });






    Davis is a new contributor. Be nice, and check out our Code of Conduct.










    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208899%2fsparse-3d-binning-histogram-function%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    5
    down vote



    accepted










    Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:



    pointdict = defaultdict(list)
    for k,v in pointlist:
    pointdict[k].append(v)
    for key,val in pointdict.items():
    val = sum(val)
    pointdict.update({key:val})


    And instead sum directly:



    histogram = defaultdict(int)
    for i, weight in pointlist:
    histogram[i] += weight


    I also renamed the variables so it is a bit clearer what they represent.





    Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.



    points = ((_bincenter(x), x[3]) for x in points)





    share|improve this answer























    • I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
      – Davis
      yesterday















    up vote
    5
    down vote



    accepted










    Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:



    pointdict = defaultdict(list)
    for k,v in pointlist:
    pointdict[k].append(v)
    for key,val in pointdict.items():
    val = sum(val)
    pointdict.update({key:val})


    And instead sum directly:



    histogram = defaultdict(int)
    for i, weight in pointlist:
    histogram[i] += weight


    I also renamed the variables so it is a bit clearer what they represent.





    Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.



    points = ((_bincenter(x), x[3]) for x in points)





    share|improve this answer























    • I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
      – Davis
      yesterday













    up vote
    5
    down vote



    accepted







    up vote
    5
    down vote



    accepted






    Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:



    pointdict = defaultdict(list)
    for k,v in pointlist:
    pointdict[k].append(v)
    for key,val in pointdict.items():
    val = sum(val)
    pointdict.update({key:val})


    And instead sum directly:



    histogram = defaultdict(int)
    for i, weight in pointlist:
    histogram[i] += weight


    I also renamed the variables so it is a bit clearer what they represent.





    Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.



    points = ((_bincenter(x), x[3]) for x in points)





    share|improve this answer














    Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:



    pointdict = defaultdict(list)
    for k,v in pointlist:
    pointdict[k].append(v)
    for key,val in pointdict.items():
    val = sum(val)
    pointdict.update({key:val})


    And instead sum directly:



    histogram = defaultdict(int)
    for i, weight in pointlist:
    histogram[i] += weight


    I also renamed the variables so it is a bit clearer what they represent.





    Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.



    points = ((_bincenter(x), x[3]) for x in points)






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 3 at 13:54

























    answered Dec 3 at 9:41









    Graipher

    22.8k53384




    22.8k53384












    • I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
      – Davis
      yesterday


















    • I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
      – Davis
      yesterday
















    I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
    – Davis
    yesterday




    I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
    – Davis
    yesterday












    up vote
    7
    down vote













    Some suggestions:




    • Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

    • Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

    • Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.


    • Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:



      if bin_index:
      binning_function = _bincenter
      else:


      if weights:



    • Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

    • Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

    • I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.


    Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.






    share|improve this answer

























      up vote
      7
      down vote













      Some suggestions:




      • Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

      • Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

      • Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.


      • Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:



        if bin_index:
        binning_function = _bincenter
        else:


        if weights:



      • Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

      • Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

      • I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.


      Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.






      share|improve this answer























        up vote
        7
        down vote










        up vote
        7
        down vote









        Some suggestions:




        • Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

        • Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

        • Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.


        • Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:



          if bin_index:
          binning_function = _bincenter
          else:


          if weights:



        • Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

        • Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

        • I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.


        Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.






        share|improve this answer












        Some suggestions:




        • Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

        • Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

        • Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.


        • Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:



          if bin_index:
          binning_function = _bincenter
          else:


          if weights:



        • Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

        • Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

        • I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.


        Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Dec 3 at 2:08









        l0b0

        4,112923




        4,112923






















            Davis is a new contributor. Be nice, and check out our Code of Conduct.










            draft saved

            draft discarded


















            Davis is a new contributor. Be nice, and check out our Code of Conduct.













            Davis is a new contributor. Be nice, and check out our Code of Conduct.












            Davis is a new contributor. Be nice, and check out our Code of Conduct.
















            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208899%2fsparse-3d-binning-histogram-function%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown