Sparse 3D binning/histogram function

up vote
8
down vote

favorite

This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.

The goal is to take a Python list of coordinates in 3D, optionally with associated weights, define a cubic grid over the coordinates which divides the points into a variable number of "bins" or cells, and then computes the total number of points in each bin (or the sum of the weights). These sums are then associated to either the coordinates or indices of the bin.

Obviously there are existing functions in Python libraries which do this, namely numpy.histogram, scipy.stats.histogram, and less obviously numpy.unique. The reason I didn't use those is that I had to process huge catalogs of ${sim}10^6$ galaxies or more, and I wanted to make fairly small bins. The histogram functions store empty bins in memory, so I would often run out of memory trying to store huge numpy arrays of mostly zeros. numpy.unique avoids this, but it cannot handle summing weights instead of just counts.

So I created this function, ~~abusing~~ using the defaultdict subclass of the native Python dictionary to gain the summing feature. I found it to be both fast enough and that it solved my memory problems, but I am interested in improving it.

from collections import defaultdict



"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],

optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse

histogram (i.e. no empty bins) with bins of resolution (spacing) given by

res.

The weights option allows you to chose to histogram over counts

instead of weights (equivalent to all weights being 1).

The bin_index option lets you return the points with their bin indices 

(the integers representing how many bins in each direction to walk to 

find the specified bin) rather than centerpoint coordinates."""

def sparse_hist(points, res, weights=True, bin_index=False):

    def _binindex(point):

        point = point[:3]

        bi = [int(x//res) for x in point]

        bi = tuple(bi)

        return bi



    def _bincenter(point):

        point = point[:3]

        bc = [(x//res+0.5)*res for x in point]

        bc = tuple(bc)

        return bc



    if not bin_index:

        if weights:

            pointlist = [(_bincenter(x), x[3]) for x in points]

        else:

            pointlist = [(_bincenter(x), 1) for x in points]

    else:

        if weights:

            pointlist = [(_binindex(x), x[3]) for x in points]

        else:

            pointlist = [(_binindex(x), 1) for x in points]



    pointdict = defaultdict(list)

    for k,v in pointlist:

        pointdict[k].append(v)



    for key,val in pointdict.items():

        val = sum(val)

        pointdict.update({key:val})



    return pointdict

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50

@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago

add a comment |

up vote
8
down vote

favorite

This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.

from collections import defaultdict



"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],

optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse

histogram (i.e. no empty bins) with bins of resolution (spacing) given by

res.

The weights option allows you to chose to histogram over counts

instead of weights (equivalent to all weights being 1).

The bin_index option lets you return the points with their bin indices 

(the integers representing how many bins in each direction to walk to 

find the specified bin) rather than centerpoint coordinates."""

def sparse_hist(points, res, weights=True, bin_index=False):

    def _binindex(point):

        point = point[:3]

        bi = [int(x//res) for x in point]

        bi = tuple(bi)

        return bi



    def _bincenter(point):

        point = point[:3]

        bc = [(x//res+0.5)*res for x in point]

        bc = tuple(bc)

        return bc



    if not bin_index:

        if weights:

            pointlist = [(_bincenter(x), x[3]) for x in points]

        else:

            pointlist = [(_bincenter(x), 1) for x in points]

    else:

        if weights:

            pointlist = [(_binindex(x), x[3]) for x in points]

        else:

            pointlist = [(_binindex(x), 1) for x in points]



    pointdict = defaultdict(list)

    for k,v in pointlist:

        pointdict[k].append(v)



    for key,val in pointdict.items():

        val = sum(val)

        pointdict.update({key:val})



    return pointdict

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50

@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago

add a comment |

up vote
8
down vote

favorite

This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.

from collections import defaultdict



"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],

optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse

histogram (i.e. no empty bins) with bins of resolution (spacing) given by

res.

The weights option allows you to chose to histogram over counts

instead of weights (equivalent to all weights being 1).

The bin_index option lets you return the points with their bin indices 

(the integers representing how many bins in each direction to walk to 

find the specified bin) rather than centerpoint coordinates."""

def sparse_hist(points, res, weights=True, bin_index=False):

    def _binindex(point):

        point = point[:3]

        bi = [int(x//res) for x in point]

        bi = tuple(bi)

        return bi



    def _bincenter(point):

        point = point[:3]

        bc = [(x//res+0.5)*res for x in point]

        bc = tuple(bc)

        return bc



    if not bin_index:

        if weights:

            pointlist = [(_bincenter(x), x[3]) for x in points]

        else:

            pointlist = [(_bincenter(x), 1) for x in points]

    else:

        if weights:

            pointlist = [(_binindex(x), x[3]) for x in points]

        else:

            pointlist = [(_binindex(x), 1) for x in points]



    pointdict = defaultdict(list)

    for k,v in pointlist:

        pointdict[k].append(v)



    for key,val in pointdict.items():

        val = sum(val)

        pointdict.update({key:val})



    return pointdict

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

This is a function written in Python I wrote to process galaxy catalogs for cosmology research, however it is more generally applicable than just my pipeline.

from collections import defaultdict



"""Accepts a python list of 3D spatial points, e.g. [[x1,y1,z1],...],

optionally with weights e.g. [[x1,x2,x3,w1],...], and returns the sparse

histogram (i.e. no empty bins) with bins of resolution (spacing) given by

res.

The weights option allows you to chose to histogram over counts

instead of weights (equivalent to all weights being 1).

The bin_index option lets you return the points with their bin indices 

(the integers representing how many bins in each direction to walk to 

find the specified bin) rather than centerpoint coordinates."""

def sparse_hist(points, res, weights=True, bin_index=False):

    def _binindex(point):

        point = point[:3]

        bi = [int(x//res) for x in point]

        bi = tuple(bi)

        return bi



    def _bincenter(point):

        point = point[:3]

        bc = [(x//res+0.5)*res for x in point]

        bc = tuple(bc)

        return bc



    if not bin_index:

        if weights:

            pointlist = [(_bincenter(x), x[3]) for x in points]

        else:

            pointlist = [(_bincenter(x), 1) for x in points]

    else:

        if weights:

            pointlist = [(_binindex(x), x[3]) for x in points]

        else:

            pointlist = [(_binindex(x), 1) for x in points]



    pointdict = defaultdict(list)

    for k,v in pointlist:

        pointdict[k].append(v)



    for key,val in pointdict.items():

        val = sum(val)

        pointdict.update({key:val})



    return pointdict

python python-3.x statistics memory-optimization

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

edited Dec 3 at 10:25

asked Dec 3 at 0:30

Davis

434

New contributor

asked Dec 3 at 0:30

Davis

434

asked Dec 3 at 0:30

Davis

434

New contributor

Davis is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50

@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago

add a comment |

Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50

@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago

Have you considered using some kind of kernel density estimation? This generally gives more useful results than a histogram for sparse, only locally clustered data.
– leftaroundabout
Dec 3 at 16:50

@leftaroundabout I haven't been using this code for research for some time, but the original purpose was so that once one had some binned approximation of the catalog, one could map the weights through a function tailored to upweighting particular features in the statistics (e.g. two point correlation function), perhaps enhancing the Fisher information contained in the statistics. I am not sure whether a KDE could be used for such a purpose, but it's not impossible.
– Davis
2 days ago

add a comment |

2 Answers
2

active

oldest

votes

up vote
5
down vote

accepted

Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:

pointdict = defaultdict(list)

for k,v in pointlist:

    pointdict[k].append(v)

for key,val in pointdict.items():

    val = sum(val)

    pointdict.update({key:val})

And instead sum directly:

histogram = defaultdict(int)

for i, weight in pointlist:

    histogram[i] += weight

I also renamed the variables so it is a bit clearer what they represent.

Instead of making pointlist a list, rename it to points (or something else if you want to avoid reusing the name) and make it a generator. Currently your memory requirements are $mathcal{O}(n+m)$, where $n$ is the number of points you have (since you store them in an intermediate list) and $m$ is the number of bins (in your output dictionary). If you used a generator, this would drop to $mathcal{O}(m)$.

points = ((_bincenter(x), x[3]) for x in points)

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

add a comment |

up vote
7
down vote

Some suggestions:

Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.

Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:
```
if bin_index:

    binning_function = _bincenter

else:

    …



if weights:

    …
```

Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.

Rather than using a defaultdict(list) you can just pointdict.get(key, ) if you know the dict is sparse. This is purposely not a suggestion because whether it's appropriate or not depends on the programming environment, but it is an option.

answered Dec 3 at 2:08

l0b0

4,112923

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

Davis is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208899%2fsparse-3d-binning-histogram-function%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
5
down vote

accepted

Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:

pointdict = defaultdict(list)

for k,v in pointlist:

    pointdict[k].append(v)

for key,val in pointdict.items():

    val = sum(val)

    pointdict.update({key:val})

And instead sum directly:

histogram = defaultdict(int)

for i, weight in pointlist:

    histogram[i] += weight

I also renamed the variables so it is a bit clearer what they represent.

points = ((_bincenter(x), x[3]) for x in points)

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

add a comment |

up vote
5
down vote

accepted

Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:

pointdict = defaultdict(list)

for k,v in pointlist:

    pointdict[k].append(v)

for key,val in pointdict.items():

    val = sum(val)

    pointdict.update({key:val})

And instead sum directly:

histogram = defaultdict(int)

for i, weight in pointlist:

    histogram[i] += weight

I also renamed the variables so it is a bit clearer what they represent.

points = ((_bincenter(x), x[3]) for x in points)

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

add a comment |

up vote
5
down vote

accepted

Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:

pointdict = defaultdict(list)

for k,v in pointlist:

    pointdict[k].append(v)

for key,val in pointdict.items():

    val = sum(val)

    pointdict.update({key:val})

And instead sum directly:

histogram = defaultdict(int)

for i, weight in pointlist:

    histogram[i] += weight

I also renamed the variables so it is a bit clearer what they represent.

points = ((_bincenter(x), x[3]) for x in points)

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

Since you don't care about the distinct values, only the sum in each bin, you can avoid one of these for loops:

pointdict = defaultdict(list)

for k,v in pointlist:

    pointdict[k].append(v)

for key,val in pointdict.items():

    val = sum(val)

    pointdict.update({key:val})

And instead sum directly:

histogram = defaultdict(int)

for i, weight in pointlist:

    histogram[i] += weight

I also renamed the variables so it is a bit clearer what they represent.

points = ((_bincenter(x), x[3]) for x in points)

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

edited Dec 3 at 13:54

answered Dec 3 at 9:41

Graipher

22.8k53384

answered Dec 3 at 9:41

Graipher

22.8k53384

answered Dec 3 at 9:41

Graipher

22.8k53384

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

add a comment |

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

I'm accepting this answer due to it not only improving my readability and structure but also my algorithm, however both answers have contributed significantly.
– Davis
yesterday

add a comment |

up vote
7
down vote

Some suggestions:

Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.

Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:
```
if bin_index:

    binning_function = _bincenter

else:

    …



if weights:

    …
```

Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.

answered Dec 3 at 2:08

l0b0

4,112923

add a comment |

up vote
7
down vote

Some suggestions:

Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.

Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:
```
if bin_index:

    binning_function = _bincenter

else:

    …



if weights:

    …
```

Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.

answered Dec 3 at 2:08

l0b0

4,112923

add a comment |

up vote
7
down vote

Some suggestions:

Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.

Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:
```
if bin_index:

    binning_function = _bincenter

else:

    …



if weights:

    …
```

Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.

answered Dec 3 at 2:08

l0b0

4,112923

Some suggestions:

Passing the code through at least one linter such as flake8 or pycodestyle will make for more idiomatic and therefore readable code.

Do you have any automated tests? If so, do they test all four combinations of weights and bin_index?

Do you really need weights and bin_index to be defaulted? I'm generally suspicious of code which defaults parameter values without a very clear reason.

Functions are first class citizens in Python, so rather than your nested if statements you can decide which method to call and what the second parameter should be with two unnested if statements:
```
if bin_index:

    binning_function = _bincenter

else:

    …



if weights:

    …
```

Naming is very important. I can't easily follow your code because things are named for what they are and not for what they contain or what they are used for. For example, pointdict tells me nothing about its contents and res could be short for “result”, “response” or something else entirely.

Reusing variables is generally discouraged - they make the information flow confusing, and they should be optimized away by the interpreter.

I don't remember the precedence rules other than for simple arithmetic by heart, so I'd enforce it using parentheses in expressions like x//res+0.5.

answered Dec 3 at 2:08

l0b0

4,112923

answered Dec 3 at 2:08

l0b0

4,112923

answered Dec 3 at 2:08

l0b0

4,112923

answered Dec 3 at 2:08

l0b0

4,112923

add a comment |

Davis is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Davis is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr