Faster implementation of LSH (AND-OR)

up vote
2
down vote

favorite

I have a data set of size (160000,3200), in which all the elements are either zero or one. I want to find similar candidates. I have hashed it to (160000,200) using Minhash using one for-loop and it took about two minutes, which I am happy with. I have implemented Locality-sensitive Hashing(LSH) using AND-OR schema learned from chapter-3 of 'Mining of Massive Datasets' to find candidate pairs using for-loop in a for-loop but it took 30 minutes. I want to reduce this time. Is there any faster way?

Here is how I have done LSH - Minhash signature length (n) = 200,
sub-signature length (r) = 5, number of bands (b) = 40.

bucket-of-ids = 'empty list of dictionaries of 

                 length 40'

for each-user in 160000:

  for each-band in 40:

    r_signature = string(jth 5 elements)

    if r_signature in bucket-of-ids[band]:

       'add id of user to dictionary of band 

        using r_signature as key'

    else :

       'create r_signature as new key and then 

        add user id to it as list of values'

The Minhash signature matrix of size (160000,200) is a numpy array. My idea is, If I can convert it into (160000,40) array cheaply, where each element of new array is formed from 5 elements of minhash array, then maybe I can use numpy.unique() to get unique r_signature for each column to be used as keys for dictionary of candidate ids. I am new to python as well as coding. I can't think of a way to optimize it to make it run faster.

Here is the link to code as well as data :
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD

Note: I have observed that the time taken for Minhash part is increasing linearly with data(no.of users in this case), whereas for LSH part it is increasing non-linearly (for the first 6.25% it took 20.15 seconds and for the last 6.25% it took 132.3 seconds). I think it's necessary to optimize this part, if possible, to scale properly with data. I believe checking whether the key is already present in the dictionary is the part of code that is responsible for this.

Update: I have solved this by avoiding checking the presence of key in a dict, though I ended up using for-loop in a for-loop twice. Now it is taking 310 seconds for 160000 candidates and the time taken is scaling linearly with data. I have updated the corresponding code in the google-colab notebook.

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

add a comment |

up vote
2
down vote

favorite

Here is how I have done LSH - Minhash signature length (n) = 200,
sub-signature length (r) = 5, number of bands (b) = 40.

bucket-of-ids = 'empty list of dictionaries of 

                 length 40'

for each-user in 160000:

  for each-band in 40:

    r_signature = string(jth 5 elements)

    if r_signature in bucket-of-ids[band]:

       'add id of user to dictionary of band 

        using r_signature as key'

    else :

       'create r_signature as new key and then 

        add user id to it as list of values'

Here is the link to code as well as data :
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

add a comment |

up vote
2
down vote

favorite

Here is how I have done LSH - Minhash signature length (n) = 200,
sub-signature length (r) = 5, number of bands (b) = 40.

bucket-of-ids = 'empty list of dictionaries of 

                 length 40'

for each-user in 160000:

  for each-band in 40:

    r_signature = string(jth 5 elements)

    if r_signature in bucket-of-ids[band]:

       'add id of user to dictionary of band 

        using r_signature as key'

    else :

       'create r_signature as new key and then 

        add user id to it as list of values'

Here is the link to code as well as data :
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

Here is how I have done LSH - Minhash signature length (n) = 200,
sub-signature length (r) = 5, number of bands (b) = 40.

bucket-of-ids = 'empty list of dictionaries of 

                 length 40'

for each-user in 160000:

  for each-band in 40:

    r_signature = string(jth 5 elements)

    if r_signature in bucket-of-ids[band]:

       'add id of user to dictionary of band 

        using r_signature as key'

    else :

       'create r_signature as new key and then 

        add user id to it as list of values'

Here is the link to code as well as data :
https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD

python locality-sensitive-hash minhash

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

edited Nov 17 at 12:40

asked Nov 17 at 6:27

Ramki

113

asked Nov 17 at 6:27

Ramki

113

asked Nov 17 at 6:27

Ramki

113

add a comment |

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53348824%2ffaster-implementation-of-lsh-and-or%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Z7WdRj2XzCXfiIYvrBCDxC3g5e1,1,ttiv02,zGN4M7LOj WxSU6X8aNwBPRyttVOoxp6uD,y2DD

搜尋此網誌

Argthtjtr