Python: How to get the similar-sounding words together












17















I am trying to group all the similar-sounding words in a list.



I tried to get them using cosine similarity but that does not fulfil my purpose.



from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)


I know this is not the right approach, but I cannot seem to get a result like:



result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz'] 


where identical labels mark the words that sound similar.










      python python-3.x list






      edited Mar 25 at 11:19 by DirtyBit










      asked Mar 25 at 5:31 by Marc Stoch
























          1 Answer


















          29














          First, you need a suitable way to find similar-sounding words, i.e. a phonetic string-matching algorithm such as Soundex rather than cosine similarity. I would suggest:



          Using jellyfish:



          from jellyfish import soundex

          print(soundex("two"))
          print(soundex("to"))


          OUTPUT:



          T000
          T000


          Next, create a function that encodes the whole list, then sort the result so that matching codes end up next to each other:



          def getSoundexList(dList):
              # encode every word in the list with Soundex
              res = [soundex(x) for x in dList]
              # print(res) # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
              return res

          dataList = ['two','fourth','forth','dessert','to','desert']
          print(sorted(getSoundexList(dataList)))


          OUTPUT:



          ['D263', 'D263', 'F630', 'F630', 'T000', 'T000']
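If you need one label per word in the original order (the shape the question asks for), a minimal sketch could assign each distinct code a small integer. The function name `label_groups` is hypothetical, and the codes are copied from the commented output above rather than computed, so the snippet runs without jellyfish installed:

```python
# Soundex codes copied from the answer's commented output for dataList
# (an assumption made so this runs without jellyfish installed).
dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
codes = ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

def label_groups(codes):
    """Give every distinct code an integer id, in order of first appearance."""
    ids = {}
    # setdefault returns the existing id, or stores len(ids) for a new code
    return [ids.setdefault(code, len(ids)) for code in codes]

labels = label_groups(codes)
print(list(zip(dataList, labels)))
# [('two', 0), ('fourth', 1), ('forth', 1), ('dessert', 2), ('to', 0), ('desert', 2)]
```

Words that share a label share a Soundex code; in real use you would build `codes` with `[soundex(w) for w in dataList]`.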


          EDIT:



          Another way could be:



          Using fuzzy:



          import fuzzy
          soundex = fuzzy.Soundex(4)

          print(soundex("to"))
          print(soundex("two"))


          OUTPUT:



          T000
          T000


          EDIT 2:



          If you want them grouped, you could use groupby:



          from itertools import groupby

          def getSoundexList(dList):
              # sort the codes so that equal codes are adjacent, as groupby requires
              return sorted([soundex(x) for x in dList])

          dataList = ['two','fourth','forth','dessert','to','desert']
          print([list(g) for _, g in groupby(getSoundexList(dataList))])


          OUTPUT:



          [['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]


          EDIT 3:



          This one's for @Eric Duminil: let's say you want both the words and their respective Soundex values:



          Using a dict along with itemgetter:



          from itertools import groupby
          from operator import itemgetter

          def getSoundexDict(dList):
              # map each word to its Soundex value, then sort the (word, value) pairs on the value
              dict_ = dict(zip(dList, (soundex(x) for x in dList)))
              return sorted(dict_.items(), key=itemgetter(1))

          dataList = ['two','fourth','forth','dessert','to','desert']

          print([list(g) for _, g in groupby(getSoundexDict(dataList), key=itemgetter(1))])


          OUTPUT:



          [[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
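As an alternative to sorting plus groupby, the words can be collected in a dictionary keyed by code, which keeps them in input order. This is only a sketch: `group_by_code` and `known_codes` are hypothetical names, and the codes are hard-coded from the output above so that the example runs without any third-party package.

```python
from collections import defaultdict

def group_by_code(words, encode):
    """Collect words under their phonetic code, keeping input order."""
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)

# Codes copied from the output above; with jellyfish installed you could
# pass its soundex function as `encode` instead of this lookup table.
known_codes = {'two': 'T000', 'fourth': 'F630', 'forth': 'F630',
               'dessert': 'D263', 'to': 'T000', 'desert': 'D263'}

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
groups = group_by_code(dataList, known_codes.__getitem__)
print(list(groups.values()))
# [['two', 'to'], ['fourth', 'forth'], ['dessert', 'desert']]
```

Because no sort is involved, the groups appear in the order their first member occurs in the list.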


          EDIT 4 (for OP):



          Soundex:




          Soundex is a system whereby values are assigned to names in such a
          manner that similar-sounding names get the same value. These values
          are known as soundex encodings. A search application based on soundex
          will not search for a name directly but rather will search for the
          soundex encoding. By doing so, it will obtain all names that sound
          like the name being sought.




          read more..
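To make the encoding itself less of a black box, here is a simplified pure-Python sketch of the classic American Soundex rules. It is not the jellyfish or fuzzy implementation, the name `simple_soundex` is hypothetical, and it assumes plain ASCII words; it does reproduce the codes shown above for the example list.

```python
# Consonant classes of American Soundex; vowels, y, h and w have no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def simple_soundex(word):
    """Return a 4-character Soundex code (simplified sketch, ASCII input assumed)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its digit still blocks an equal neighbour
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not separate equal digits
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]          # pad with zeros, truncate to 4 characters

for word in ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']:
    print(word, simple_soundex(word))
```

Only the first letter and up to three consonant digits survive, which is why 'dessert' and 'desert' collapse to the same code.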






          answered Mar 25 at 5:34 by DirtyBit, edited Mar 26 at 6:07













          • @EricDuminil Pardon, but I don't quite get what isSoundex returning a boolean would do?
            – DirtyBit, Mar 25 at 9:16

          • He means the name isSoundex is a binary statement ('is' or 'is not'), and should therefore be a boolean-returning function. Maybe consider changing the name to something like getSoundexList?
            – user2397282, Mar 25 at 9:47

          • @user2397282 Crap, I overlooked it. Thank you. Edited! :)
            – DirtyBit, Mar 25 at 9:49

          • @EricDuminil Done! :)
            – DirtyBit, Mar 25 at 11:14

          • Oooooo nice answer sir ;) +1
            – Matt B., Mar 26 at 13:32


















