Regular expression '\w+' is supposed to return only words in English, but it is working differently












import re

s = 'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
words = re.findall(r'\w+', s)
print(words)


I expected the above code to return only English words, but I am getting the output below.



['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ',
'24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ',
'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']


Could someone explain how this is working?







python regex














edited Nov 23 '18 at 6:01









Tim Biegeleisen











asked Nov 23 '18 at 5:57









Amarnath Reddy








  • Note to everyone: The OP's code appears to be working in this demo. I can only speculate that there is some weird encoding issue happening.

    – Tim Biegeleisen
    Nov 23 '18 at 6:08











  • @TimBiegeleisen It looks like the demo you linked uses Python 2 but I'm guessing Amarnath is using Python 3, which does exhibit the problem. Amarnath, can you please edit the question to confirm which version of Python you're using?

    – David Z
    Nov 23 '18 at 6:11













  • "I expected the above code to return only english words" - why?

    – user2357112
    Nov 23 '18 at 6:14






  • re.findall(r'\w+', s, re.ASCII) ?

    – itzMEonTV
    Nov 23 '18 at 6:25











  • words = re.sub(r'[^a-zA-Z ]', '', s) works fine with sub, but findall behaves differently; not sure why.

    – Cua
    Nov 23 '18 at 6:42


























5 Answers
































      Modify your code as shown below to see why it prints output like that:



      import re

      s = u'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
      words = re.findall(r'\w+', s)
      print(words)

      # Print each code point of the string on its own line.
      for letter in s:
          print(letter)


      OUTPUT



      ['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ', '24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ', 'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']




































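      [per-character output, one character per line: each Kannada letter and each combining mark (shown as a dotted circle), followed by , 2 4 I a m w o r k i n g o n t h i s]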


      Those circles mark combining characters (vowel signs and the like). \w does not match them, so the code treats them much like spaces.
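To make the splitting visible, here is a minimal sketch that inspects one word from the question's string. It assumes Python 3 and uses the standard unicodedata module. The choice of the second word as the sample is only for illustration; the names word and pieces are not from the original post.

```python
import re
import unicodedata

# Second word of the sample string in the question.
word = 'ವಿಶೇಷ'

for ch in word:
    # Letters report a category starting with 'L'; combining marks
    # (vowel signs here) report a category starting with 'M'.
    category = unicodedata.category(ch)
    matches_w = bool(re.fullmatch(r'\w', ch))
    print(f'U+{ord(ch):04X}', category, matches_w)

# Only the base letters survive, one per match, because each
# combining mark in between ends the current \w+ run.
pieces = re.findall(r'\w+', word)
print(pieces)  # ['ವ', 'ಶ', 'ಷ'], as in the question's output
```

The marks are part of the written word, but they are not alphanumeric as far as \w is concerned, which is why the Kannada words come back in fragments.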




















        Look at @itzMEonTV's suggestion. In Python 3, a str pattern is compiled with re.UNICODE by default:



        In [46]: rex = re.compile(r'\w+')
        In [47]: rex
        Out[47]: re.compile(r'\w+', re.UNICODE)
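A small sketch of the same check in script form, assuming Python 3. The names unicode_rex, ascii_rex, and found are made up for this example.

```python
import re

# Default compilation of a str pattern: re.UNICODE is switched on implicitly.
unicode_rex = re.compile(r'\w+')

# Passing re.ASCII suppresses the implicit re.UNICODE flag.
ascii_rex = re.compile(r'\w+', re.ASCII)

print(bool(unicode_rex.flags & re.UNICODE))  # True
print(bool(ascii_rex.flags & re.UNICODE))    # False
print(bool(ascii_rex.flags & re.ASCII))      # True

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
found = ascii_rex.findall('ಕ್ಕೂ 24 I am')
print(found)  # ['24', 'I', 'am']
```

Note that the digits are still matched, so re.ASCII alone does not give "English words only".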





              I don't know why you expected \w+ to only match English words. It doesn't even do that in ASCII mode. It matches any sequence of \w characters, and the docs describe the actual behavior of \w:




              For Unicode (str) patterns:

              Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.



              For 8-bit (bytes) patterns:

              Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.




              The docs unfortunately don't get any more specific than that, but \w definitely isn't restricted to English.



              If you wanted [a-zA-Z0-9_], you can write out your intended character class explicitly, or you can use the re.ASCII flag. If you wanted [a-zA-Z], write that out explicitly.
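The two options can be compared side by side. This is a minimal sketch assuming Python 3. It uses a shortened version of the question's string to keep the example small, and the variable names are illustrative.

```python
import re

# Shortened sample: Kannada text, a number, then English words.
s = 'ವಿಶೇಷ ರಕ್ಷಣೆ, 24 I am working on this'

# re.ASCII restricts \w to [a-zA-Z0-9_], so digits are still matched.
ascii_words = re.findall(r'\w+', s, re.ASCII)
print(ascii_words)   # ['24', 'I', 'am', 'working', 'on', 'this']

# An explicit character class keeps ASCII letters only.
letter_words = re.findall(r'[a-zA-Z]+', s)
print(letter_words)  # ['I', 'am', 'working', 'on', 'this']
```

Neither pattern checks that a match is an English word; both only restrict which characters may appear in it.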















              edited Nov 23 '18 at 6:27

























              answered Nov 23 '18 at 6:18









              user2357112

























                      I cannot reproduce your observations, see the demo. Perhaps there is some encoding issue on your end, which is why \w is picking up the Kannada characters. But one workaround you could use here would be to spell out explicitly the ASCII character class you expect \w to stand for:



                      words = re.findall(r'[A-Za-z0-9_]+', s)
                      print(words)
















                      answered Nov 23 '18 at 6:04









                      Tim Biegeleisen























                          words = re.findall(r'\w+',s)


                          The reason \w+ doesn't pick up what you want is that it's missing the Unicode flag. The other answers here ignore encoding by simply saying which specific letters they are looking for.




                          \w

                          When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.




                          That is why.

















                          answered Nov 23 '18 at 6:09









                          Reckless Engineer








                          • I think you've got the right idea but a little backwards; actually, the problem appears to be that \w in the OP's code sample is picking up more than just [a-zA-Z0-9_]. I believe the difference is that what you're saying here applies to Python 2, but I suspect the code in the question is meant to run with Python 3.

                            – David Z
                            Nov 23 '18 at 6:13











                          • Guess you're right, though it could just be a localization issue, as stated above. It may require more thorough investigation.

                            – Reckless Engineer
                            Nov 23 '18 at 6:19






                          • Well, I can't rule it out, but I'm considerably more confident that it's a 2 vs 3 issue given that running the OP's code sample in Python 3 reproduces their issue but running it in Python 2 does not, and that your quote, which comes from the Python 2 documentation, describes the opposite behavior of what the OP is seeing, i.e. precisely what would happen if you run their code sample under Python 2.

                            – David Z
                            Nov 23 '18 at 6:23














                          • 1





                            I think you've got the right idea but a little backwards; actually, the problem appears to be that w in the OP's code sample is picking up more than just [a-zA-Z0-9_]. I believe the difference is that what you're saying here applies to Python 2, but I suspect the code in the question is meant to run with Python 3.

                            – David Z
                            Nov 23 '18 at 6:13





















                          0














                          Modify your code as shown below to see why it prints that output:



                          import re

                          s = u'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
                          words = re.findall(r'\w+', s)
                          print(words)


                          # Print every character (code point) of s on its own line.
                          for letter in s:
                              print(letter)


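                          OUTPUT



                          ['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ', '24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ', 'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']

                          followed by each character of s on its own line. The Kannada vowel signs and other combining marks are printed as separate characters, each shown on a dotted circle.


                          Those dotted circles are combining marks. \w does not match them, so the regex treats them much like spaces: each one ends the current match, which is why every Kannada word is split into fragments.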
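A minimal sketch of that explanation, assuming Python 3 and its standard `re` and `unicodedata` modules. It takes one word from the question's string and checks, character by character, which Unicode category it has and whether `\w` matches it. The names `word`, `marks`, and `fragments` are only illustrative.

```python
import re
import unicodedata

# One word taken from the question's string: three letters, two vowel signs.
word = 'ವಿಶೇಷ'

for ch in word:
    # A category starting with 'L' is a letter; 'M' is a combining mark.
    category = unicodedata.category(ch)
    matches_w = bool(re.fullmatch(r'\w', ch))
    print(repr(ch), category, matches_w)

# Characters that Unicode classifies as combining marks.
marks = [ch for ch in word if unicodedata.category(ch).startswith('M')]

# \w+ stops at every combining mark, so the word comes back in pieces.
fragments = re.findall(r'\w+', word)
print(fragments)
```

The pieces printed at the end should be the same 'ವ', 'ಶ', 'ಷ' that appear in the question's output.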
















                              answered Nov 23 '18 at 6:32









                              MPJ























                                  0














                                  Look at @itzMEonTV's suggestion:



                                  In [46]: rex=re.compile(r'\w+')
                                  In [47]: rex
                                  Out[47]: re.compile(r'\w+', re.UNICODE)
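Since the compiled pattern shows `re.UNICODE` switched on by default for `str` patterns in Python 3, the `re.ASCII` flag from the comment narrows `\w` back to `[a-zA-Z0-9_]`. The sketch below assumes Python 3 and reuses the string from the question; `ascii_words` and `english_only` are illustrative names. Note that `re.ASCII` still keeps the digits `24`, so a plain letter class may be closer to "English words only".

```python
import re

s = 'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'

# With re.ASCII, \w only matches [a-zA-Z0-9_], so Kannada text is skipped.
ascii_words = re.findall(r'\w+', s, re.ASCII)
print(ascii_words)   # ['24', 'I', 'am', 'working', 'on', 'this']

# An explicit letter class also drops the number.
english_only = re.findall(r'[A-Za-z]+', s)
print(english_only)  # ['I', 'am', 'working', 'on', 'this']
```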















                                      answered Nov 23 '18 at 6:56









                                      kantal





























