Regular expression '\w+' is supposed to return only words in English, but it is working differently

import re

s = 'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
words = re.findall(r'\w+', s)
print(words)

I expected the above code to return only English words, but I am getting the following:

['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ',
    '24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ',
    'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']

Could someone explain how this is working?

edited Nov 23 '18 at 6:01 by Tim Biegeleisen
asked Nov 23 '18 at 5:57 by Amarnath Reddy

2

Note to everyone: The OP's code appears to be working in this demo. I can only speculate that there is some weird encoding issue happening.

– Tim Biegeleisen
Nov 23 '18 at 6:08

@TimBiegeleisen It looks like the demo you linked uses Python 2 but I'm guessing Amarnath is using Python 3, which does exhibit the problem. Amarnath, can you please edit the question to confirm which version of Python you're using?

– David Z
Nov 23 '18 at 6:11

"I expected the above code to return only english words" - why?

– user2357112
Nov 23 '18 at 6:14

2

re.findall(r'\w+', s, re.ASCII) ?

– itzMEonTV
Nov 23 '18 at 6:25

words = re.sub(r'[^a-zA-Z ]','',s) works fine with sub, but findall behaves differently; I'm not sure why.

– Cua
Nov 23 '18 at 6:42


python regex


5 Answers

I don't know why you expected \w+ to only match English words. It doesn't even do that in ASCII mode. It matches any sequence of \w characters, and the docs describe the actual behavior of \w:

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

The docs unfortunately don't get any more specific than that, but \w definitely isn't restricted to English.

If you wanted [a-zA-Z0-9_], you can write out your intended character class explicitly, or you can use the re.ASCII flag. If you wanted [a-zA-Z], write that out explicitly.
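As a rough sketch of the options above, assuming Python 3 and a shortened version of the sample string (the variable names are only illustrative):

```python
import re

# Shortened sample: one Kannada word, a number, and some English words.
s = 'ವಿಶೇಷ 24 I am working on this'

# Default for str patterns in Python 3: \w is Unicode-aware,
# so Kannada letters match as well.
unicode_words = re.findall(r'\w+', s)

# re.ASCII restricts \w to [a-zA-Z0-9_]; note that digits still match.
ascii_words = re.findall(r'\w+', s, re.ASCII)

# An explicit character class keeps only runs of ASCII letters.
letter_words = re.findall(r'[a-zA-Z]+', s)

print(unicode_words)
print(ascii_words)   # ['24', 'I', 'am', 'working', 'on', 'this']
print(letter_words)  # ['I', 'am', 'working', 'on', 'this']
```

Neither variant knows anything about English as a language; both only filter by character set, so ASCII-only words from other languages would pass too.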

edited Nov 23 '18 at 6:27
answered Nov 23 '18 at 6:18 by user2357112

I cannot reproduce your observations; see the demo. Perhaps there is some encoding issue on your end, which is why \w is picking up the Kannada characters. One workaround is to spell out explicitly what the character class \w consists of in ASCII mode:

words = re.findall(r'[A-Za-z0-9_]+', s)

print(words)

answered Nov 23 '18 at 6:04 by Tim Biegeleisen

words = re.findall(r'\w+', s)

The reason \w+ doesn't pick up what you want is that it's missing the Unicode flag. The other answers here ignore encoding by simply saying which specific letters they are looking for.

\w

When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.

That is why.

answered Nov 23 '18 at 6:09 by Reckless Engineer

1

I think you've got the right idea but a little backwards; actually, the problem appears to be that \w in the OP's code sample is picking up more than just [a-zA-Z0-9_]. I believe the difference is that what you're saying here applies to Python 2, but I suspect the code in the question is meant to run with Python 3.

– David Z
Nov 23 '18 at 6:13

Guess you're right, though it could just be a localization issue, as stated above. It may require more thorough investigation.

– Reckless Engineer
Nov 23 '18 at 6:19

1

Well, I can't rule it out, but I'm considerably more confident that it's a 2 vs 3 issue given that running the OP's code sample in Python 3 reproduces their issue but running it in Python 2 does not, and that your quote, which comes from the Python 2 documentation, describes the opposite behavior of what the OP is seeing, i.e. precisely what would happen if you run their code sample under Python 2.

– David Z
Nov 23 '18 at 6:23
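
The Python 2 behavior discussed in these comments can be approximated in Python 3 by matching a bytes pattern against UTF-8 encoded input. This is only a sketch, assuming Python 3 and a shortened sample string; it is not the OP's exact setup:

```python
import re

s = 'ವಿಶೇಷ 24 I am working on this'

# A bytes pattern uses ASCII-only \w by default, much like an unflagged
# pattern on a Python 2 byte string. The UTF-8 bytes of the Kannada
# letters are all >= 0x80, so they never match.
byte_words = re.findall(rb'\w+', s.encode('utf-8'))

print(byte_words)  # [b'24', b'I', b'am', b'working', b'on', b'this']
```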


Modify your code as shown below to see why it prints that output:

import re

s = u'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
words = re.findall(r'\w+', s)
print(words)

# Print each code point on its own line
for letter in s:
    print(letter)

OUTPUT

['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ', '24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ', 'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']

ಆ
ತ
ಂ
ಕ
ವ
ಾ
ದ
ಿ
ಗ
ಳ
ಿ
ಗ
ೆ
(the remaining characters of the string follow in the same way, one per line)

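Those dotted circles mark combining characters (vowel signs, the anusvara and the virama). \w does not match them, so the code treats them like spaces and splits the words there.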
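
One way to check this, sketched here under the assumption of Python 3 (where \w on str patterns follows str.isalnum() plus the underscore), is to print the Unicode category of each code point in the first word:

```python
import unicodedata

word = 'ಆತಂಕವಾದಿಗಳಿಗೆ'

# Collect (character, Unicode category, isalnum) for every code point.
rows = [(ch, unicodedata.category(ch), ch.isalnum()) for ch in word]

for ch, category, alnum in rows:
    # Categories starting with 'L' are letters; 'M' are combining marks.
    print(f'U+{ord(ch):04X}', category, alnum)
```

The letters report a category starting with L and isalnum() of True, while the vowel signs and the anusvara report a mark category (starting with M) and False, which is where the matches break.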

answered Nov 23 '18 at 6:32 by MPJ

Look at @itzMEonTV's suggestion:

In [46]: rex = re.compile(r'\w+')

In [47]: rex
Out[47]: re.compile(r'\w+', re.UNICODE)
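
The same check can be done outside IPython by inspecting the compiled pattern's flags. A small sketch, assuming Python 3 (the names default_rx and ascii_rx are only illustrative):

```python
import re

default_rx = re.compile(r'\w+')
ascii_rx = re.compile(r'\w+', re.ASCII)

# str patterns get re.UNICODE implicitly unless re.ASCII is passed.
print(bool(default_rx.flags & re.UNICODE))  # True
print(bool(ascii_rx.flags & re.UNICODE))    # False
print(bool(ascii_rx.flags & re.ASCII))      # True
```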

answered Nov 23 '18 at 6:56 by kantal


