Regular expression '\w+' is supposed to return only words in English, but it is working differently
import re
s = 'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
words = re.findall(r'\w+', s)
print(words)
I expected the above code to return only English words, but I am getting the output below.
['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ',
'24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ',
'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']
Could someone explain how this is working?
python regex
edited Nov 23 '18 at 6:01 by Tim Biegeleisen
asked Nov 23 '18 at 5:57 by Amarnath Reddy
2
Note to everyone: The OP's code appears to be working in this demo. I can only speculate that there is some weird encoding issue happening.
– Tim Biegeleisen
Nov 23 '18 at 6:08
@TimBiegeleisen It looks like the demo you linked uses Python 2 but I'm guessing Amarnath is using Python 3, which does exhibit the problem. Amarnath, can you please edit the question to confirm which version of Python you're using?
– David Z
Nov 23 '18 at 6:11
"I expected the above code to return only english words" - why?
– user2357112
Nov 23 '18 at 6:14
2
re.findall(r'\w+', s, re.ASCII)?
– itzMEonTV
Nov 23 '18 at 6:25
words = re.sub(r'[^a-zA-Z ]', '', s): sub works fine, but findall behaves differently; not sure why.
– Cua
Nov 23 '18 at 6:42
5 Answers
I don't know why you expected \w+ to only match English words. It doesn't even do that in ASCII mode. It matches any sequence of \w characters, and the docs describe the actual behavior of \w:
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
The docs unfortunately don't get any more specific than that, but \w definitely isn't restricted to English.
If you wanted [a-zA-Z0-9_], you can write out your intended character class explicitly, or you can use the re.ASCII flag. If you wanted [a-zA-Z], write that out explicitly.
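As a rough illustration of the two options above, here is a minimal sketch. It assumes Python 3 and uses a shortened, hypothetical sample string rather than the one from the question; the variable names are made up for the example.

```python
import re

# Hypothetical shortened sample: one Kannada word, a number, and English words.
sample = 'ಕನ್ನಡ 24 I am working on this'

# Option 1: keep \w but restrict it to [a-zA-Z0-9_] with the re.ASCII flag.
ascii_words = re.findall(r'\w+', sample, re.ASCII)

# Option 2: spell out the class; [a-zA-Z] also drops digits and underscores.
letters_only = re.findall(r'[a-zA-Z]+', sample)

print(ascii_words)
print(letters_only)
```

Note that the first option still returns '24', since digits are part of \w even in ASCII mode; only the explicit [a-zA-Z] class leaves just the English words.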
answered Nov 23 '18 at 6:18 by user2357112 (edited Nov 23 '18 at 6:27)
I cannot reproduce your observations; see the demo. Perhaps there is some encoding issue on your end, which is why \w is picking up the Kannada characters. One workaround would be to explicitly spell out the character class you expect \w to be:
words = re.findall(r'[A-Za-z0-9_]+', s)
print(words)
answered Nov 23 '18 at 6:04 by Tim Biegeleisen
words = re.findall(r'\w+', s)
The reason \w+ doesn't pick up what you want is that it's missing the Unicode flag. The other answers here ignore encoding by simply saying which specific letters they are looking for.
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
That is why.
answered Nov 23 '18 at 6:09 by Reckless Engineer
1
I think you've got the right idea but a little backwards; actually, the problem appears to be that \w in the OP's code sample is picking up more than just [a-zA-Z0-9_]. I believe the difference is that what you're saying here applies to Python 2, but I suspect the code in the question is meant to run with Python 3.
– David Z
Nov 23 '18 at 6:13
Guess you're right, though it could just be a localization issue, as stated above. May require more thorough investigation.
– Reckless Engineer
Nov 23 '18 at 6:19
1
Well, I can't rule it out, but I'm considerably more confident that it's a 2 vs 3 issue given that running the OP's code sample in Python 3 reproduces their issue but running it in Python 2 does not, and that your quote, which comes from the Python 2 documentation, describes the opposite behavior of what the OP is seeing, i.e. precisely what would happen if you run their code sample under Python 2.
– David Z
Nov 23 '18 at 6:23
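The 2 vs 3 difference discussed in these comments can be approximated inside Python 3 alone. This is only a sketch under assumptions: a bytes pattern stands in for the old Python 2 byte-string behavior, and the sample string is a shortened, hypothetical one.

```python
import re

# Hypothetical shortened sample, not the string from the question.
sample = 'ಕನ್ನಡ 24 I am'

# str pattern (the Python 3 default): \w is Unicode-aware.
str_matches = re.findall(r'\w+', sample)

# bytes pattern: \w is limited to ASCII, roughly what Python 2 did with
# a plain byte string when no UNICODE flag was given.
bytes_matches = re.findall(rb'\w+', sample.encode('utf-8'))

print(str_matches)
print(bytes_matches)
```

The bytes version only sees ASCII word characters because every byte of a UTF-8 encoded Kannada character is outside the ASCII range.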
Modify your code as shown below to see why it prints that output:
import re
s = u'ಆತಂಕವಾದಿಗಳಿಗೆ ವಿಶೇಷ ರಕ್ಷಣೆ ನೀಡುತ್ತದೆ, 24 ಕ್ಕೂ ಹೆಚ್ಚು ಹಿಂದೂ ಕಾರ್ಯಕರ್ತರ ಹತ್ಯೆಯಾದರೂ I am working on this'
words = re.findall(r'\w+', s)
print(words)
for letter in s:
    print(letter)
OUTPUT
['ಆತ', 'ಕವ', 'ದ', 'ಗಳ', 'ಗ', 'ವ', 'ಶ', 'ಷ', 'ರಕ', 'ಷಣ', 'ನ', 'ಡ', 'ತ', 'ತದ', '24', 'ಕ', 'ಕ', 'ಹ', 'ಚ', 'ಚ', 'ಹ', 'ದ', 'ಕ', 'ರ', 'ಯಕರ', 'ತರ', 'ಹತ', 'ಯ', 'ಯ', 'ದರ', 'I', 'am', 'working', 'on', 'this']
ಆ
ತ
ಂ
ಕ
ವ
ಾ
ದ
ಗ
ಳ
ಗ
ವ
ಶ
ೇ
ಷ
ರ
ಕ
ಷ
ಣ
ನ
ೀ
ಡ
ು
ತ
ತ
ದ
,
2
4
ಕ
ಕ
ೂ
ಹ
ಚ
ಚ
ು
ಹ
ಂ
ದ
ೂ
ಕ
ಾ
ರ
ಯ
ಕ
ರ
ತ
ರ
ಹ
ತ
ಯ
ಯ
ಾ
ದ
ರ
ೂ
I
a
m
w
o
r
k
i
n
g
o
n
t
h
i
s
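Those dotted circles are combining marks (vowel signs and similar). The code treats them as a kind of space: \w does not match them, so each one ends the current match.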
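To check this per character, one possible sketch is to print each code point's Unicode category next to whether \w matches it. The sample word is an illustrative choice (the word "Kannada" in Kannada script, written with escapes so every code point is visible), not part of the original answer.

```python
import re
import unicodedata

# KA, NA, VIRAMA, NA, DDA: an illustrative sample word.
word = '\u0c95\u0ca8\u0ccd\u0ca8\u0ca1'

for ch in word:
    category = unicodedata.category(ch)  # 'Lo' is a letter, 'M*' a combining mark
    is_word_char = re.fullmatch(r'\w', ch) is not None
    print(f'U+{ord(ch):04X}', category, is_word_char)

# The combining mark in the middle splits the word into two matches.
pieces = re.findall(r'\w+', word)
print(pieces)
```

The letters report a letter category and match \w, while the virama reports a mark category and does not, which is why the word comes back in two pieces.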
answered Nov 23 '18 at 6:32 by MPJ
Look at @itzMEonTV's suggestion. The compiled pattern already carries the re.UNICODE flag:
In [46]: rex = re.compile(r'\w+')
In [47]: rex
Out[47]: re.compile(r'\w+', re.UNICODE)
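The same check can be done outside IPython by inspecting the flags attribute of the compiled pattern. A small sketch, assuming Python 3 and str patterns; the variable names are made up for the example:

```python
import re

default_rex = re.compile(r'\w+')          # no flags passed
ascii_rex = re.compile(r'\w+', re.ASCII)  # the flag suggested in the comment

# Test whether the UNICODE bit is set on each compiled pattern.
default_is_unicode = bool(default_rex.flags & re.UNICODE)
ascii_is_unicode = bool(ascii_rex.flags & re.UNICODE)

print(default_is_unicode, ascii_is_unicode)
```

Passing re.ASCII turns the implicit Unicode matching off, which is what restricts \w to [a-zA-Z0-9_].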
answered Nov 23 '18 at 6:56 by kantal