Pyspark - UnicodeEncodeError: 'ascii' codec can't encode character
I am getting a UnicodeEncodeError when running the program below, which tries to insert data into an Oracle DB.
# -*- coding: utf-8 -*-
#import unicodedata
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import sys
print(sys.getdefaultencoding())
u = 'abcdé'
a = 'Austròalia'
print(u)
print(a)
spark = SparkSession.builder.master("local") \
    .appName("Unicode_Error") \
    .getOrCreate()
sqlContext = SQLContext(spark)
l = [(340, 'India',1),(340, 'Canada',2),(341, u'abcdé',3),(340, 'Japan',4),(341, u'Austròalia',5),(341, 'China',6)]
df = sqlContext.createDataFrame(l, ['CUSTOMER_ID', 'COUNTRY', 'LINENUMBER'])
df.show()
data_tuples = [tuple(x) for x in df.rdd.collect()]
print(str(data_tuples))
print(type(data_tuples))
query = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"
# `con` is a cx_Oracle connection opened earlier (the connect call is not shown in the question)
cur = con.cursor()
cur.prepare(query)
cur.executemany(None, data_tuples)
con.commit()
cur.close()
con.close()
I had set PYTHONIOENCODING=utf8 before submitting the Spark job, which solved the issues with dataframe.show(), and # -*- coding: utf-8 -*- helped with resolving the Python print statements.
However, I am still getting an error even though the dataframe displays the data correctly. The issue seems to happen where the dataframe is converted into a list. Could you please advise what else needs to be done?
ascii
abcdé
Austròalia
+-----------+----------+----------+
|CUSTOMER_ID|   COUNTRY|LINENUMBER|
+-----------+----------+----------+
|        340|     India|         1|
|        340|    Canada|         2|
|        341|     abcdé|         3|
|        340|     Japan|         4|
|        341|Austròalia|         5|
|        341|     China|         6|
+-----------+----------+----------+
[(340, u'India', 1), (340, u'Canada', 2), (341, u'abcd\xe9', 3), (340, u'Japan', 4), (341, u'Austr\xf2alia', 5), (341, u'China', 6)]
<type 'list'>
> Traceback (most recent call last):
>   cur.executemany(None, data_tuples)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 4: ordinal not in range(128)
The tuple list contains Unicode data, and I could not call encode on it directly, but printing each element of the tuple list gave me the exact output shown below:
[('340', "u'India'", '1'), ('340', "u'Canada'", '2'), ('341', "u'abcd\xe9'", '3'), ('340', "u'Japan'", '4'), ('341', "u'Austr\xf2alia'", '5'), ('341', "u'China'", '6')]
***********************
India
340
India
1
340
Canada
2
341
abcdé
3
340
Japan
4
341
Austròalia
5
341
China
6
python python-2.7 apache-spark pyspark cx-oracle
Possible duplicate of UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
– user10465355
Nov 23 '18 at 10:56
@user10465355 I had tried encode, but that doesn't seem to support tuples. As you can see in the output, it's the list of tuples that has the Unicode character. Can you throw some light on how to achieve it? Maybe I was doing it the wrong way.
– Joby
Nov 23 '18 at 11:28
For Python 2.7, the only way to fix it would be to replace() occurrences of those strings.
– karma4917
Nov 23 '18 at 15:26
@karma4917 That would be a bit tedious, right, as we cannot be sure of the data that comes in. Do you have any option handy? If so, can you share it?
– Joby
Nov 26 '18 at 15:05
@user10465355 Can the duplicate flag be removed from the question, so that I can post a solution that helped me solve the issue?
– Joby
Nov 26 '18 at 15:06
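The comments above ask how encode can be applied when the data is a list of tuples rather than a single string. A minimal sketch of one way to do it follows; `encode_rows` and `text_type` are hypothetical names introduced here for illustration and are not part of the original post. Whether the driver accepts the encoded values depends on the Python version and the column types, so treat this as an illustration of the per-field approach rather than a confirmed fix.

```python
# Hypothetical helper (not from the original post): encode every text
# field of every row to UTF-8 bytes and leave other values untouched.
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


def encode_rows(rows, encoding="utf-8"):
    """Return a new list of tuples with text fields encoded to bytes."""
    return [
        tuple(v.encode(encoding) if isinstance(v, text_type) else v
              for v in row)
        for row in rows
    ]


# Sample rows taken from the question's data
rows = [(340, u"India", 1), (341, u"abcd\xe9", 3), (341, u"Austr\xf2alia", 5)]
encoded = encode_rows(rows)
print(encoded)
```

The helper builds new tuples instead of modifying the input, because tuples are immutable and encode is a method of individual strings, not of tuples or lists.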
asked Nov 23 '18 at 9:40 by Joby
edited Nov 23 '18 at 11:37 by Joby
1 Answer
This was resolved by passing additional parameters when connecting to Oracle via cx_Oracle.
Set the default encoding for the Python environment so that it can handle Unicode data:
# -*- coding: utf-8 -*-
import sys
# Python 2 only: site.py removes sys.setdefaultencoding at startup; reload(sys) restores it
reload(sys)
sys.setdefaultencoding('utf-8')
Supply the encoding properties in the cx_Oracle connect call:
import cx_Oracle
# connection_string is your own Oracle connect string
con = cx_Oracle.connect(connection_string, encoding="UTF-8", nencoding="UTF-8")
See https://github.com/oracle/python-cx_Oracle/issues/36 for more details.
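The error message itself can be reproduced without Spark or Oracle. The sketch below is plain Python and only illustrates the codec behaviour; it assumes, as the traceback suggests, that the values were being encoded with the ASCII codec when they were bound. The variable names are made up for this example.

```python
# Illustration only: plain Python, no Spark or cx_Oracle involved.
value = u"abcd\xe9"  # the string 'abcdé' from the question

try:
    value.encode("ascii")  # ASCII has no code for characters above 127
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# UTF-8 can represent the same character as a two-byte sequence
utf8_bytes = value.encode("utf-8")

print(ascii_error)
print(repr(utf8_bytes))
```

The failing character sits at index 4 of the string, which matches the "position 4" reported in the question's traceback.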
answered Nov 27 '18 at 7:15 by Joby