Pyspark - UnicodeEncodeError: 'ascii' codec can't encode character


I am getting a UnicodeEncodeError when running the program below, which tries to insert data into an Oracle DB.

# -*- coding: utf-8 -*-
#import unicodedata
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import sys

print(sys.getdefaultencoding())

u = 'abcdé'
a = 'Austròalia'
print(u)
print(a)

# Backslashes continue the builder chain across lines.
spark = SparkSession.builder.master("local") \
        .appName("Unicode_Error") \
        .getOrCreate()

sqlContext = SQLContext(spark)

l = [(340, 'India',1),(340, 'Canada',2),(341, u'abcdé',3),(340, 'Japan',4),(341, u'Austròalia',5),(341, 'China',6)]
df = sqlContext.createDataFrame(l, ['CUSTOMER_ID', 'COUNTRY', 'LINENUMBER'])
df.show()

# Collect the rows to the driver as a list of plain tuples.
data_tuples = [tuple(x) for x in df.rdd.collect()]

print(str(data_tuples))
print(type(data_tuples))

# con is an already-open cx_Oracle connection; its creation is not shown here.
query = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"
cur = con.cursor()
cur.prepare(query)
cur.executemany(None, data_tuples)
con.commit()
cur.close()
con.close()

I set PYTHONIOENCODING=utf8 before submitting the Spark job, which fixed the output of df.show(), and adding # -*- coding: utf-8 -*- fixed the Python print statements.

However, I still get an error even though the DataFrame displays the data correctly. The failure appears once the DataFrame has been converted into a list of tuples (the traceback points at cur.executemany). Could you please advise what else needs to be done?

ascii
abcdé
Austròalia
+-----------+----------+----------+
|CUSTOMER_ID|   COUNTRY|LINENUMBER|
+-----------+----------+----------+
|        340|     India|         1|
|        340|    Canada|         2|
|        341|     abcdé|         3|
|        340|     Japan|         4|
|        341|Austròalia|         5|
|        341|     China|         6|
+-----------+----------+----------+

[(340, u'India', 1), (340, u'Canada', 2), (341, u'abcd\xe9', 3), (340, u'Japan', 4), (341, u'Austr\xf2alia', 5), (341, u'China', 6)]
<type 'list'>



> Traceback (most recent call last): cur.executemany(None, data_tuples)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
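The same message can be reproduced without Spark or Oracle. The following is a minimal sketch, assuming the failure comes from an implicit ASCII encode of the unicode value somewhere inside the driver call (the traceback does not show exactly where). The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
value = u"abcd\xe9"  # the text 'abcdé' from the sample data

error = None
try:
    # The ASCII codec only covers code points 0-127, so U+00E9 cannot be encoded.
    value.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Encoding the same value as UTF-8 succeeds and yields a two-byte sequence for 'é'.
utf8_bytes = value.encode("utf-8")
print(repr(utf8_bytes))
```

The reported position (4) is the index of the first character that the codec cannot represent, which matches the traceback above.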

The list of tuples contains unicode strings, and I could not call encode on the list or on a tuple directly. Printing each element of the tuple list, however, gave the expected output, as shown below:

[('340', "u'India'", '1'), ('340', "u'Canada'", '2'), ('341', "u'abcd\xe9'", '3'), ('340', "u'Japan'", '4'), ('341', "u'Austr\xf2alia'", '5'), ('341', "u'China'", '6')]

***********************

India

340

India

1

340

Canada

2

341

abcdé

3

340

Japan

4

341

Austròalia

5

341

China

6
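Although encode cannot be called on a tuple or a list, it can be applied to each text element inside them. The sketch below shows one way to do that; encode_rows is a hypothetical helper, not something from the original program, and whether the database driver then accepts UTF-8 byte strings depends on how the connection is configured.

```python
# -*- coding: utf-8 -*-

def encode_rows(rows, encoding="utf-8"):
    """Return a copy of rows with every text value encoded to bytes."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    return [
        tuple(v.encode(encoding) if isinstance(v, text_type) else v for v in row)
        for row in rows
    ]


rows = [(340, u"India", 1), (341, u"abcd\xe9", 3)]
encoded = encode_rows(rows)
print(encoded)
```

Non-text values such as the integer columns are passed through unchanged, so the tuples keep their shape.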

edited Nov 23 '18 at 11:37

asked Nov 23 '18 at 9:40

Joby

538514

Possible duplicate of UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

– user10465355
Nov 23 '18 at 10:56

@user10465355 I tried encode, but it doesn't seem to support tuples. As you can see in the output, it is the list of tuples that contains the unicode characters. Can you shed some light on how to achieve it? Maybe I was doing it the wrong way.

– Joby
Nov 23 '18 at 11:28

For Python 2.7, the only way to fix this would be to replace() occurrences of those strings.

– karma4917
Nov 23 '18 at 15:26

@karma4917 That would be a bit tedious, since we cannot predict the data that comes in. Do you have another option handy? If so, can you share it?

– Joby
Nov 26 '18 at 15:05

@user10465355 Can the duplicate flag be removed from the question, so that I can post the solution that helped me solve the issue?

– Joby
Nov 26 '18 at 15:06

python python-2.7 apache-spark pyspark cx-oracle


1 Answer

This was resolved by passing additional parameters when connecting to Oracle via cx_Oracle.

Set the default encoding for the Python environment so that it can handle Unicode data:

# -*- coding: utf-8 -*-
import sys
# Python 2 only: reload(sys) restores setdefaultencoding, which is removed at startup.
reload(sys)
sys.setdefaultencoding('utf-8')

Supply the encoding properties in the cx_Oracle connect call:

con = cx_Oracle.connect(connection_string, encoding = "UTF-8", nencoding = "UTF-8")

See https://github.com/oracle/python-cx_Oracle/issues/36 for more details.
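Put together with the insert from the question, the connection step might look like the sketch below. The helper names connect_utf8 and insert_rows and the user/password/dsn parameters are hypothetical; only the encoding and nencoding keyword arguments and the INSERT statement come from this page. It covers the connection step only and assumes cx_Oracle is installed when connect_utf8 is called.

```python
# -*- coding: utf-8 -*-

def connect_utf8(user, password, dsn):
    """Open a cx_Oracle connection that uses UTF-8 for CHAR and NCHAR data."""
    import cx_Oracle  # imported lazily so insert_rows can be used without it
    return cx_Oracle.connect(user, password, dsn,
                             encoding="UTF-8", nencoding="UTF-8")


def insert_rows(con, rows, query="INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"):
    """Insert all rows with a single executemany call, then commit."""
    cur = con.cursor()
    try:
        cur.executemany(query, rows)
        con.commit()
    finally:
        # Close the cursor whether or not the insert succeeded.
        cur.close()
```

With a connection returned by connect_utf8, calling insert_rows(con, data_tuples) would replace the prepare/executemany/commit sequence from the question; the caller remains responsible for closing the connection.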

answered Nov 27 '18 at 7:15

Joby

538514
