Pyspark - UnicodeEncodeError: 'ascii' codec can't encode character



























I get a UnicodeEncodeError when running the program below, which tries to insert data into an Oracle DB.



# -*- coding: utf-8 -*-
#import unicodedata
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import sys
print(sys.getdefaultencoding())

u = 'abcdé'
a = 'Austròalia'
print(u)
print(a)

spark = (SparkSession.builder.master("local")
         .appName("Unicode_Error")
         .getOrCreate())

sqlContext = SQLContext(spark)

l = [(340, 'India',1),(340, 'Canada',2),(341, u'abcdé',3),(340, 'Japan',4),(341, u'Austròalia',5),(341, 'China',6)]
df = sqlContext.createDataFrame(l, ['CUSTOMER_ID', 'COUNTRY', 'LINENUMBER'])
df.show()

data_tuples = [tuple(x) for x in df.rdd.collect()]

print(str(data_tuples))

print(type(data_tuples))

query = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"
# con is the Oracle connection; creating it is not shown in this snippet
cur = con.cursor()
cur.prepare(query)
cur.executemany(None, data_tuples)
con.commit()
cur.close()
con.close()


I had set PYTHONIOENCODING=utf8 before submitting the Spark job, which solved the issues with dataframe.show(). Adding # -*- coding: utf-8 -*- also resolved the problems with the Python print statements.



However, I still get an error even though the dataframe now displays the data correctly. The issue seems to occur once the dataframe has been converted into a list. Could you please advise what else needs to be done?



ascii
abcdé
Austròalia
+-----------+----------+----------+
|CUSTOMER_ID|   COUNTRY|LINENUMBER|
+-----------+----------+----------+
|        340|     India|         1|
|        340|    Canada|         2|
|        341|     abcdé|         3|
|        340|     Japan|         4|
|        341|Austròalia|         5|
|        341|     China|         6|
+-----------+----------+----------+

[(340, u'India', 1), (340, u'Canada', 2), (341, u'abcd\xe9', 3), (340, u'Japan', 4), (341, u'Austr\xf2alia', 5), (341, u'China', 6)]
<type 'list'>

> Traceback (most recent call last): cur.executemany(None, data_tuples)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 4: ordinal not in range(128)


The tuple list contains unicode data, and calling encode on it was not possible. Printing each element of the tuple list, however, gave exactly the output below.



[('340', "u'India'", '1'), ('340', "u'Canada'", '2'), ('341', "u'abcd\xe9'", '3'), ('340', "u'Japan'", '4'), ('341', "u'Austr\xf2alia'", '5'), ('341', "u'China'", '6')]
***********************
India
340
India
1
340
Canada
2
341
abcdé
3
340
Japan
4
341
Austròalia
5
341
China
6


































  • Possible duplicate of UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

    – user10465355
    Nov 23 '18 at 10:56











  • @user10465355 I had tried encode, but it doesn't seem to support tuples. As you can see in the output, it's the tuple list that has the unicode characters. Can you shed some light on how to achieve it? Maybe I was doing it the wrong way.

    – Joby
    Nov 23 '18 at 11:28











  • For python 2.7, the only way to fix would be to replace() occurrences of those strings.

    – karma4917
    Nov 23 '18 at 15:26











  • @karma4917 that would be a bit tedious, right, as we cannot be sure what data comes in. Do you have any option handy? If so, can you share it?

    – Joby
    Nov 26 '18 at 15:05











  • @user10465355 can the duplicate flag be removed from the question, so that I can post the solution that helped me solve the issue?

    – Joby
    Nov 26 '18 at 15:06


















python python-2.7 apache-spark pyspark cx-oracle






asked Nov 23 '18 at 9:40 by Joby; edited Nov 23 '18 at 11:37 by Joby













1 Answer
This was resolved by passing additional parameters when connecting to Oracle via cx_Oracle.



Set the default encoding for the Python environment so that it can handle Unicode data:



# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')
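
As a rough illustration of why the default codec matters, the sketch below encodes the sample value from the question with both codecs. It is written so that it also runs on Python 3, where reload(sys) and sys.setdefaultencoding no longer exist. The helper name try_encode is made up for this example and is not part of the original post.

```python
# -*- coding: utf-8 -*-

def try_encode(text, codec):
    """Return the encoded bytes, or the exception class name if encoding fails."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


value = u'abcd\xe9'  # the 'abcdé' sample from the question

# The ASCII codec cannot represent U+00E9, which is the error in the traceback.
ascii_result = try_encode(value, 'ascii')
# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_result = try_encode(value, 'utf-8')

print(ascii_result)
print(utf8_result)
```

The ASCII attempt fails at position 4, the same position reported in the traceback, while the UTF-8 attempt returns bytes.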


Supply the encoding properties in the cx_Oracle connect call:



import cx_Oracle
con = cx_Oracle.connect(connection_string, encoding="UTF-8", nencoding="UTF-8")


See https://github.com/oracle/python-cx_Oracle/issues/36 for more background on this.
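
Putting the pieces together, the insert step from the question could be wrapped roughly as follows. This is only a sketch: the names insert_customers and INSERT_SQL are hypothetical, and the connection is passed in by the caller, so the function assumes an object that offers the usual cursor(), commit() and executemany() calls, such as the connection opened with the encoding parameters above.

```python
# -*- coding: utf-8 -*-

# Same statement as in the question; the bind positions map to the tuple fields.
INSERT_SQL = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"


def insert_customers(con, rows):
    """Bulk-insert (CUSTOMER_ID, COUNTRY, LINENUMBER) tuples and commit.

    `con` is assumed to be an already opened database connection, for
    example one created with
    cx_Oracle.connect(connection_string, encoding="UTF-8", nencoding="UTF-8").
    Returns the number of rows handed to the driver.
    """
    cur = con.cursor()
    try:
        cur.executemany(INSERT_SQL, rows)
        con.commit()
    finally:
        # Close the cursor even if the insert raises.
        cur.close()
    return len(rows)
```

With the connection from this answer, the call would be insert_customers(con, data_tuples). Closing the connection itself is left to the caller.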






answered Nov 27 '18 at 7:15 by Joby































