java.io.StreamCorruptedException when importing a CSV to a Spark DataFrame

I'm running a Spark cluster in standalone mode. Both the master and the worker nodes are reachable, and their logs show up in the Spark web UI.

I'm trying to load data into a PySpark session so I can work on Spark DataFrames.

Following several examples (among them one from the official documentation), I have tried different methods, all failing with the same error. E.g.:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext  # SparkSession import was missing

conf = SparkConf().setAppName('NAME').setMaster('spark://HOST:7077')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# first attempt: the generic load() with an explicit format
df = spark.read.load('/path/to/file.csv', format='csv', sep=',', header=True)

# second attempt: the csv() shortcut on an SQLContext
sql_ctx = SQLContext(sc)
df = sql_ctx.read.csv('/path/to/file.csv', header=True)

# and a few other tries...


Every time, I get the same error:

Py4JJavaError: An error occurred while calling o81.csv. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, 192.168.X.X, executor 0):
java.io.StreamCorruptedException: invalid stream header: 0000000B

I'm loading data from both JSON and CSV files (tweaking the method calls appropriately, of course), and the error is the same for both, every time.
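
For instance, the JSON variant of the read looks like this (a minimal sketch; the path is a placeholder, just like the CSV one above):

# same read, JSON source instead of CSV (placeholder path)
df = spark.read.json('/path/to/file.json')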



Does anyone understand what the problem is?

apache-spark pyspark pyspark-sql

asked Nov 13 at 17:01 by edouardtheron

1 Answer

accepted

To whom it may concern: I finally figured out the problem, thanks to this response.

The pyspark version used by the SparkSession did not match the version of the Spark cluster (2.4 vs. 2.3).

Re-installing pyspark at version 2.3 instantly solved the issue. #facepalm
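
In case it helps anyone hitting the same mismatch, here is a minimal sketch of how the two versions can be compared (assuming a pip-installed pyspark; the 2.3.2 pin below is illustrative, not necessarily the exact build to use):

import pyspark

# version of the pyspark package installed on the driver
print(pyspark.__version__)

# version of the Spark cluster the context is actually talking to
sc = pyspark.SparkContext.getOrCreate()
print(sc.version)

# if the two differ, align the driver with the cluster, e.g.:
#   pip install 'pyspark==2.3.2'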

answered Nov 18 at 16:52 by edouardtheron (edited Nov 18 at 20:07)