Spark show works for whole dataframe yet fails for same dataframe filtered





I am using Spark 2.3.1 in a Zeppelin notebook. I create a DataFrame by loading data from Hive. This is how the DataFrame is created:



val df = hive.executeQuery("select trim(a_vno) as dst, trim(s_vno) as src, share, administrator, account, all_shares from ebyn.babs_edges_2016 where (share <> 0 or administrator <> 0 or account <> 0 or all_shares <> 0 ) and trim(date) = '201601'")
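
(For context: `hive` is the Hive Warehouse Connector session object. Its creation is not shown above; a typical setup looks roughly like the sketch below, though the actual notebook configuration may differ.)

    import com.hortonworks.hwc.HiveWarehouseSession

    // Typical HWC session setup (illustrative sketch only; the notebook that
    // provides `hive` above may be configured differently).
    val hive = HiveWarehouseSession.session(spark).build()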


When I call



df.show



it shows the first 20 rows.
But when I call



df.where("src = 'XXXXX' and dst = 'YYYYY'").show


It gives the following error:



        org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 303.0 failed 4 times, most recent failure: Lost task 3.3 in stage 303.0 (TID 10797, analitik10.host, executor 96): org.apache.spark.util.TaskCompletionListenerException: null
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
at org.apache.spark.scheduler.Task.run(Task.scala:125)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 56 elided
Caused by: org.apache.spark.util.TaskCompletionListenerException: null
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
at org.apache.spark.scheduler.Task.run(Task.scala:125)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
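
(For reference, the same predicate can also be written with the Column API; this is just an equivalent formulation of the filter above, not a workaround.)

    import org.apache.spark.sql.functions.col

    // Equivalent to df.where("src = 'XXXXX' and dst = 'YYYYY'").show,
    // with XXXXX / YYYYY being the same placeholder values as above.
    df.filter(col("src") === "XXXXX" && col("dst") === "YYYYY").show()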


Here are the properties of the Hive table:



CREATE TABLE `EBYN.BABS_EDGES_2016`(
  `date` string,
  `a_vno` string,
  `s_vno` string,
  `amount` double,
  `number` int,
  `share` int,
  `share_ratio` int,
  `administrator` int,
  `account` int,
  `all_sharelik` int)
COMMENT 'Imported by sqoop on 2018/10/17 14:53:12'
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='',
  'line.delim'='\n',
  'serialization.format'='')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://ggmprod/warehouse/tablespace/managed/hive/ebyn.db/babs_edges_2016'
TBLPROPERTIES (
  'bucketing_version'='2',
  'last_modified_by'='hadoop_etluser',
  'last_modified_time'='1539867401',
  'transactional'='true',
  'transactional_properties'='insert_only',


What is the reason that show works on the whole DataFrame but fails for the filtered DataFrame?










Tags: scala, apache-spark, apache-spark-sql





