Hi @multivac-dsl,
All the services related to Multivac DSL (Spark, HDFS, Hive, etc.) will be unreachable today due to a major upgrade.
I hope everything goes smoothly, without any data loss.
I have successfully upgraded Cloudera to 6.1, which is based on Hadoop 3.x and brings many new changes. Since this was a major upgrade, some parts of your pipeline might not work as they should.
The full list of incompatible changes in 6.0.0:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_incompatible_changes.html#hive_hos_hcatalog_incompatible_changes_c6
Big problem: Spark 2.4 is not supported by the latest release of Zeppelin. I am looking into whether I can build Zeppelin manually to fix this.
Please let me know if you have any problems with your workflow; we’ll find a way to make it compatible again.
The issue with Zeppelin has been resolved. It now supports Spark 2.4 and Hadoop 3.0!
There is a problem with spark.read.csv and spark.read.json in the new Spark 2.4 and Zeppelin. I am trying to fix this issue.
Hi,
Thanks for the update.
I was using spark.read.csv and spark.read.json.
Is there another way to read files from Hadoop? Or is there no need for a workaround, since a fix should be available soon?
Noé
Hi Noé,
Sorry, I am still working on it with the teams at Zeppelin and Cloudera. I will find a workaround for you to read CSV and JSON files. The problem only affects JSON/CSV; the text and Parquet readers don’t hit this serialization error.
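In the meantime, here is a rough sketch of a possible workaround: read the file as plain text and split it manually. The path below is just a placeholder, and it assumes a simple comma-separated file with a header row:
// Placeholder path; plain-text reads are unaffected by the bug.
val raw = sc.textFile("hdfs:/user/example/data.csv")
val header = raw.first()
val rows = raw.filter(_ != header).map(_.split(",", -1))
rows.take(5).foreach(r => println(r.mkString(" | ")))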
I’ll let you know soon.
PS: I found an easier way to write your UDF without broadcasting, simply using case/otherwise; see the sketch below.
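Roughly something like this (the DataFrame and column names are made up, just to illustrate the pattern):
import spark.implicits._
import org.apache.spark.sql.functions.{when, col, lit}
// Hypothetical toy data; replace with your own DataFrame.
val df = Seq("A", "B", "C").toDF("code")
// Instead of broadcasting a lookup map into a UDF, express the mapping
// directly with when/otherwise, which Spark can optimize natively.
val labelled = df.withColumn("category",
  when(col("code") === "A", lit("alpha"))
    .when(col("code") === "B", lit("beta"))
    .otherwise(lit("other")))
labelled.show(false)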
Hi @ngaumont,
I have changed the conflicting dependencies in Zeppelin and re-built it. The problem with json and csv seems to be resolved after these changes.
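If you want a quick sanity check, something like this should work again (the paths are just placeholders):
// Placeholder paths; point these at your own files on HDFS.
val csvDf = spark.read.option("header", "true").csv("hdfs:/user/example/sample.csv")
val jsonDf = spark.read.json("hdfs:/user/example/sample.json")
csvDf.printSchema()
jsonDf.show(5, false)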
Please let me know if you experience any issue.
PS: I’ll send you a solution for your UDF/broadcast problem tomorrow.
Best,
Maziyar
Hi @mpanahi,
On Zeppelin, I can no longer retrieve data from Hive with the following code:
val hiveMainTable = spark.sql("""
SELECT
HIDDEN
""")
hiveMainTable.printSchema
hiveMainTable.show(false)
I have this error:
java.lang.NoSuchMethodError: com.facebook.fb303.FacebookService$Client.sendBaseOneway(Ljava/lang/String;Lorg/apache/thrift/TBase;)V
at com.facebook.fb303.FacebookService$Client.send_shutdown(FacebookService.java:436)
at com.facebook.fb303.FacebookService$Client.shutdown(FacebookService.java:430)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.close(HiveMetaStoreClient.java:606)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
at com.sun.proxy.$Proxy30.close(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2477)
at com.sun.proxy.$Proxy30.close(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.close(Hive.java:414)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:330)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:317)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293)
at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:920)
at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:884)
at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1546)
at org.apache.hadoop.hive.ql.session.SessionState.getUserFromAuthenticator(SessionState.java:1234)
at org.apache.hadoop.hive.ql.metadata.Table.getEmptyTable(Table.java:181)
at org.apache.hadoop.hive.ql.metadata.Table.<init>(Table.java:123)
at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:927)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:685)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.listFiles(CatalogFileIndex.scala:59)
at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions$lzycompute(DataSourceScanExec.scala:191)
at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions(DataSourceScanExec.scala:188)
at org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$22.apply(DataSourceScanExec.scala:290)
at org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$22.apply(DataSourceScanExec.scala:289)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.execution.FileSourceScanExec.metadata$lzycompute(DataSourceScanExec.scala:289)
at org.apache.spark.sql.execution.FileSourceScanExec.metadata(DataSourceScanExec.scala:275)
at org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:55)
at org.apache.spark.sql.execution.FileSourceScanExec.simpleString(DataSourceScanExec.scala:159)
at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:177)
at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:159)
at org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:63)
at org.apache.spark.sql.execution.FileSourceScanExec.verboseString(DataSourceScanExec.scala:159)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:568)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:568)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:207)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:207)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:99)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:207)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:75)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:747)
at org.apache.spark.sql.Dataset.show(Dataset.scala:724)
... 47 elided
Are there any issues with the cluster?
Noé
Hi,
Yes, there is another incompatibility between Zeppelin and the new Hadoop: libthrift. I will see if I can upgrade it to a newer version or shade it somehow.
Meanwhile, you can access the same data by reading the parquet files:
val hiveMainTable = spark.read.parquet("REMOVE_FOR_SECURITY")
Since these are partitioned by year, you can even get more specific by adding the year to the end of this path:
REMOVE_FOR_SECURITY/2017
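For example, to read only the 2017 partition:
// Read only the 2017 partition (same redacted path as above).
val hive2017 = spark.read.parquet("REMOVE_FOR_SECURITY/2017")
hive2017.printSchema()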
Thanks for reporting this.
I’ll keep you posted on access to the Hive tables.
Best,
Maziyar
Hi Maziyar,
Thanks for the update!
For my part, I have an issue with HDFS I/O from C++.
When I call it from the Spark shell like this:
sc.parallelize(List(
  "hdfs:/user/lcaraffa/output_2/hello_word.file"
)).pipe(command = "cpp_exe", env = env_map_multivacs_2).collect()
I get an error in my C++ code when I try to open the stream on HDFS.
But when I run it from my home directory, it works:
echo "hdfs:/user/lcaraffa/output_2/hello_word.file" | ./cpp_exe
Do you have any idea?
Thanks
Laurent.
OK, problem solved:
my environment variable was first defined by this hack:
var env_map_multivacs_2 = Map(
  "CLASSPATH" -> "hadoop classpath --glob".!!
)
Because it works in my local session, I just had to replace it with:
var env_map_multivacs = Map(
  "CLASSPATH" -> sys.env("CLASSPATH")
)
Which makes more sense in retrospect.
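For completeness, the corrected pipe call then looks like this (note that the old hack also needed import scala.sys.process._ for .!!, which the sys.env version no longer requires):
// Reuse the fixed env map from above; cpp_exe must still be on the workers' PATH.
sc.parallelize(List(
  "hdfs:/user/lcaraffa/output_2/hello_word.file"
)).pipe(command = "cpp_exe", env = env_map_multivacs).collect()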