Upgrading Multivac Hadoop Cluster

multivac-dsl
maintenance

(Maziyar Panahi) #1

Hi @multivac-dsl,

All the services related to Multivac DSL (Spark, HDFS, Hive, etc.) will be unreachable today due to a major upgrade.

Hope everything goes well without any loss :slight_smile:


(Maziyar Panahi) #2

I have successfully upgraded Cloudera to 6.1, which is based on Hadoop 3.x and brings many new changes. This was a major upgrade, so some parts of your pipeline may not work as they should.

The full list of incompatible changes in 6.0.0:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_incompatible_changes.html#hive_hos_hcatalog_incompatible_changes_c6

Big problem: Spark 2.4 is not supported by the latest release of Zeppelin. I am looking into whether I can build Zeppelin manually and fix this.

Please let me know if you have any problem with your workflow; we’ll find a way to make it compatible again.


(Maziyar Panahi) #3

The issue with Zeppelin has been resolved. It now supports Spark 2.4 and Hadoop 3.0!


(Maziyar Panahi) #6

There is a problem with spark.read.csv and spark.read.json in the new Spark 2.4 and Zeppelin. I am working on a fix.
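For reference, these are the kinds of calls affected (a minimal sketch; the paths are hypothetical placeholders, not real datasets on the cluster):

// Hypothetical paths, for illustration only.
val csvDF  = spark.read.option("header", "true").csv("hdfs:///user/<your_user>/example.csv")
val jsonDF = spark.read.json("hdfs:///user/<your_user>/example.json")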


(Noe Gaumont) #7

Hi,

Thanks for the update.
I was using spark.read.csv and spark.read.json.
Is there another way to read files from Hadoop?
Or is there no need for a workaround, since a fix should be available soon?

Noé


(Maziyar Panahi) #8

Hi Noe,

Sorry, I am working on it with the Zeppelin and Cloudera teams. I will find a workaround for you to read CSV and JSON files. The issue only affects JSON/CSV; text and parquet don’t hit this serialization error.
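As a temporary sketch of that workaround (the paths below are placeholders, not real datasets):

// Parquet and plain text reads are not affected by the serialization issue.
val parquetDF = spark.read.parquet("hdfs:///user/<your_user>/example_parquet")
val textDF    = spark.read.text("hdfs:///user/<your_user>/example.txt")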

I’ll let you know soon.
PS: I found an easier way to do your UDF without broadcasting, by simply using case/otherwise :slight_smile:
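Something along these lines (a sketch with made-up data and column names, using Spark’s when/otherwise Column API rather than your actual UDF):

import spark.implicits._
import org.apache.spark.sql.functions.{when, col, lit}

// Made-up example data; your real DataFrame and mapping will differ.
// Instead of broadcasting a lookup map and wrapping it in a UDF, express the
// mapping directly so Catalyst can optimise it.
val df = Seq("A", "B", "C").toDF("code")

val labelled = df.withColumn("category",
  when(col("code") === "A", lit("alpha"))
    .when(col("code") === "B", lit("beta"))
    .otherwise(lit("unknown")))

labelled.show()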


(Maziyar Panahi) #9

Hi @ngaumont,

I have changed the conflicting dependencies in Zeppelin and rebuilt it. The problem with JSON and CSV seems resolved after these changes.

Please let me know if you experience any issue.
PS: I’ll send you a solution for your UDF/broadcast problem tomorrow.

Best,
Maziyar


(Noe Gaumont) #10

Hi @mpanahi

Now on Zeppelin, I can’t retrieve data from Hive with the following code:

val hiveMainTable = spark.sql("""
    SELECT 
        HIDDEN
    """)

hiveMainTable.printSchema
hiveMainTable.show(false)

I have this error:


java.lang.NoSuchMethodError: com.facebook.fb303.FacebookService$Client.sendBaseOneway(Ljava/lang/String;Lorg/apache/thrift/TBase;)V
  at com.facebook.fb303.FacebookService$Client.send_shutdown(FacebookService.java:436)
  at com.facebook.fb303.FacebookService$Client.shutdown(FacebookService.java:430)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.close(HiveMetaStoreClient.java:606)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
  at com.sun.proxy.$Proxy30.close(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2477)
  at com.sun.proxy.$Proxy30.close(Unknown Source)
  at org.apache.hadoop.hive.ql.metadata.Hive.close(Hive.java:414)
  at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:330)
  at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:317)
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293)
  at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:920)
  at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:884)
  at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1546)
  at org.apache.hadoop.hive.ql.session.SessionState.getUserFromAuthenticator(SessionState.java:1234)
  at org.apache.hadoop.hive.ql.metadata.Table.getEmptyTable(Table.java:181)
  at org.apache.hadoop.hive.ql.metadata.Table.<init>(Table.java:123)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:927)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:685)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex.listFiles(CatalogFileIndex.scala:59)
  at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions$lzycompute(DataSourceScanExec.scala:191)
  at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions(DataSourceScanExec.scala:188)
  at org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$22.apply(DataSourceScanExec.scala:290)
  at org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$22.apply(DataSourceScanExec.scala:289)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.sql.execution.FileSourceScanExec.metadata$lzycompute(DataSourceScanExec.scala:289)
  at org.apache.spark.sql.execution.FileSourceScanExec.metadata(DataSourceScanExec.scala:275)
  at org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:55)
  at org.apache.spark.sql.execution.FileSourceScanExec.simpleString(DataSourceScanExec.scala:159)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:177)
  at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:159)
  at org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:63)
  at org.apache.spark.sql.execution.FileSourceScanExec.verboseString(DataSourceScanExec.scala:159)
  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:568)
  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:568)
  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:207)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:207)
  at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:99)
  at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:207)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:75)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:747)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:724)
  ... 47 elided

Are there any issues with the cluster?

Noé


(Maziyar Panahi) #11

Hi,

Yes, there is another incompatibility between Zeppelin and the new Hadoop: libthrift. I will see if I can upgrade it to a newer version or shade it somehow.

Meanwhile, you can access the same data by reading the parquet files:

val hiveMainTable = spark.read.parquet("REMOVE_FOR_SECURITY")

Since these are partitioned by year, you can even get more specific by adding the year to the end of this path:
REMOVE_FOR_SECURITY/2017
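For example (a sketch; the root path is redacted above and the partition layout is as described):

// Read only the 2017 partition by appending the year to the (redacted) table root.
val base  = "REMOVE_FOR_SECURITY"
val y2017 = spark.read.parquet(s"$base/2017")
y2017.printSchema()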

Thanks for reporting this.

I’ll keep you updated on accessing Hive tables.

Best,
Maziyar


(Maziyar Panahi) #12

The problem with Hive has been resolved in Zeppelin :slight_smile:


(Laurent) #13

Hi Maziyar,

Thanks for the update!
For my part, I have an issue with HDFS I/O from C++.

When I call it from the Spark shell like this:

sc.parallelize(List(
  "hdfs:/user/lcaraffa/output_2/hello_word.file"
)).pipe(command = "cpp_exe", env = env_map_multivacs_2).collect()

I get an error in my C++ code when I try to open the stream on HDFS.

But when I run it directly from my home directory, it works:

echo "hdfs:/user/lcaraffa/output_2/hello_word.file" | ./cpp_exe

Do you have any idea?
Thanks
Laurent.


(Laurent) #14

OK, problem solved:
My environment variable was first defined by this hack:

import scala.sys.process._  // needed for the .!! shell-out below

var env_map_multivacs_2 = Map(
  "CLASSPATH" -> "hadoop classpath --glob".!!
)

Since it works in my local session, I just had to replace it with:
var env_map_multivacs = Map(
  "CLASSPATH" -> sys.env("CLASSPATH")
)
Which makes more sense in retrospect.
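Putting the two posts together, a sketch of the corrected call (the executable name and HDFS path are the ones from the posts above):

// Forward the CLASSPATH already set in the driver's environment to the piped C++
// process, instead of re-computing it with "hadoop classpath --glob".!!
val env_map_multivacs = Map("CLASSPATH" -> sys.env("CLASSPATH"))

sc.parallelize(List("hdfs:/user/lcaraffa/output_2/hello_word.file"))
  .pipe(command = "cpp_exe", env = env_map_multivacs)
  .collect()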


(Maziyar Panahi) #15

Hi @lcaraffa,

Sorry, the notifications on Discourse were down, so I missed your post here. I’m happy it worked out.

Don’t hesitate to post back here; I’ll leave this topic open until I’m sure everything is compatible with the new updates.


(Maziyar Panahi) #16