How to compile C/C++ code against native Hadoop library on Multivac cluster

multivac-dsl
hadoop

(Maziyar Panahi) #1

Hi @lcaraffa and @mbredif,

I am very close to solving this issue so we can all move on to developing and testing! :slight_smile:

I have found where the right headers and native Hadoop libraries are on the Multivac cluster. So the way to link against Hadoop on Multivac is as follows:

# Optional
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf

# Very important when you are executing the C/C++ code. This helps Java find and link the libraries (classpath), if I am not mistaken
export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server:/opt/cloudera/parcels/CDH/lib/
export CLASSPATH=$CLASSPATH:`hadoop classpath --glob`

# How to link C/C++ to the right headers and Hadoop native libraries
g++ -o cpp_exe2 main.cpp  -std=c++11 -I/opt/cloudera/parcels/CDH/include/ -L/opt/cloudera/parcels/CDH/lib/ -lhdfs
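
As an optional sanity check (not part of the original instructions), you can confirm the binary actually resolves libhdfs from the Cloudera parcel and libjvm from the JRE paths exported above:

```bash
# Optional: list the shared libraries the binary resolves; with LD_LIBRARY_PATH
# set as above, libhdfs and libjvm should not show up as "not found".
ldd ./cpp_exe2 | grep -E 'hdfs|jvm'
```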

I have tested the compiled version from the server:
```bash
./cpp_exe2 # enter
/user/[USERNAME]/tmp/test.txt # enter
```

This successfully created the file test.txt on HDFS at the given path! (hooray)

Now I am going to solve the other issue: this compiled version cannot be executed by Spark RDD pipe() on the cluster, because the other machines/executors don’t have access to the LD_LIBRARY_PATH.
This shouldn’t be hard; I am going to find a way to either add it to the Spark session or distribute it to all clients as a config with ZooKeeper.
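
For reference, one possible way to wire this up at submit time is to push the library path to the executors and the application master, and ship the binary alongside the job. This is only a sketch; the jar, class, and binary names below are placeholders, not the actual job:

```bash
# Hypothetical submit command: my_pipe_job.jar, my.package.PipeJob and cpp_exe2
# are placeholder names. The library paths are the ones used above.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server:/opt/cloudera/parcels/CDH/lib/ \
  --conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server:/opt/cloudera/parcels/CDH/lib/ \
  --files cpp_exe2 \
  --class my.package.PipeJob \
  my_pipe_job.jar
```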

This is the GitHub repo I made to hold some code and instructions for the future. It is not complete yet, but you are welcome to keep an eye on it :slight_smile:


(Maziyar Panahi) #2

Hi @lcaraffa,

I did manage to compile it and even run it on the cluster successfully. However, there is an issue with permissions when the app is called by Spark pipe() in cluster mode.

Could you do me a favor and give me a sample of your input file? Also, a sample of an output file from the first iteration? I don’t need the algorithm; all I need is to test whether it’s possible to send files and write them into HDFS from Spark. (e.g. saveAsTextFile right after pipe, so it writes directly to HDFS without collecting anything)

Thanks,


(Laurent) #3

Oh, that’s very good news about the test file! Yes, for sure I’ll send you a sample.
Let me know what the exact issue with the permissions is.
In the meantime, I’ll continue to work on a simplified version of the full pipeline; we should have something that works very soon :slight_smile:

Thanks!
Laurent.


(Maziyar Panahi) #4

Hi @lcaraffa,

I have fixed the permission problem. Normally, YARN containers are launched by a default user such as yarn or nobody, which is no problem: Spark knows how to impersonate the real user, so you get the correct username with all the appropriate permissions on HDFS.
However, Spark RDD pipe() launches its container as the user yarn, and in that case your C/C++ application has no way to read/write any HDFS directory as the user yarn.

I have fixed this issue by asking Hadoop to respect the Linux users/groups connected to our LDAP.
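
For anyone following along, a quick way to check from the gateway that the LDAP users and groups are now visible to Hadoop (with [USERNAME] as a placeholder, as above):

```bash
# Show the groups Hadoop resolves for a given (LDAP) user.
hdfs groups [USERNAME]

# With the right mapping, the user should now be able to write
# under their own HDFS home directory.
hdfs dfs -touchz /user/[USERNAME]/tmp/perm_check.txt
hdfs dfs -ls /user/[USERNAME]/tmp/
```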

Now we can compile the C++ you gave me, add it correctly to YARN, call Spark pipe(), and have the C++ application read/write from HDFS! :slight_smile:

I am preparing your account to have access to Multivac DSL (Notebooks, Hue, and Gateway server).


(Maziyar Panahi) #5

Hi again @lcaraffa,

Mathieu already has a username in our LDAP, so I can add him to Multivac DSL.

Could you please apply for an ISC-PIF services account here (you can choose Multivac Data Science Lab):

https://iscpif.fr/apply-for-a-services-account/

Thank you,


(Laurent) #6

Ok, that’s perfect, thank you!
I have applied for the ISC-PIF service.
I just shared point clouds from 2D and 3D datasets with you.


(Maziyar Panahi) #7

Thanks @lcaraffa, I have sent you the instructions.

Let me know if you have any issues getting started. Also, you are welcome to bring your computer to ISC-PIF and work from here for a day :slight_smile:


(Maziyar Panahi) #8

Hi @lcaraffa,

Would it be possible, instead of calling the compiled C/C++ app directly, to call a bash script which passes the inputs through to the C/C++ application? (In that situation, the script can set some environment variables before calling the C/C++ application, which helps me isolate them to this case.)


(Maziyar Panahi) #9

Something like this will work perfectly:

#!/bin/bash
filePath=$1
export CLASSPATH=`hadoop classpath --glob`
export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server:/opt/cloudera/parcels/CDH/lib/

echo "Running shell script $filePath"
./cpp_exe "$filePath"

Here we can set the CLASSPATH, which is needed to run the already-compiled C/C++ application.
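
To sanity-check the wrapper on the gateway before wiring it into pipe(), something like the following should do (the script name run_cpp.sh is just a placeholder; the HDFS path is the same test path as before):

```bash
# Hypothetical names: the wrapper above saved as run_cpp.sh next to cpp_exe.
chmod +x run_cpp.sh cpp_exe
./run_cpp.sh /user/[USERNAME]/tmp/test.txt

# Confirm the file showed up on HDFS.
hdfs dfs -ls /user/[USERNAME]/tmp/
```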


(Laurent) #10

It’s ok to embed my C++ call in a bash script; it should work.
Yes, I should come, it will be much easier to make progress :slight_smile: ! I’m done giving lessons this week, so I can be 100% on it.
I’ll send you a PM.

Thank you,
Laurent.


(Maziyar Panahi) #11

Sure, I am available on Friday; please bring your laptop and we can work on it together.

I will give you some instructions on how to connect to the machine that serves as a gateway to the Hadoop cluster and lets you put or copy files to or from HDFS, run spark-shell on the cluster, run spark-submit on the cluster, and keep some files locally.
Also, I’ll show you some documentation on how to use Apache Zeppelin (it is an open source project: https://zeppelin.apache.org/).
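
For context, the day-to-day commands from the gateway look roughly like this (file names and paths below are only examples):

```bash
# Copy a local file to HDFS and pull results back.
hdfs dfs -put local_data.txt /user/[USERNAME]/tmp/
hdfs dfs -get /user/[USERNAME]/tmp/output ./

# Interactive Spark shell running on the cluster.
spark-shell --master yarn

# Batch submission to the cluster (my_job.jar is a placeholder).
spark-submit --master yarn --deploy-mode cluster my_job.jar
```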

See you Friday morning?


(Laurent) #12

Ok for Friday morning, that’s a very good plan, thanks!
In the meantime I am preparing a simplified version of the algorithm.

Laurent.


(Maziyar Panahi) #13

That’s great! A simple algorithm and some data to iterate over would be a perfect test run to make sure everything works as smoothly as it should.

So see you on Friday :slight_smile:


(Maziyar Panahi) #14