I have successfully tested a simple compiled C/C++ program with Spark on a YARN cluster.
Leaving aside what the C/C++ code itself does, the process of executing an external compiled binary on the YARN cluster was much easier than I thought:
- Put the already compiled C/C++ binary on HDFS so it can be pushed to all executors
- Make the compiled file available to the executors, either inline in the code (sc.addFile) or with the --files flag of spark-submit (see the sketch after this list)
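For the --files route specifically, here is a minimal sketch; the application jar name is a placeholder, and I am assuming that YARN's distributed cache places --files entries in each container's working directory, which is what makes the relative ./simple-c path work:

// submitted with, for example:
//   spark-submit --master yarn --files hdfs:///user/maziyar/tmp/simple-c my-app.jar
// no sc.addFile call is needed; the binary can be invoked with a relative path
val piped = sc.parallelize(Seq("Don", "Betty", "Sally")).pipe(Seq("./simple-c"))
piped.collect().foreach(println)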
Sample C code:
#include <stdio.h>

/* Reads lines from stdin (fed by Spark's pipe()) and prints a greeting per line. */
int main(int argc, char *argv[]) {
    char str[100];
    while (1) {
        if (!fgets(str, 100, stdin)) {
            return 0; /* exit once Spark closes the input stream */
        }
        printf("Hello, %s", str);
    }
}
Sample Scala code (run in spark-shell):
// ship the compiled binary to every executor
val distScript = "hdfs:///user/maziyar/tmp/simple-c"
sc.addFile(distScript)

// pipe each element of the RDD through the external program
val names = sc.parallelize(Seq("Don", "Betty", "Sally"))
val piped = names.pipe(Seq("./simple-c"))
piped.collect().foreach(println)
spark-shell output:
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:28
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[21] at pipe at <console>:28
Hello, Don
Hello, Betty
Hello, Sally
Question: Let's talk about why you need your C++ code to write the results back to HDFS. Why not collect the results and write them with Spark, or just collect them and continue with your pipeline? (I am asking because the C++-to-HDFS connection is very painful to compile, and tends to be shaky and unstable, especially across different environments.)
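To make that alternative concrete, here is a minimal sketch, assuming the piped binary only writes its results to stdout (the output path is a placeholder); Spark then persists the results to HDFS itself, so the C++ side never needs an HDFS client such as libhdfs:

// the external program only reads stdin and writes stdout;
// Spark captures that stdout as an RDD[String] and writes it to HDFS itself
val results = names.pipe(Seq("./simple-c"))
results.saveAsTextFile("hdfs:///user/maziyar/tmp/simple-c-output") // placeholder output path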
The pipeline goes as follows (approximately, as I have simplified some steps):