I have successfully tested my simple compiled C/C++ code on a Spark on YARN cluster. Setting aside what the C/C++ code itself does, the process of executing an external compiled C/C++ binary on the YARN cluster was much easier than I thought:
- Put the already compiled C/C++ binary on HDFS so it can be pushed to all executors.
- Add the compiled file to the executors, either from within the code (sc.addFile) or with the --files flag of spark-submit (a command-line sketch follows this list).
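As a rough end-to-end sketch of those two steps from the command line (the gcc invocation, local file names, and application jar are assumptions for illustration; only the HDFS path comes from the example below):

gcc -o simple-c simple-c.c
hdfs dfs -put simple-c /user/maziyar/tmp/simple-c
spark-submit \
  --master yarn \
  --files hdfs:///user/maziyar/tmp/simple-c \
  your-spark-app.jar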
Sample C code:
#include <stdio.h>

int main(int argc, char *argv[]) {
    char str[100];
    while (1) {
        if (!fgets(str, 100, stdin)) {
            return 0;
        }
        printf("Hello, %s", str);
    }
}
Sample Scala code:
val distScript = "hdfs:///user/maziyar/tmp/simple-c"
sc.addFile(distScript)
val names = sc.parallelize(Seq("Don", "Betty", "Sally"))
val piped = names.pipe(Seq("./simple-c"))
piped.collect().map(println(_))
Output:
names: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:28
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[21] at pipe at <console>:28
Hello, Don
Hello, Betty
Hello, Sally
Question: Let’s talk about why you need your C++ code to write the results back to HDFS. Why not collect the results and write them with Spark, or simply collect and continue with the rest of your pipeline? (I am asking because the C++ HDFS connection is very painful to compile, and it is shaky and unstable, especially across different environments.)
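For instance, a minimal sketch of the "write it with Spark" alternative, reusing the piped RDD from the example above (the output path is hypothetical):

// Let the C++ program only transform stdin to stdout, and let Spark persist the results
piped.saveAsTextFile("hdfs:///user/maziyar/tmp/simple-c-output")

// ...or keep the results in the job and continue with the rest of the pipeline
val upper = piped.map(_.toUpperCase)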