EPIQUE: direct access to HDFS and Spark


(Hubert Naacke) #1

@mpanahi Hi Maziyar. For the Epique project we need to upload and download large datasets to and from HDFS. We also need to run several topic detection and mining algorithms with Spark in batch mode (using spark-submit, for instance).
Is there a way to access Multivac directly, for example over SSH, without going through the notebook?

Regards,


(Maziyar Panahi) #2

Hi @hnaacke,

Yes, there is a gateway machine that has access to HDFS and Spark. You can SSH to this machine to:

Run your Spark jobs: spark2-submit
Use the interactive Spark command line: spark2-shell
Use HDFS commands, for example to put a file from local disk into your HDFS directory: hadoop fs (e.g. hadoop fs -ls /user/hnaacke)
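As a rough illustration of the HDFS commands listed above, the following sketch moves a dataset into HDFS and fetches a result directory back. The file name, the epique subdirectory and the results directory are hypothetical; only the /user/hnaacke pattern comes from the example above. The RUN switch is also an assumption added for illustration: by default the commands are only printed, so the script can be read and checked on a machine without Hadoop.

```shell
#!/bin/sh
# Sketch only: paths and file names below are hypothetical placeholders.

# Run a command when RUN=1, otherwise just print it (dry run).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# $1 = local file to upload, $2 = target HDFS directory
hdfs_roundtrip() {
    run hadoop fs -mkdir -p "$2"            # create the target directory if missing
    run hadoop fs -put "$1" "$2/"           # upload: local disk -> HDFS
    run hadoop fs -ls "$2"                  # check that the file arrived
    run hadoop fs -get "$2/results" ./results   # download: HDFS -> local disk
}

hdfs_roundtrip corpus.json /user/hnaacke/epique
```

Setting RUN=1 on the gateway would execute the four commands instead of printing them.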

This machine has IP-based protection on all of its ports. To give you access, I need the public IP address that your workplace uses.

Simply search for "my ip" on Google and send me the result; I will find the corresponding range and add it to the firewall.
PS: The IP addresses must belong to academic networks (such as RENATER), not individual ADSL connections.

You can send me this information by private message here: https://chat.iscpif.fr


(Maziyar Panahi) #3

Hi @hnaacke @kli,

I added your IP address and its range to the firewall. You can now SSH to the machine whose details I already sent you.

Just a reminder: ad hoc or individual computations are forbidden on this machine, since it serves only as a gateway and hosts multi-user services. Please use Spark for all computations.

This is an example of how spark-submit works on the cluster:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

--master yarn and --deploy-mode cluster are really important: in cluster mode the driver runs inside the YARN cluster rather than on the gateway.
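For a user's own batch job, the same two options carry over. The sketch below wraps the submission in a small function; the jar name, main class and HDFS paths are hypothetical, the queue name is the placeholder from the example above, and spark2-submit is the command name given earlier in this thread for the gateway. By default the command is only printed; RUN=1 is an assumed switch that would actually launch it.

```shell
#!/bin/sh
# Sketch only: APP_JAR, MAIN_CLASS and the HDFS paths are hypothetical.
APP_JAR="epique-topics.jar"
MAIN_CLASS="fr.epique.TopicDetection"
QUEUE="thequeue"   # placeholder queue name, as in the example above

# $1 = HDFS input path, $2 = HDFS output path
submit_batch() {
    # Dry run by default: prefix the command with echo unless RUN=1.
    if [ "${RUN:-0}" = "1" ]; then launcher=""; else launcher="echo"; fi
    $launcher spark2-submit \
        --class "$MAIN_CLASS" \
        --master yarn \
        --deploy-mode cluster \
        --queue "$QUEUE" \
        "$APP_JAR" "$1" "$2"
}

submit_batch /user/hnaacke/epique/corpus.json /user/hnaacke/epique/topics
```

Keeping --master yarn and --deploy-mode cluster fixed inside the function means every submission places its driver on the cluster, which matches the rule that the gateway itself must not carry computations.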


(Hubert Naacke) #4

@mpanahi,
Is the spark-submit command installed on the gateway?


(Maziyar Panahi) #5

The commands are spark2-submit and spark2-shell, but as I said in my previous post, I am in the middle of upgrading the entire cluster to the newest version, Spark 2.3.

Therefore, no command or operation is working at the moment. You can follow the progress of the upgrade here:


(Maziyar Panahi) #6