Hi!
Because of my recent work on Multivac (Hadoop as a Service), I had to try a few solutions for multi-tenant scientific notebooks where users can log in via LDAP and run their code (Scala, Python and R) on my Hadoop cluster (Spark/Hive/HDFS).
I’ve noticed some interesting approaches taken by the developers to overcome the major issues in multi-user, multi-tenant environments (isolation, security, user impersonation, ownership, etc.).
I can see three main areas when it comes to installing and configuring any of these notebooks:
- Authentication: simple (based on Linux user/group) or LDAP/Active Directory, optionally with Kerberos.
- Spawner: create a new process for each user, create a new server for each user, or create a new Docker container for each user (managed by Kubernetes).
- Connection: how to connect to the different components (HDFS, Spark via Livy or Toree, kernels like R/Python, etc.).
Now the last part can have different names in OpenMOLE (Grid, Cloud, AdHoc, etc.), but I think the rest is pretty much the same.
So I thought we could start by looking at these examples and see what we can learn from them and adopt in OpenMOLE to get a multi-user, multi-tenant feature:
Apache Hue
Apache Zeppelin
JupyterHub
They all support most of the authentication methods (simple, LDAP/AD). They take different approaches when it comes to spawning and connecting to existing services. For instance, JupyterHub spawns a new Jupyter Notebook server for each connected user, or it can spawn a new Docker container (this needs to be installed via a plugin), whereas the other two create a new process on the current system.
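To give an idea of what the authentication and spawner parts look like in practice, here is a rough sketch of a jupyterhub_config.py that uses LDAP for login and one Docker container per user. It assumes the separate ldapauthenticator and dockerspawner packages are installed; the server address, DN template and image name are placeholders, not values from my setup.

```python
# jupyterhub_config.py (sketch only, loaded by JupyterHub at startup)
c = get_config()  # noqa: F821  (get_config is injected by JupyterHub, not importable)

# Authentication: delegate login to an LDAP server.
c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
c.LDAPAuthenticator.server_address = "ldap.example.org"  # placeholder host
c.LDAPAuthenticator.bind_dn_template = [
    "uid={username},ou=people,dc=example,dc=org",  # placeholder DN layout
]

# Spawner: start one Docker container per logged-in user.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/base-notebook"  # placeholder image
```

Without the two spawner lines JupyterHub falls back to its default spawner, which starts a single-user notebook server as a local process.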
We have used Hue and Zeppelin heavily for the last 5 months. Because of the LDAP integration I can control the scheduler/queue on YARN and do resource management easily. It also allows better access control over HDFS ownership and Hive DB/table access. Basically, the user logs in with their LDAP account and everything is taken care of.
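As an illustration of how the logged-in user can drive both impersonation and the YARN queue, here is a minimal sketch, assuming the notebook reaches Spark through Livy. The `proxyUser` and `queue` fields are part of the body Livy accepts on POST /sessions; the group-to-queue mapping and the function name are made up for the example and are not how Hue or Zeppelin do it internally.

```python
import json

# Hypothetical mapping from LDAP group to YARN queue (illustration only).
GROUP_TO_QUEUE = {"research": "research", "students": "students"}
DEFAULT_QUEUE = "default"


def livy_session_request(username, ldap_groups, kind="spark"):
    """Build the JSON body for a Livy POST /sessions call.

    The session runs as `username` (impersonation via proxyUser) and is
    submitted to the queue of the first LDAP group that has a mapping.
    """
    queue = next(
        (GROUP_TO_QUEUE[g] for g in ldap_groups if g in GROUP_TO_QUEUE),
        DEFAULT_QUEUE,
    )
    return {"kind": kind, "proxyUser": username, "queue": queue}


if __name__ == "__main__":
    # Print the body that would be sent to Livy for one example user.
    print(json.dumps(livy_session_request("alice", ["students"])))
```

The point of the sketch is that nothing is chosen by the user at submit time: the identity comes from the LDAP login, and the queue follows from group membership.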
Some screenshots of how they look in my implementations:
Apache Zeppelin:
Apache Hue:
Best,
Maziyar