In 2017, we introduced the Multivac Data Science Lab (Multivac DSL), a set of tools, including interactive notebooks, that runs on top of a dedicated cloud-based Hadoop cluster for Apache Spark jobs.
After a successful test phase, we are now thrilled to announce the release of Multivac Data Science Lab to all our partners!
Multivac DSL Success Stories
Over the past 14 months, we have used Multivac DSL at ISC-PIF for the following use cases:
Machine Learning
- Wikipedia and Web of Science topic modeling by using LDA
- Building a recommendation model based on 100 million Netflix ratings by using ALS
- Outcome prediction with classification and regression (decision trees, random forests, logistic regression, and naive Bayes)
- Clustering keywords and phrases with K-means and Gaussian mixture models (GMMs)
NLP
- Implementing Stanford CoreNLP in Apache Spark for distributed NLP
- Training Universal Dependencies ML models for multilingual part-of-speech tagging on millions of documents
- Implementing distributed NLP pipelines for extracting keywords and phrases from large-scale English and French documents
Graph
- Politoscope community detection (100 million tweets)
- Distributed Louvain algorithm
- Community detection for keywords and topics by using LPA and strongly connected components
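
For readers unfamiliar with LPA (label propagation), here is a minimal single-machine sketch of the idea in plain Python. It is an illustration only, not the Spark code used on Multivac DSL: the function name `label_propagation`, the synchronous update rule, the smallest-label tie-break, and the toy graph are all assumptions made for this example.

```python
from collections import Counter


def label_propagation(edges, max_iters=20):
    """Synchronous label propagation on an undirected graph (illustrative).

    edges: iterable of (u, v) pairs; node ids must be mutually comparable
    because ties are broken by picking the smallest label.
    Returns a dict mapping each node to its final community label.
    """
    # Build an undirected adjacency map.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_iters):
        new_labels = {}
        for node, nbrs in neighbors.items():
            # Adopt the label that is most frequent among the neighbors.
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break  # converged: no node changed its label
        labels = new_labels
    return labels


# Toy graph (hypothetical): two triangles joined by a single edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
communities = {}
for node, label in label_propagation(toy_edges).items():
    communities.setdefault(label, set()).add(node)
print(sorted(sorted(members) for members in communities.values()))
```

Synchronous updates can oscillate on some graphs (bipartite ones, for instance), which is why the sketch caps the number of iterations instead of relying on convergence.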
ETL
- Daily downloading, cleaning, extracting, and transforming of 150-180 million Wikipedia page views (total: 94 billion)
- Extracting and transforming Politoscope data for Apache Hive and Spark SQL
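
To give a feel for the cleaning and transforming step on page-view data, here is a small sketch in plain Python. It assumes each input line has four space-separated fields (project code, page title, view count, response size), which is how the public hourly page-view dumps are commonly laid out; the actual input format and code of our pipeline may differ, and the names `parse_pageview_line` and `daily_totals` are hypothetical.

```python
from collections import Counter


def parse_pageview_line(line):
    """Parse one page-view line; return (project, title, views) or None.

    Assumed layout: "<project> <title> <views> <response_size>".
    Malformed lines are dropped rather than raising an error.
    """
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    if not views.isdigit():
        return None
    return project, title, int(views)


def daily_totals(lines, projects=("en", "fr")):
    """Sum view counts per (project, title), keeping only selected projects."""
    totals = Counter()
    for line in lines:
        record = parse_pageview_line(line)
        if record is None or record[0] not in projects:
            continue
        project, title, views = record
        totals[(project, title)] += views
    return totals


# Made-up sample lines, for illustration only.
sample = [
    "en Main_Page 120 0",
    "en Main_Page 80 0",
    "fr Paris 15 0",
    "de Berlin 30 0",       # filtered out: project not selected
    "broken line",          # dropped: wrong number of fields
]
totals = daily_totals(sample)
print(totals[("en", "Main_Page")], totals[("fr", "Paris")])
```

At the volumes mentioned above, the same parse, filter, and aggregate steps would run as a distributed job rather than a single Python loop.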