Cloud Incident: Multivac DSL degraded performance

cloud
maintenance
multivac-dsl

(Maziyar Panahi) #1

Hi @multivac-dsl

Due to an incident this evening, some of the machines at LAL have stopped and cannot be started. I followed up with the supervisors and they will take a look at it tomorrow morning.

You may experience degraded performance, or your Spark jobs/tasks may not finish, depending on data locality and on how YARN distributes your tasks.

Best,
Maziyar


(Maziyar Panahi) #2

I have re-created the failed instances on our Cloud and joined them to the cluster. We are still investigating why this happened and working to make sure the hypervisors don't kill the instances in the future.

I had to restart the cluster to distribute the configurations to all the servers. You may need to re-launch your apps/jobs.
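
If you want to check that the nodes are back before re-launching, here is a minimal sketch that counts nodes per state using the YARN ResourceManager REST endpoint `/ws/v1/cluster/nodes`. The address in `RM_URL` is a hypothetical placeholder, not our actual ResourceManager, and the helper names `summarize_nodes` and `fetch_nodes` are made up for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Hypothetical address (assumption): replace with the real ResourceManager host and port.
RM_URL = "http://resourcemanager.example:8088"


def summarize_nodes(payload):
    """Count nodes per state from a /ws/v1/cluster/nodes response body."""
    # The response nests the list under "nodes" -> "node"; tolerate an empty cluster.
    nodes = (payload.get("nodes") or {}).get("node") or []
    return Counter(node.get("state", "UNKNOWN") for node in nodes)


def fetch_nodes(rm_url=RM_URL, timeout=10):
    """Fetch the node list from the ResourceManager as a parsed JSON dict."""
    with urlopen(f"{rm_url}/ws/v1/cluster/nodes", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `summarize_nodes(fetch_nodes())` returns a counter keyed by node state (for example `RUNNING` or `LOST`), so you can see at a glance whether any nodes are still missing.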