Group set of tasks in SLURM jobs

simon · 2018-11-09 17:03:54 UTC

It looks like my jobs are running on my SLURM cluster. Some tasks are failing with errors of various types, but in the end the run seems to finish (cf. the picture: I first ran a simple experiment with 3 sets of parameters, which finished fine, and now a much bigger one, which is still running).

The problem is that OpenMOLE launches one task per job, each for a short period of time, which is not very good for a cluster (and I know that on our computer, jobs requesting more CPU * THREAD have priority).

I saw that this problem is mentioned in the doc here, and I tried it, but it didn’t go very well.

Does this

model on env by 100 hook ToStringHook()

indeed mean that the tasks sent to env will be grouped 100 by 100, and is that supposed to work with SLURM?
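To make the question concrete, here is the kind of batching I have in mind, as a minimal bash sketch. It only illustrates the idea of grouping tasks 100 by 100; it is not how OpenMOLE implements `by 100`, and `group_tasks` and every file name in it are hypothetical.

```shell
#!/bin/bash
# Concept sketch only: this is NOT how OpenMOLE implements `by 100`.
# group_tasks and all file names below are hypothetical.

# group_tasks LIST SIZE PREFIX
# Split LIST (one task per line) into files PREFIXaa, PREFIXab, ...
# each holding at most SIZE tasks.
group_tasks() {
    split -l "$2" "$1" "$3"
}

# Demo: 250 one-line tasks grouped 100 by 100 give 3 batches (100, 100, 50),
# so the scheduler would see 3 jobs instead of 250.
workdir=$(mktemp -d)
seq 1 250 | sed 's/^/echo task /' > "$workdir/tasks.txt"
group_tasks "$workdir/tasks.txt" 100 "$workdir/batch_"
wc -l "$workdir"/batch_*
```

Each batch file would then be submitted as a single SLURM job that runs its lines one after the other.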

On the other hand, at my work they developed a tool called GREASY (cf. https://github.com/BSC-Support-Team/GREASY/) that I have used heavily, without any problem, since I started working here.

What it does is take a simple text file with one task per line and spread the tasks over the nodes and threads allocated by SLURM. If there are more tasks than CPU x THREAD, it waits for a task to finish before sending a new one. If at the end some tasks haven’t finished or have failed, it creates a .rst file with the list of those tasks. You then just need to run GREASY again with this .rst task file instead of the original one.

This way the programmer just needs to send one job to SLURM, which looks like this:

#!/bin/bash

#SBATCH -o job-c1c00a56-80d3-49d6-bd1f-360c36c8f965.out
#SBATCH -e job-c1c00a56-80d3-49d6-bd1f-360c36c8f965.err
#SBATCH --nodes=4
#SBATCH --cpus-per-task=48
#SBATCH --time=00:10:00

greasy listofile.txt

with listofile.txt being:

run_00e01f85-183e-4c46-b5c6-69311fdb85a2.sh
run_0193b1af-ed6a-4a31-aa63-1bc0484d6071.sh
run_01c2e7ab-761b-4880-990a-45abae600c26.sh
run_01c5daa6-7e92-4448-883c-c984e6158eac.sh
run_01f9cbe5-bf87-4a4c-a0c6-757aa44e3c1e.sh
run_026d709a-fe3a-4a41-ba06-031a2507bce7.sh
run_037346e1-a71a-41db-a554-c9cc972e3031.sh
...
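The behaviour described above (a fixed pool of workers, a new task started as soon as one finishes, failed tasks collected into a restart list) can be sketched in plain bash. This is not GREASY and does not use its code; it is a rough imitation of the idea, assuming GNU xargs is available. `run_tasks` and the file names are hypothetical.

```shell
#!/bin/bash
# Sketch of a GREASY-like runner. This is NOT GREASY itself, only the same idea.
# run_tasks and the file names are hypothetical. Requires GNU xargs (-d, -P).

# run_tasks TASKS WORKERS RESTART
# Run every line of TASKS as a shell command, at most WORKERS at a time,
# and append each command that exits non-zero to RESTART.
run_tasks() {
    local tasks_file=$1 workers=$2 restart_file=$3
    : > "$restart_file"    # start with an empty restart list
    # xargs -P keeps up to $workers commands running and starts a new one
    # as soon as a slot frees up; -d '\n' treats each line as one task.
    xargs -d '\n' -P "$workers" -I{} \
        sh -c 'sh -c "$2" || echo "$2" >> "$1"' _ "$restart_file" {} \
        < "$tasks_file"
}

# Demo: three tasks, two workers; the second task fails on purpose.
demo=$(mktemp -d)
printf '%s\n' 'true' 'false' 'echo done >/dev/null' > "$demo/tasks.txt"
run_tasks "$demo/tasks.txt" 2 "$demo/tasks.rst"
cat "$demo/tasks.rst"    # prints: false
```

Running `run_tasks` again with the restart file as the task list retries only the tasks that failed, which is the part of the GREASY workflow I find most useful.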

I don’t know much about Scala and I am not sure how much work it would be to adapt your SLURM environment to use GREASY. Do you think I could try, and where should I start?

simon · 2018-11-13 16:26:11 UTC

OK, so the big experiment is not over, but it did generate an output that somehow makes sense:

yuhu!

But as I said earlier, it’s still not finished. There are no more jobs running on the cluster, but the GUI still says it’s running.
What I noticed, though, is that the size of the remote folder $REMOTE_HOME/.openmole is still going down. I am not sure what happened or what is happening here. At some point it reached 20 GB, and now it’s slowly, really slowly, decreasing. As I write, there are still almost 7 GB, 94 hours after the job started. This is what the GUI is telling me right now:

I am not sure what happened with the failed job; when I click on the error in the GUI, it says this:

    java.lang.IllegalStateException: Not connected
    	at net.schmizz.sshj.SSHClient.checkConnected(SSHClient.java:801)
    	at net.schmizz.sshj.SSHClient.newSFTPClient(SSHClient.java:711)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.$anonfun$peerSFTPClient$1(SSHClient.scala:111)
    	at scala.util.Try$.apply(Try.scala:209)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.wrap(SSHClient.scala:104)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.peerSFTPClient$lzycompute(SSHClient.scala:111)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.peerSFTPClient(SSHClient.scala:111)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.withClient(SSHClient.scala:115)
    	at gridscale.ssh.sshj.SSHClient$$anon$1.writeFile(SSHClient.scala:120)
    	at gridscale.ssh.package$SSH.$anonfun$writeFile$1(package.scala:111)
    	at gridscale.ssh.sshj.SSHClient$.sftp(SSHClient.scala:51)
    	at gridscale.ssh.package$SSH.write$1(package.scala:111)
    	at gridscale.ssh.package$SSH.writeFile(package.scala:116)
    	at gridscale.ssh.package$.writeFile(package.scala:305)
    	at org.openmole.plugin.environment.ssh.package$SSHStorage$$anon$1.$anonfun$upload$2(package.scala:61)
    	at org.openmole.plugin.environment.ssh.package$SSHStorage$$anon$1.$anonfun$upload$2$adapted(package.scala:61)
    	at org.openmole.plugin.environment.batch.storage.StorageInterface$.upload(StorageInterface.scala:44)
    	at org.openmole.plugin.environment.ssh.package$SSHStorage$$anon$1.$anonfun$upload$1(package.scala:61)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    	at org.openmole.plugin.environment.batch.environment.AccessControl$.withPermit(AccessControl.scala:17)
    	at org.openmole.plugin.environment.batch.environment.AccessControl.apply(AccessControl.scala:48)
    	at org.openmole.plugin.environment.ssh.package$SSHStorage$$anon$1.upload(package.scala:61)
    	at org.openmole.plugin.environment.ssh.package$SSHStorage$$anon$1.upload(package.scala:45)
    	at org.openmole.plugin.environment.batch.storage.StorageService$.upload(StorageService.scala:55)
    	at org.openmole.plugin.environment.batch.storage.StorageService$.uploadInDirectory(StorageService.scala:64)
    	at org.openmole.plugin.environment.ssh.package$.$anonfun$submitToCluster$3(package.scala:178)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.$anonfun$toReplicatedFile$1(BatchEnvironment.scala:177)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.signalUpload(BatchEnvironment.scala:72)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.uploadReplica$1(BatchEnvironment.scala:177)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.$anonfun$toReplicatedFile$2(BatchEnvironment.scala:179)
    	at org.openmole.core.replication.ReplicaCatalog.uploadAndInsertIfNotInCatalog$1(ReplicaCatalog.scala:143)
    	at org.openmole.core.replication.ReplicaCatalog.uploadAndGetLocked(ReplicaCatalog.scala:178)
    	at org.openmole.core.replication.ReplicaCatalog.$anonfun$uploadAndGet$1(ReplicaCatalog.scala:100)
    	at org.openmole.tool.lock.LockRepository.withLock(LockRepository.scala:54)
    	at org.openmole.core.replication.ReplicaCatalog.uploadAndGet(ReplicaCatalog.scala:93)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.toReplicatedFile(BatchEnvironment.scala:179)
    	at org.openmole.plugin.environment.ssh.package$.replicate$1(package.scala:183)
    	at org.openmole.plugin.environment.ssh.package$.$anonfun$submitToCluster$6(package.scala:187)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.$anonfun$replicateTheRuntime$1(BatchEnvironment.scala:235)
    	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
    	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
    	at scala.collection.TraversableLike.map(TraversableLike.scala:233)
    	at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
    	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.replicateTheRuntime(BatchEnvironment.scala:235)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.$anonfun$serializeJob$1(BatchEnvironment.scala:197)
    	at org.openmole.core.workspace.NewFile.withTmpFile(NewFile.scala:21)
    	at org.openmole.plugin.environment.batch.environment.BatchEnvironment$.serializeJob(BatchEnvironment.scala:188)
    	at org.openmole.plugin.environment.ssh.package$.$anonfun$submitToCluster$2(package.scala:187)
    	at org.openmole.tool.exception.package$.tryOnError(package.scala:6)
    	at org.openmole.plugin.environment.ssh.package$.submitToCluster(package.scala:175)
    	at org.openmole.plugin.environment.slurm.SLURMEnvironment$.submit(SLURMEnvironment.scala:115)
    	at org.openmole.plugin.environment.slurm.SLURMEnvironment.execute(SLURMEnvironment.scala:179)
    	at org.openmole.plugin.environment.batch.refresh.SubmitActor$.receive(SubmitActor.scala:32)
    	at org.openmole.plugin.environment.batch.refresh.JobManager$DispatcherActor$.receive(JobManager.scala:48)
    	at org.openmole.plugin.environment.batch.refresh.JobManager$.$anonfun$dispatch$1(JobManager.scala:56)
    	at org.openmole.core.threadprovider.ThreadProvider$RunClosure.run(ThreadProvider.scala:24)
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)

Obviously I am doing something wrong, and I guess I should configure my SLURM environment in another way. If someone has any idea, I will be really happy to hear it!

simon · 2018-11-14 06:19:57 UTC

Just to keep the thread up to date: I stopped OpenMOLE manually, as Java was using a lot of memory and I needed all my memory for some analysis, and I manually deleted the .openmole folder on the remote computer. But the task was still running this morning.

mcwimm · 2019-08-14 07:18:01 UTC

Hi everyone,

I got the same error (java.lang.IllegalStateException: Not connected) for some of my simulations. Some of the jobs finished and some failed. Now the whole simulation stopped.
In addition, almost the same simulation is running without any problems on the cluster.
Simon, did you solve the issue or does anyone else have an idea what to do?

Best,
marie

simon · 2019-08-14 07:32:38 UTC

Hi,
It’s been a while since I last looked at it, but as far as I remember I never found a proper way to solve that.