Aug-16-2018, 03:00 PM
Hi there,
I'm writing a project that will run on a High-Performance Computing (HPC) cluster. The general architecture of the project is pretty basic:
- Generate the arguments for each process.
- Run multiple parallel processes, each one with one set of the previously generated arguments.
- Collect the results of each process and store them in a file.
This looks like a perfect job for MPI. I've tried mpi4py, and it works pretty well. However, I want the last step (collecting the results) to happen gradually, as soon as each process completes: I basically want a live update of the results file. MPI.gather is therefore not suitable, as it blocks until all results are available. Playing with MPI.recv could work (I haven't tried, but I assume it does), but then I'd use one MPI slot, and therefore one core, just to wait for an I/O event. I could also modify the file from inside each computing process, but I'd risk having thousands of processes trying to edit the same file at the same moment. Locking the file is fine, but the filesystem will cry.
I cannot use the built-in subprocess library either, because the work has to be spread over several nodes (it opens ~2000 processes).
I've looked at the socket library, but it is quite low-level, and I can't imagine that nobody has developed a proper package for this specific situation.
Do you have any suggestions for an adequate library for simple message passing? Or a trick to get partial results out of MPI.gather?
Thanks in advance for your help,
Duna