#380 closed task (duplicate)
Parallelization
Reported by: | Juergen Reuter | Owned by: | kilian, cnspeckn, trudewind |
---|---|---|---|
Priority: | P2 | Milestone: | v2.3.1 |
Component: | core | Version: | 2.0.3 |
Severity: | major | Keywords: | |
Cc: |
Description
Change History (5)
comment:1 Changed 14 years ago by
comment:2 Changed 14 years ago by
With r3104, branches/speckner/openmp_v2 is merged into the trunk. OpenMP support can now be enabled by setting ?omega_openmp = true. This also works on a per-process basis; changing the setting for a process will cause WHIZARD to rebuild the corresponding matrix element, but will not invalidate existing phase space, grids, or events.
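A minimal SINDARIN sketch of the per-process usage (the process names and particle content are made up for illustration):

```sindarin
! Enable OpenMP-parallelized matrix elements before declaring a process:
?omega_openmp = true
process eeww = e1, E1 => Wp, Wm   ! hypothetical process

! ... or switch it off again for a single process; only the affected
! matrix element gets rebuilt, phase space and grids stay valid:
?omega_openmp = false
process eeuu = e1, E1 => u, U     ! hypothetical process
```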
Keep in mind that you have to compile both the matrix element and WHIZARD with OpenMP flags (-fopenmp for gfortran) for the parallelization to work correctly; enabling the flags only for the matrix element means certain doom (a corresponding warning is printed when matrix elements with OpenMP support are generated). For the moment, I have disabled the configure options referring to the previous OpenMP implementation, but the code is just commented out and still there if we choose to reactivate it in the future.
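For gfortran, a build along these lines should satisfy that requirement (a sketch; nothing beyond the FCFLAGS setting is taken from this ticket):

```shell
# Pass the OpenMP flag to the Fortran compiler for the WHIZARD build;
# the same flag must also be active later when the generated matrix
# element code is compiled, otherwise you hit the "certain doom" case.
./configure FC=gfortran FCFLAGS="-fopenmp"
make
make install
```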
The runtime memory consumption is determined by the number of threads, as the OpenMP implementation has to duplicate the wavefunctions and brakets for each thread, but it shouldn't be significantly higher than that of the serial version. However, compiling with -fopenmp will cause more data to end up on the stack to ensure reentrancy; this can trigger segfaults for very complicated processes, which then require the stack limit to be raised via ulimit -s (happened to me once).
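If such a segfault occurs, raising the soft stack limit in the shell that launches WHIZARD is the usual remedy, e.g.:

```shell
# Show the current soft stack limit (in kB):
ulimit -s
# Raise it for this shell session (requires the hard limit to allow it):
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit"
# Verify the value that is now in effect:
ulimit -s
```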
Documentation on this feature is still missing in the manual, but I will add it next week.
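Once both WHIZARD and the matrix element are built with OpenMP, the thread count is chosen at run time through the standard OMP_NUM_THREADS environment variable; a timing comparison could be scripted along these lines (the steering file name is hypothetical):

```shell
# Run the same steering file with 1, 2 and 4 threads and time each run;
# the OpenMP runtime reads OMP_NUM_THREADS when the process starts.
for n in 1 2 4; do
    OMP_NUM_THREADS=$n /usr/bin/time -p whizard run.sin
done
```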
comment:3 Changed 14 years ago by
Owner: | changed from kilian to kilian, cnspeckn, trudewind |
---|
comment:4 Changed 14 years ago by
Resolution: | → duplicate |
---|---|
Status: | new → closed |
As someone has opened another, more specialized ticket, I am closing this one as a duplicate.
After running a couple of processes and measuring the execution time as a function of OMP_NUM_THREADS, I attach the results here. The tests were done on a quad-core Core2 @ 2.66GHz using my branches/speckner/openmp_v2 OpenMP implementation. They demonstrate that the speedup is highly dependent on the process under consideration, ranging from a factor of nearly 3 (for four threads) to no improvement at all. While the results indicate that this piece of parallelization really pays off for high-multiplicity (final state >= 4) processes with big flavor sums (which is good imho), they also suggest that the complexity of phase space generation (generating points, not the maps) grows faster with the number of external legs than that of the matrix elements. Also, for some processes, we seem to be entirely dominated by phase space. Interestingly, this conclusion is also bolstered by monitoring the CPU time consumed by the threads in top: for processes with a significant speedup, this is between 80% and 100% of a single core each, while for others it can be less than 10% (I am not sure about the user value given by time in the results below; I have a feeling it might be off). I think detailed profiling for different processes would be very interesting to confirm whether those assertions are indeed correct.

At the moment, the parallelization is still a bit cumbersome to use: you have to build WHIZARD with
-fopenmp (this is vital especially for O'Mega, as all functions called in the parallel section have to be reentrant, which is guaranteed by -fopenmp). When running WHIZARD, intercept it immediately after it calls O'Mega, paste the command into the terminal, append -target:openmp, and rerun WHIZARD with --recompile. The current implementation no longer has a hardcoded limit on the number of threads. It doesn't hurt to check the number of threads actually running via top :).

I think that this implementation could be merged into the trunk after I expose the functionality in a more convenient manner (i.e. a
?omega_support_openmp
-like flag in SINDARIN); what's your opinion on that? Also, JR, if you're interested, it would be interesting to see how the heavier processes scale with the number of threads on the mighty DESY multicore machines...

OK, now for those results:
Process 1
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 2
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 3
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 4
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 5
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 6
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 7
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :
Process 8
OMP_NUM_THREADS = 4 :
OMP_NUM_THREADS = 2 :
OMP_NUM_THREADS = 1 :