Opened 11 years ago
Closed 10 years ago
#521 closed task (wontfix)
Profiling
Reported by: | kilian | Owned by: | kilian |
---|---|---|---|
Priority: | P4 | Milestone: | v2.2.6 |
Component: | core | Version: | 2.2.0beta |
Severity: | blocker | Keywords: | |
Cc: |
Description
After the major refactoring of 2.2 is done, we should re-examine important use cases and check the performance. If there is considerable loss against 2.1, or obvious room for improvement, we should try to optimize the code.
This ticket replaces #515.
Attachments (5)
Change History (32)
comment:1 Changed 11 years ago by
comment:2 Changed 11 years ago by
So this is a problem of the Openmp configuration? If both settings are identical (both w/ or both w/o), what happens?
comment:3 Changed 11 years ago by
My suggestion would be to go as fast as possible for the 2.2.0 release and leave possible profiling and optimizations for 2.2.1 etc. Opinions?
comment:4 Changed 11 years ago by
Milestone: | v2.2.0 → v2.2.1 |
---|
This is important, but will be done after the release. Re-validation comes for the release 2.2.0, but extensive profiling after that. Moving to 2.2.1.
comment:5 Changed 11 years ago by
Milestone: | v2.2.1 → v2.2.0 |
---|---|
Priority: | P5 → P0 |
Severity: | normal → blocker |
Ok, here is a definite example. Take LEP_cc10.sin from the examples. Compare the time estimates: 2.1.1 3s, 2.2.0 3m18s = 198 s. This is almost a factor 100!!!!!! 2.1.1:
| Generating phase space configuration ... | ... done. | ... found 25 phase space channels, collected in 7 groves. | Phase space: found 25 equivalences between channels. | Wrote phase-space configuration file 'cc10.phs'. | iterations = 12:20000 Warning: No cuts have been defined. | Creating VAMP integration grids: | Using phase-space channel equivalences. | 20000 calls, 25 channels, 8 dimensions, 20 bins, stratified = T | Integrating process 'cc10': |=============================================================================| | It Calls Integral[fb] Error[fb] Err[%] Acc Eff[%] Chi2 N[It] | |=============================================================================| 1 20000 6.5695840E+02 2.13E+01 3.24 4.58* 2.57 2 20000 6.8501088E+02 9.12E+00 1.33 1.88* 5.85 3 20000 6.7952599E+02 5.59E+00 0.82 1.16* 8.34 4 20000 6.8633540E+02 3.91E+00 0.57 0.81* 12.70 5 20000 6.8632392E+02 3.01E+00 0.44 0.62* 15.08 6 20000 6.8424494E+02 2.46E+00 0.36 0.51* 16.95 7 20000 6.8345396E+02 2.11E+00 0.31 0.44* 16.99 8 20000 6.8317405E+02 1.87E+00 0.27 0.39* 16.99 9 20000 6.8429917E+02 1.69E+00 0.25 0.35* 16.87 10 20000 6.8427652E+02 1.55E+00 0.23 0.32* 16.07 11 20000 6.8428985E+02 1.44E+00 0.21 0.30* 16.27 12 20000 6.8407247E+02 1.35E+00 0.20 0.28* 16.11 |-----------------------------------------------------------------------------| 12 240000 6.8411072E+02 6.09E-01 0.09 0.44 16.11 0.32 12 |-----------------------------------------------------------------------------| |=============================================================================| 12 240000 6.8411072E+02 6.09E-01 0.09 0.44 16.11 0.32 12 |=============================================================================| | Process 'cc10': | time estimate for generating 10000 unweighted events: 3s
2.2.0:
| Phase space: generating configuration ... | Phase space: ... success. | Phase space: writing configuration file 'cc10_i1.phs' | Phase space: 25 channels, 8 dimensions | Phase space: found 25 channels, collected in 7 groves. | Phase space: Using 25 equivalences between channels. | Phase space: wood Warning: No cuts have been defined. | Starting integration for process 'cc10' | Integrator: 7 chains, 25 channels, 8 dimensions | Integrator: Using VAMP channel equivalences | Integrator: 20000 initial calls, 20 bins, stratified = T | Integrator: VAMP |=============================================================================| | It Calls Integral[fb] Error[fb] Err[%] Acc Eff[%] Chi2 N[It] | |=============================================================================| 1 19975 7.0496249E+02 2.38E+01 3.38 4.77* 2.48 2 19975 6.2184979E+02 2.09E+01 3.35 4.74* 2.02 3 19975 7.0348192E+02 2.24E+01 3.18 4.49* 1.99 4 19975 6.8583643E+02 2.22E+01 3.23 4.57 1.81 5 19975 6.8126661E+02 2.30E+01 3.37 4.76 1.73 6 19975 6.5915690E+02 2.21E+01 3.36 4.75* 1.62 7 19975 6.5849735E+02 2.16E+01 3.28 4.64* 1.56 8 19975 7.0228739E+02 2.40E+01 3.41 4.83 1.57 9 19975 6.9059489E+02 2.34E+01 3.39 4.79* 1.52 10 19975 6.8587185E+02 2.35E+01 3.43 4.85 1.49 11 19975 6.9509147E+02 2.26E+01 3.25 4.59* 1.50 12 19975 7.2011835E+02 2.54E+01 3.52 4.98 1.42 |-----------------------------------------------------------------------------| 12 239700 6.8189659E+02 6.59E+00 0.97 4.73 1.42 1.45 12 |=============================================================================| | Time estimate for generating 10000 events: 0d:00h:03m:18s
Phase spaces are the same up to trivial line breaks:
$ diff cc10_i1.phs ../../../../../2.1.1_test/whizard-2.1.1/build/examples/LEPcc10/cc10.phs 1c1 < process cc10_i1 --- > process cc10 8,16c8,16 < md5sum_process = "BEAEE67601C195D7CEA5D12BFFE6F125" < md5sum_model_par = "CB982AB3CBDF08043BD80FA44FE04E3B" < md5sum_phs_config = "5488B4D53403C6216CE17E6A2E3901EF" < sqrts = 2.090000000000E+02 < m_threshold_s = 5.000000000000E+01 < m_threshold_t = 1.000000000000E+02 < off_shell = 2 < t_channel = 6 < keep_nonresonant = T --- > md5sum_process = "37D61ED7AFF25F68DB146812B470588B" > md5sum_model = "A63ED9307746E1CDF8537B59D7F20C87" > md5sum_parameters = "3F8EADC5A4FFC4626E927A01DB137B6D" > sqrts = 209.00000000000000 > m_threshold_s = 50.000000000000000 > m_threshold_t = 100.00000000000000 > off_shell = 2 > t_channel = 6 > keep_nonresonant = T 20,22c20,21 < ! Channel #1 < tree 3 12 15 < map 3 s_channel 24 ! W+ --- > ! Channel # 1 > tree 3 12 15 map 3 s_channel 24 ! W+ 25c24 < ! Multiplicity = 3, 1 resonance, 0 logs, 2 off-shell, s-channel graph --- > ! Multiplicity = 3, 1 resonance, 0 logs, 2 off-shell, s-channel graph 27,44c26,37 < ! Channel #2 < tree 3 12 15 < map 12 s_channel 24 ! W+ < ! Channel #3 < tree 3 12 15 < map 3 s_channel 24 ! W+ < ! Channel #4 < tree 3 7 15 < map 3 s_channel 24 ! W+ < ! Channel #5 < tree 3 11 15 < map 3 s_channel 24 ! W+ < ! Channel #6 < tree 12 13 15 < map 12 s_channel 24 ! W+ < ! Channel #7 < tree 12 14 15 < map 12 s_channel 24 ! W+ --- > ! Channel # 2 > tree 3 12 15 map 12 s_channel 24 ! W+ > ! Channel # 3 > tree 3 12 15 map 3 s_channel 24 ! W+ > ! Channel # 4 > tree 3 7 15 map 3 s_channel 24 ! W+ > ! Channel # 5 > tree 3 11 15 map 3 s_channel 24 ! W+ > ! Channel # 6 > tree 12 13 15 map 12 s_channel 24 ! W+ > ! Channel # 7 > tree 12 14 15 map 12 s_channel 24 ! W+ 48,50c41,42 < ! Channel #8 < tree 3 12 15 < map 12 s_channel 24 ! W+ --- > ! Channel # 8 > tree 3 12 15 map 12 s_channel 24 ! W+ 52,70c44,45 < ! Channel #9 < tree 3 12 15 < map 3 s_channel 24 ! W+ < map 15 s_channel 23 ! Z < ! Channel #10 < tree 3 7 15 < map 3 s_channel 24 ! W+ < map 15 s_channel 23 ! Z < ! Channel #11 < tree 3 11 15 < map 3 s_channel 24 ! W+ < map 15 s_channel 23 ! Z < ! Channel #12 < tree 12 13 15 < map 12 s_channel 24 ! W+ < map 15 s_channel 23 ! Z < ! Channel #13 < tree 12 14 15 < map 12 s_channel 24 ! W+ --- > ! Channel # 9 > tree 3 12 15 map 3 s_channel 24 ! W+ 72,76c47,48 < < ! Multiplicity = 1, 1 resonance, 0 logs, 0 off-shell, s-channel graph < grove #4 < ! Channel #14 < tree 3 7 15 --- > ! Channel # 10 > tree 3 7 15 map 3 s_channel 24 ! W+ 78,79c50,51 < ! Channel #15 < tree 3 11 15 --- > ! Channel # 11 > tree 3 11 15 map 3 s_channel 24 ! W+ 81,82c53,54 < ! Channel #16 < tree 12 13 15 --- > ! Channel # 12 > tree 12 13 15 map 12 s_channel 24 ! W+ 84,85c56,57 < ! Channel #17 < tree 12 14 15 --- > ! Channel # 13 > tree 12 14 15 map 12 s_channel 24 ! W+ 88c60,71 < ! Multiplicity = 2, 2 resonances, 1 logs, 0 off-shell, 1 t-channel line --- > ! Multiplicity = 1, 1 resonance, 0 logs, 0 off-shell, s-channel graph > grove #4 > ! Channel # 14 > tree 3 7 15 map 15 s_channel 23 ! Z > ! Channel # 15 > tree 3 11 15 map 15 s_channel 23 ! Z > ! Channel # 16 > tree 12 13 15 map 15 s_channel 23 ! Z > ! Channel # 17 > tree 12 14 15 map 15 s_channel 23 ! Z > > ! Multiplicity = 2, 2 resonances, 1 logs, 0 off-shell, 1 t-channel line 90,92c73,74 < ! Channel #18 < tree 3 12 28 < map 3 s_channel 24 ! W+ --- > ! Channel # 18 > tree 3 12 28 map 3 s_channel 24 ! W+ 95,97c77,78 < ! Channel #19 < tree 3 12 35 < map 3 s_channel 24 ! W+ --- > ! Channel # 19 > tree 3 12 35 map 3 s_channel 24 ! W+ 101c82 < ! Multiplicity = 3, 1 resonance, 1 logs, 1 off-shell, 1 t-channel line --- > ! Multiplicity = 3, 1 resonance, 1 logs, 1 off-shell, 1 t-channel line 103,105c84,85 < ! Channel #20 < tree 3 12 28 < map 12 s_channel 24 ! W+ --- > ! Channel # 20 > tree 3 12 28 map 12 s_channel 24 ! W+ 107,109c87,88 < ! Channel #21 < tree 3 12 28 < map 3 s_channel 24 ! W+ --- > ! Channel # 21 > tree 3 12 28 map 3 s_channel 24 ! W+ 111,113c90,91 < ! Channel #22 < tree 3 12 35 < map 3 s_channel 24 ! W+ --- > ! Channel # 22 > tree 3 12 35 map 3 s_channel 24 ! W+ 115,117c93,94 < ! Channel #23 < tree 3 12 35 < map 12 s_channel 24 ! W+ --- > ! Channel # 23 > tree 3 12 35 map 12 s_channel 24 ! W+ 120c97 < ! Multiplicity = 4, no resonances, 1 logs, 2 off-shell, 1 t-channel line --- > ! Multiplicity = 4, no resonances, 1 logs, 2 off-shell, 1 t-channel line 122,127c99,102 < ! Channel #24 < tree 3 12 28 < map 28 t_channel 12 ! nue < ! Channel #25 < tree 3 12 35 < map 35 t_channel 12 ! nue --- > ! Channel # 24 > tree 3 12 28 map 28 t_channel 12 ! nue > ! Channel # 25 > tree 3 12 35 map 35 t_channel 12 ! nue
I attached the VAMP grids, and they are hilariously different!
comment:6 Changed 11 years ago by
One apparent difference: in the 2.1.1 version of the grids, there are three columns for d%x while there is only one for the 2.2.0 version.
comment:7 Changed 11 years ago by
The bug is in the example script:
Change
iterations = 12:20000
to, e.g.
iterations = 12:20000:"gw"
We agreed in a recent meeting that in the last integration pass, VAMP by default should not adapt grids or weights anymore. Otherwise, the iterations are correlated and the error estimate is incorrect. This is implemented in 2.2.0.
So, either include a warmup pass in the iterations, or explicitly state that grids and weights should be adapted.
When I make this change, and set identical compilation flags for both 2.1.1 and 2.2.0, the time is the same.
comment:8 Changed 11 years ago by
... please x-check.
happened, that my 2.1.1 installation doesn't work. But I do get a 3s time estimate with 2.2.0, on my laptop.
comment:9 Changed 11 years ago by
Ok, with that I get 22s, but 2.2.0 still is a _LOT_ slower than 2.1.1: running the same process on my laptop: 2.1.1:
real 0m16.019s user 0m15.319s sys 0m0.451s
2.2.0:
real 1m47.599s user 0m58.758s sys 1m53.466s
So this is a factor 6-7. :((( What the hell happened?
comment:10 Changed 11 years ago by
The result for 2.1.1 is without OpenMP actually. Setting the threads to 1 didn't change ANYTHING. Is it the helicity selection rules? what can it be? trying to compile without OpenMP now. Desastrous.
comment:11 Changed 11 years ago by
Switching the helicity selection off by ?helicity_selection_active = false does only increase the user time:
real 1m47.507s user 1m40.928s sys 1m47.482s
???
comment:12 Changed 11 years ago by
Ok, the effect is also there for 2.1.1, w\ OpenMP and helicity selection:
real 0m45.592s user 0m44.896s sys 0m0.467s
comment:13 Changed 11 years ago by
Priority: | P0 → P2 |
---|---|
Status: | new → assigned |
So, if openmp is turned off, this appears to be a factor 1-3 ...?
I'll do some proper profiling with static executables, after the new bugs (#597, #599) have been fixed. Independently, we have to carefully check whether openmp settings are respected as they should.
comment:14 Changed 11 years ago by
I ran the CC10 example with the following files;
model = SM process cc10 = e1, E1 => e2, N2, u, D compile as "blubb.exe" ()
and then
./blubb.exe cc10.sin
with the file cc10.sin
as
model = SM library = "blubb_lib" show (library) process cc10 = e1, E1 => e2, N2, u, D compile show (library) # We set the muon mass to zero. mmu = 0 sqrts = 209 GeV integrate (cc10) { iterations = 12:20000:"gw" } n_events = 50000 simulate (cc10)
The gprof analysis files for 2.1.1 and 2.2.0r5302 are attached.
Changed 11 years ago by
Attachment: | gprof.220.gz added |
---|
gprof analysis file (gzipped) for 2.2.0r5302.
comment:15 Changed 11 years ago by
Priority: | P2 → P0 |
---|
When comparing the examples, I figured that 2.2 is really awfully slow. Should be one of the top prios.
comment:16 Changed 11 years ago by
Given the gprof files, I have visualized the profiler output via http://code.google.com/p/jrfonseca/wiki/Gprof2Dot and attached the resulting pdfs. Obviously, the helicity selection rules seem to be working fine as the number of matrix element evaluations is even slightly reduced.
2.2.0 2.1.1 calculate_amplitudes 434 421 (19.90 %) 445 344 (25.15 %) phs_tree_compute_x_from_momenta 13 467 051 (21.21 %) 13 805 664 (31.08 %) process_compute_vamp_phs_factor 457 688 (19.60 %) 445 344 (21.79 %) == mci_vamp_instance_compute_weight
One can see that these three big blocks from 2.1.1 add up to only 60.71 % in 2.2.0 instead of 78.02 % despite comparable functions calls. Whizard spends now more time in functions which are not in the red marked boxes in analysis220.pdf
.
comment:17 Changed 11 years ago by
Got some preliminary results: yes, I see a few redundancies, none of them dramatic, but they appear to add up. Investigating further ...
comment:18 Changed 11 years ago by
Priority: | P0 → P1 |
---|
Up to r5416, the main sources of this discrepancy should be eliminated. There is still some overhead, but this should scale linearly. (I see some obvious improvements there, so I keep this ticket but can rank it down.)
comment:19 Changed 11 years ago by
New task in this context: rewrite the function phs_forest_evaluate_other_channels, such that (a) OpenMP works correctly, and (b) the funny code in the first loop is eliminated (trivial).
comment:20 follow-up: 21 Changed 11 years ago by
HERA_dis.sin is an example where by eye one can see the slowness of 2.2 compared to 2.1.
comment:21 Changed 11 years ago by
Priority: | P1 → P4 |
---|
Replying to jr_reuter:
HERA_dis.sin is an example where by eye one can see the slowness of 2.2 compared to 2.1.
No. In my test (gfortran 4.8.0, -g -O2, OpenMP off), 2.2.0 is actually slightly faster. The "wg" flags were missing in the integration command, again.
I write a separate ticket for those flags (#622).
comment:22 Changed 11 years ago by
Milestone: | v2.2.0 → v2.2.1 |
---|---|
Version: | 2.1.1 → 2.2.0beta |
After discussion with WK: done sufficiently enough for a X.X.0 release. More later.
comment:23 Changed 11 years ago by
Milestone: | v2.2.2 → v2.2.3 |
---|
comment:24 Changed 10 years ago by
Milestone: | v2.2.3 → v2.2.4 |
---|
Is there something to be done at the moment, or should we react to particular user demands!?
comment:25 Changed 10 years ago by
That's more like a golden classics ...
Profiling the whole thing from time to time might reveal surprises.
comment:26 Changed 10 years ago by
Milestone: | v2.2.5 → v2.2.6 |
---|
comment:27 Changed 10 years ago by
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
Not relevant at the moment.
Again, when --enable-fc-openmp is set, I see 2.2.0 being very slow. Should be investigated. For
I get with W2.1.1 (no openmp)
and with 2.2.0 (openmp enabled)
(sic!).