whizard is hosted by Hepforge, IPPP Durham

Opened 11 years ago

Closed 10 years ago

#521 closed task (wontfix)

Profiling

Reported by: kilian Owned by: kilian
Priority: P4 Milestone: v2.2.6
Component: core Version: 2.2.0beta
Severity: blocker Keywords:
Cc:

Description

After the major refactoring of 2.2 is done, we should re-examine important use cases and check the performance. If there is considerable loss against 2.1, or obvious room for improvement, we should try to optimize the code.

This ticket replaces #515.

Attachments (5)

cc10_2.2.0.vg.tgz (8.8 KB) - added by Juergen Reuter 11 years ago.
2.2.0 VAMP grids
cc10_2.1.1.vg.tgz (139.1 KB) - added by Juergen Reuter 11 years ago.
2.1.1 VAMP grids
gprof.220.gz (137.6 KB) - added by Juergen Reuter 11 years ago.
gprof analysis file (gzipped) for 2.2.0r5302.
gprof.211.gz (104.7 KB) - added by Juergen Reuter 11 years ago.
gprof analysis file for 2.1.1
analysis.tar.gz (113.3 KB) - added by Bijan Chokoufe Nejad 11 years ago.
Visualization of gprof via dot

Download all attachments as: .zip

Change History (32)

comment:1 Changed 11 years ago by Juergen Reuter

Again, when --enable-fc-openmp is set, I see 2.2.0 being very slow. Should be investigated. For

model = SM
process foo = e1, E1 => e2, E2
sqrts = 1 TeV
integrate (foo)

I get with W2.1.1 (no openmp)

real	0m2.188s
user	0m1.725s
sys	0m0.297s

and with 2.2.0 (openmp enabled)

real	0m12.947s
user	0m5.258s
sys	0m12.817s

(sic!).

comment:2 Changed 11 years ago by kilian

So this is a problem of the Openmp configuration? If both settings are identical (both w/ or both w/o), what happens?

comment:3 Changed 11 years ago by Juergen Reuter

My suggestion would be to go as fast as possible for the 2.2.0 release and leave possible profiling and optimizations for 2.2.1 etc. Opinions?

comment:4 Changed 11 years ago by Juergen Reuter

Milestone: v2.2.0v2.2.1

This is important, but will be done after the release. Re-validation comes for the release 2.2.0, but extensive profiling after that. Moving to 2.2.1.

comment:5 Changed 11 years ago by Juergen Reuter

Milestone: v2.2.1v2.2.0
Priority: P5P0
Severity: normalblocker

Ok, here is a definite example. Take LEP_cc10.sin from the examples. Compare the time estimates: 2.1.1 3s, 2.2.0 3m18s = 198 s. This is almost a factor 100!!!!!! 2.1.1:

| Generating phase space configuration ...
| ... done.
| ... found 25 phase space channels, collected in 7 groves.
| Phase space: found 25 equivalences between channels.
| Wrote phase-space configuration file 'cc10.phs'.
| iterations = 12:20000
Warning: No cuts have been defined.
| Creating VAMP integration grids:
| Using phase-space channel equivalences.
| 20000 calls, 25 channels, 8 dimensions, 20 bins, stratified = T
| Integrating process 'cc10':
|=============================================================================|
| It      Calls  Integral[fb]  Error[fb]   Err[%]    Acc  Eff[%]   Chi2 N[It] |
|=============================================================================|
   1      20000  6.5695840E+02  2.13E+01    3.24    4.58*   2.57
   2      20000  6.8501088E+02  9.12E+00    1.33    1.88*   5.85
   3      20000  6.7952599E+02  5.59E+00    0.82    1.16*   8.34
   4      20000  6.8633540E+02  3.91E+00    0.57    0.81*  12.70
   5      20000  6.8632392E+02  3.01E+00    0.44    0.62*  15.08
   6      20000  6.8424494E+02  2.46E+00    0.36    0.51*  16.95
   7      20000  6.8345396E+02  2.11E+00    0.31    0.44*  16.99
   8      20000  6.8317405E+02  1.87E+00    0.27    0.39*  16.99
   9      20000  6.8429917E+02  1.69E+00    0.25    0.35*  16.87
  10      20000  6.8427652E+02  1.55E+00    0.23    0.32*  16.07
  11      20000  6.8428985E+02  1.44E+00    0.21    0.30*  16.27
  12      20000  6.8407247E+02  1.35E+00    0.20    0.28*  16.11
|-----------------------------------------------------------------------------|
  12     240000  6.8411072E+02  6.09E-01    0.09    0.44   16.11    0.32  12
|-----------------------------------------------------------------------------|
|=============================================================================|
  12     240000  6.8411072E+02  6.09E-01    0.09    0.44   16.11    0.32  12
|=============================================================================|
| Process 'cc10':
|    time estimate for generating 10000 unweighted events: 3s

2.2.0:

| Phase space: generating configuration ...
| Phase space: ... success.
| Phase space: writing configuration file 'cc10_i1.phs'
| Phase space: 25 channels, 8 dimensions
| Phase space: found 25 channels, collected in 7 groves.
| Phase space: Using 25 equivalences between channels.
| Phase space: wood
Warning: No cuts have been defined.
| Starting integration for process 'cc10'
| Integrator: 7 chains, 25 channels, 8 dimensions
| Integrator: Using VAMP channel equivalences
| Integrator: 20000 initial calls, 20 bins, stratified = T
| Integrator: VAMP
|=============================================================================|
| It      Calls  Integral[fb]  Error[fb]   Err[%]    Acc  Eff[%]   Chi2 N[It] |
|=============================================================================|
   1      19975  7.0496249E+02  2.38E+01    3.38    4.77*   2.48
   2      19975  6.2184979E+02  2.09E+01    3.35    4.74*   2.02
   3      19975  7.0348192E+02  2.24E+01    3.18    4.49*   1.99
   4      19975  6.8583643E+02  2.22E+01    3.23    4.57    1.81
   5      19975  6.8126661E+02  2.30E+01    3.37    4.76    1.73
   6      19975  6.5915690E+02  2.21E+01    3.36    4.75*   1.62
   7      19975  6.5849735E+02  2.16E+01    3.28    4.64*   1.56
   8      19975  7.0228739E+02  2.40E+01    3.41    4.83    1.57
   9      19975  6.9059489E+02  2.34E+01    3.39    4.79*   1.52
  10      19975  6.8587185E+02  2.35E+01    3.43    4.85    1.49
  11      19975  6.9509147E+02  2.26E+01    3.25    4.59*   1.50
  12      19975  7.2011835E+02  2.54E+01    3.52    4.98    1.42
|-----------------------------------------------------------------------------|
  12     239700  6.8189659E+02  6.59E+00    0.97    4.73    1.42    1.45  12
|=============================================================================|
| Time estimate for generating 10000 events: 0d:00h:03m:18s

Phase spaces are the same up to trivial line breaks:

$ diff cc10_i1.phs ../../../../../2.1.1_test/whizard-2.1.1/build/examples/LEPcc10/cc10.phs 
1c1
<  process cc10_i1
---
>  process cc10
8,16c8,16
<    md5sum_process    = "BEAEE67601C195D7CEA5D12BFFE6F125"
<    md5sum_model_par  = "CB982AB3CBDF08043BD80FA44FE04E3B"
<    md5sum_phs_config = "5488B4D53403C6216CE17E6A2E3901EF"
<    sqrts         =  2.090000000000E+02
<    m_threshold_s =  5.000000000000E+01
<    m_threshold_t =  1.000000000000E+02
<    off_shell = 2
<    t_channel = 6
<    keep_nonresonant = T
---
>    md5sum_process    = "37D61ED7AFF25F68DB146812B470588B"
>    md5sum_model      = "A63ED9307746E1CDF8537B59D7F20C87"
>    md5sum_parameters = "3F8EADC5A4FFC4626E927A01DB137B6D"
>    sqrts         =    209.00000000000000     
>    m_threshold_s =    50.000000000000000     
>    m_threshold_t =    100.00000000000000     
>    off_shell =            2
>    t_channel =            6
>    keep_nonresonant =  T
20,22c20,21
<  ! Channel #1
<    tree  3 12 15
<    map   3 s_channel      24 ! W+
---
>  ! Channel #            1
>    tree  3 12 15   map   3 s_channel      24 ! W+
25c24
<  ! Multiplicity = 3, 1 resonance,   0 logs,  2 off-shell,  s-channel graph
---
>  ! Multiplicity = 3,  1 resonance,   0 logs,  2 off-shell,  s-channel graph
27,44c26,37
<  ! Channel #2
<    tree  3 12 15
<    map  12 s_channel      24 ! W+
<  ! Channel #3
<    tree  3 12 15
<    map   3 s_channel      24 ! W+
<  ! Channel #4
<    tree  3 7 15
<    map   3 s_channel      24 ! W+
<  ! Channel #5
<    tree  3 11 15
<    map   3 s_channel      24 ! W+
<  ! Channel #6
<    tree  12 13 15
<    map  12 s_channel      24 ! W+
<  ! Channel #7
<    tree  12 14 15
<    map  12 s_channel      24 ! W+
---
>  ! Channel #            2
>    tree  3 12 15   map  12 s_channel      24 ! W+
>  ! Channel #            3
>    tree  3 12 15   map   3 s_channel      24 ! W+
>  ! Channel #            4
>    tree  3 7 15   map   3 s_channel      24 ! W+
>  ! Channel #            5
>    tree  3 11 15   map   3 s_channel      24 ! W+
>  ! Channel #            6
>    tree  12 13 15   map  12 s_channel      24 ! W+
>  ! Channel #            7
>    tree  12 14 15   map  12 s_channel      24 ! W+
48,50c41,42
<  ! Channel #8
<    tree  3 12 15
<    map  12 s_channel      24 ! W+
---
>  ! Channel #            8
>    tree  3 12 15   map  12 s_channel      24 ! W+
52,70c44,45
<  ! Channel #9
<    tree  3 12 15
<    map   3 s_channel      24 ! W+
<    map  15 s_channel      23 ! Z
<  ! Channel #10
<    tree  3 7 15
<    map   3 s_channel      24 ! W+
<    map  15 s_channel      23 ! Z
<  ! Channel #11
<    tree  3 11 15
<    map   3 s_channel      24 ! W+
<    map  15 s_channel      23 ! Z
<  ! Channel #12
<    tree  12 13 15
<    map  12 s_channel      24 ! W+
<    map  15 s_channel      23 ! Z
<  ! Channel #13
<    tree  12 14 15
<    map  12 s_channel      24 ! W+
---
>  ! Channel #            9
>    tree  3 12 15   map   3 s_channel      24 ! W+
72,76c47,48
< 
<  ! Multiplicity = 1, 1 resonance,   0 logs,  0 off-shell,  s-channel graph
<  grove #4
<  ! Channel #14
<    tree  3 7 15
---
>  ! Channel #           10
>    tree  3 7 15   map   3 s_channel      24 ! W+
78,79c50,51
<  ! Channel #15
<    tree  3 11 15
---
>  ! Channel #           11
>    tree  3 11 15   map   3 s_channel      24 ! W+
81,82c53,54
<  ! Channel #16
<    tree  12 13 15
---
>  ! Channel #           12
>    tree  12 13 15   map  12 s_channel      24 ! W+
84,85c56,57
<  ! Channel #17
<    tree  12 14 15
---
>  ! Channel #           13
>    tree  12 14 15   map  12 s_channel      24 ! W+
88c60,71
<  ! Multiplicity = 2, 2 resonances,  1 logs,  0 off-shell,  1 t-channel line
---
>  ! Multiplicity = 1,  1 resonance,   0 logs,  0 off-shell,  s-channel graph
>  grove #4
>  ! Channel #           14
>    tree  3 7 15   map  15 s_channel      23 ! Z
>  ! Channel #           15
>    tree  3 11 15   map  15 s_channel      23 ! Z
>  ! Channel #           16
>    tree  12 13 15   map  15 s_channel      23 ! Z
>  ! Channel #           17
>    tree  12 14 15   map  15 s_channel      23 ! Z
> 
>  ! Multiplicity = 2, 2 resonances,  1 logs,  0 off-shell,   1 t-channel line
90,92c73,74
<  ! Channel #18
<    tree  3 12 28
<    map   3 s_channel      24 ! W+
---
>  ! Channel #           18
>    tree  3 12 28   map   3 s_channel      24 ! W+
95,97c77,78
<  ! Channel #19
<    tree  3 12 35
<    map   3 s_channel      24 ! W+
---
>  ! Channel #           19
>    tree  3 12 35   map   3 s_channel      24 ! W+
101c82
<  ! Multiplicity = 3, 1 resonance,   1 logs,  1 off-shell,  1 t-channel line
---
>  ! Multiplicity = 3,  1 resonance,   1 logs,  1 off-shell,   1 t-channel line
103,105c84,85
<  ! Channel #20
<    tree  3 12 28
<    map  12 s_channel      24 ! W+
---
>  ! Channel #           20
>    tree  3 12 28   map  12 s_channel      24 ! W+
107,109c87,88
<  ! Channel #21
<    tree  3 12 28
<    map   3 s_channel      24 ! W+
---
>  ! Channel #           21
>    tree  3 12 28   map   3 s_channel      24 ! W+
111,113c90,91
<  ! Channel #22
<    tree  3 12 35
<    map   3 s_channel      24 ! W+
---
>  ! Channel #           22
>    tree  3 12 35   map   3 s_channel      24 ! W+
115,117c93,94
<  ! Channel #23
<    tree  3 12 35
<    map  12 s_channel      24 ! W+
---
>  ! Channel #           23
>    tree  3 12 35   map  12 s_channel      24 ! W+
120c97
<  ! Multiplicity = 4, no resonances,  1 logs,  2 off-shell,  1 t-channel line
---
>  ! Multiplicity = 4, no resonances,  1 logs,  2 off-shell,   1 t-channel line
122,127c99,102
<  ! Channel #24
<    tree  3 12 28
<    map  28 t_channel      12 ! nue
<  ! Channel #25
<    tree  3 12 35
<    map  35 t_channel      12 ! nue
---
>  ! Channel #           24
>    tree  3 12 28   map  28 t_channel      12 ! nue
>  ! Channel #           25
>    tree  3 12 35   map  35 t_channel      12 ! nue

I attached the VAMP grids, and they are hilariously different!

Changed 11 years ago by Juergen Reuter

Attachment: cc10_2.2.0.vg.tgz added

2.2.0 VAMP grids

Changed 11 years ago by Juergen Reuter

Attachment: cc10_2.1.1.vg.tgz added

2.1.1 VAMP grids

comment:6 Changed 11 years ago by Juergen Reuter

One apparent difference: in the 2.1.1 version of the grids, there are three columns for d%x while there is only one for the 2.2.0 version.

comment:7 Changed 11 years ago by kilian

The bug is in the example script:

Change

 iterations = 12:20000

to, e.g.

 iterations = 12:20000:"gw"

We agreed in a recent meeting that in the last integration pass, VAMP by default should not adapt grids or weights anymore. Otherwise, the iterations are correlated and the error estimate is incorrect. This is implemented in 2.2.0.

So, either include a warmup pass in the iterations, or explicitly state that grids and weights should be adapted.

When I make this change, and set identical compilation flags for both 2.1.1 and 2.2.0, the time is the same.

comment:8 Changed 11 years ago by kilian

... please x-check.

happened, that my 2.1.1 installation doesn't work. But I do get a 3s time estimate with 2.2.0, on my laptop.

comment:9 Changed 11 years ago by Juergen Reuter

Ok, with that I get 22s, but 2.2.0 still is a _LOT_ slower than 2.1.1: running the same process on my laptop: 2.1.1:

real 	0m16.019s
user	0m15.319s
sys	0m0.451s

2.2.0:

real	1m47.599s
user	0m58.758s
sys	1m53.466s

So this is a factor 6-7. :((( What the hell happened?

comment:10 Changed 11 years ago by Juergen Reuter

The result for 2.1.1 is without OpenMP actually. Setting the threads to 1 didn't change ANYTHING. Is it the helicity selection rules? what can it be? trying to compile without OpenMP now. Desastrous.

comment:11 Changed 11 years ago by Juergen Reuter

Switching the helicity selection off by ?helicity_selection_active = false does only increase the user time:

real	1m47.507s
user	1m40.928s
sys	1m47.482s

???

comment:12 Changed 11 years ago by Juergen Reuter

Ok, the effect is also there for 2.1.1, w\ OpenMP and helicity selection:

real	0m45.592s
user	0m44.896s
sys	0m0.467s

comment:13 Changed 11 years ago by kilian

Priority: P0P2
Status: newassigned

So, if openmp is turned off, this appears to be a factor 1-3 ...?

I'll do some proper profiling with static executables, after the new bugs (#597, #598) have been fixed. Independently, we have to carefully check whether openmp settings are respected as they should.

Version 1, edited 11 years ago by Juergen Reuter (previous) (next) (diff)

comment:14 Changed 11 years ago by Juergen Reuter

I ran the CC10 example with the following files;

model = SM
process cc10 = e1, E1 => e2, N2, u, D
compile as "blubb.exe" ()

and then

./blubb.exe cc10.sin

with the file cc10.sin as

model = SM
library = "blubb_lib"
show (library)
process cc10 = e1, E1 => e2, N2, u, D
compile 
show (library)
# We set the muon mass to zero.
mmu = 0
sqrts = 209 GeV
integrate (cc10) { iterations = 12:20000:"gw" }
n_events = 50000
simulate (cc10) 

The gprof analysis files for 2.1.1 and 2.2.0r5302 are attached.

Changed 11 years ago by Juergen Reuter

Attachment: gprof.220.gz added

gprof analysis file (gzipped) for 2.2.0r5302.

Changed 11 years ago by Juergen Reuter

Attachment: gprof.211.gz added

gprof analysis file for 2.1.1

comment:15 Changed 11 years ago by Juergen Reuter

Priority: P2P0

When comparing the examples, I figured that 2.2 is really awfully slow. Should be one of the top prios.

Changed 11 years ago by Bijan Chokoufe Nejad

Attachment: analysis.tar.gz added

Visualization of gprof via dot

comment:16 Changed 11 years ago by Bijan Chokoufe Nejad

Given the gprof files, I have visualized the profiler output via http://code.google.com/p/jrfonseca/wiki/Gprof2Dot and attached the resulting pdfs. Obviously, the helicity selection rules seem to be working fine as the number of matrix element evaluations is even slightly reduced.

                                   2.2.0                   2.1.1
calculate_amplitudes                  434 421 (19.90 %)       445 344 (25.15 %)
phs_tree_compute_x_from_momenta    13 467 051 (21.21 %)    13 805 664 (31.08 %)
process_compute_vamp_phs_factor       457 688 (19.60 %)       445 344 (21.79 %)
== mci_vamp_instance_compute_weight

One can see that these three big blocks from 2.1.1 add up to only 60.71 % in 2.2.0 instead of 78.02 % despite comparable functions calls. Whizard spends now more time in functions which are not in the red marked boxes in analysis220.pdf.

comment:17 Changed 11 years ago by kilian

Got some preliminary results: yes, I see a few redundancies, none of them dramatic, but they appear to add up. Investigating further ...

comment:18 Changed 11 years ago by kilian

Priority: P0P1

Up to r5416, the main sources of this discrepancy should be eliminated. There is still some overhead, but this should scale linearly. (I see some obvious improvements there, so I keep this ticket but can rank it down.)

comment:19 Changed 11 years ago by kilian

New task in this context: rewrite the function phs_forest_evaluate_other_channels, such that (a) OpenMP works correctly, and (b) the funny code in the first loop is eliminated (trivial).

comment:20 Changed 11 years ago by Juergen Reuter

HERA_dis.sin is an example where by eye one can see the slowness of 2.2 compared to 2.1.

comment:21 in reply to:  20 Changed 11 years ago by kilian

Priority: P1P4

Replying to jr_reuter:

HERA_dis.sin is an example where by eye one can see the slowness of 2.2 compared to 2.1.

No. In my test (gfortran 4.8.0, -g -O2, OpenMP off), 2.2.0 is actually slightly faster. The "wg" flags were missing in the integration command, again.

I write a separate ticket for those flags (#622).

comment:22 Changed 11 years ago by Juergen Reuter

Milestone: v2.2.0v2.2.1
Version: 2.1.12.2.0beta

After discussion with WK: done sufficiently enough for a X.X.0 release. More later.

comment:23 Changed 11 years ago by Juergen Reuter

Milestone: v2.2.2v2.2.3

comment:24 Changed 10 years ago by Juergen Reuter

Milestone: v2.2.3v2.2.4

Is there something to be done at the moment, or should we react to particular user demands!?

comment:25 Changed 10 years ago by kilian

That's more like a golden classics ...

Profiling the whole thing from time to time might reveal surprises.

comment:26 Changed 10 years ago by Juergen Reuter

Milestone: v2.2.5v2.2.6

comment:27 Changed 10 years ago by Juergen Reuter

Resolution: wontfix
Status: assignedclosed

Not relevant at the moment.

Note: See TracTickets for help on using tickets.