whizard is hosted by Hepforge, IPPP Durham

Opened 8 years ago

Closed 8 years ago

#763 closed defect (fixed)

mci_vamp unit tests fails with gfortran 4.7.4/4.8.3 on SL6.7

Reported by: Juergen Reuter
Owned by: kilian
Priority: P0
Milestone: v2.2.8
Component: core
Version: 2.2.7
Severity: critical
Keywords:
Cc:

Description

The mci_vamp_7, mci_vamp_8, and mci_vamp_15 unit tests fail with gfortran 4.7.4 on SL6 64bit. gfortran 4.8 and 6.0 work, and gfortran 4.7.4 on Jenkins also seemed to work in the weekly test.

Change History (19)

comment:1 Changed 8 years ago by Juergen Reuter

This is so weird: it works on 32bit. I recompiled (make clean, not make distclean) on 64bit, with the same result. mu_minus and mu_plus cannot affect the results of these tests; the following is the only code change by WK that could affect them:

if (g%f_min > g%f_max) then
   g%f_min = abs (f) * g%calls
   g%f_max = abs (f) * g%calls
else if (abs (f) * g%calls < g%f_min) then
   g%f_min = abs (f) * g%calls
else if (abs (f) * g%calls > g%f_max) then
   g%f_max = abs (f) * g%calls
end if
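For readers without the VAMP source at hand, the branch structure of this snippet can be tried in isolation. The following is a minimal sketch, not the WHIZARD code: the type toy_grid_t, its initial values, the real-valued calls field, and the sample values are all assumptions made for illustration. It assumes the range starts in the "empty" state f_min > f_max, which is what the first branch tests for.

```fortran
program f_range_demo
  implicit none
  integer, parameter :: default = selected_real_kind (15)

  ! Hypothetical stand-in for the grid type; only the fields used below.
  type :: toy_grid_t
     real(default) :: f_min = 1._default   ! f_min > f_max marks "no sample yet"
     real(default) :: f_max = 0._default
     real(default) :: calls = 1000._default
  end type toy_grid_t

  type(toy_grid_t) :: g
  real(default), dimension(4), parameter :: samples = &
       [0.5_default, -0.2_default, 0.9_default, 0.1_default]
  integer :: i

  do i = 1, size (samples)
     call update_range (g, samples(i))
     print "(A,I0,2(A,ES12.4))", "after sample ", i, &
          ": f_min =", g%f_min, "  f_max =", g%f_max
  end do

contains

  ! Same branch logic as the snippet above, applied to the toy type.
  subroutine update_range (g, f)
    type(toy_grid_t), intent(inout) :: g
    real(default), intent(in) :: f
    if (g%f_min > g%f_max) then
       ! First sample: initialize both ends of the range.
       g%f_min = abs (f) * g%calls
       g%f_max = abs (f) * g%calls
    else if (abs (f) * g%calls < g%f_min) then
       g%f_min = abs (f) * g%calls
    else if (abs (f) * g%calls > g%f_max) then
       g%f_max = abs (f) * g%calls
    end if
  end subroutine update_range

end program f_range_demo
```

With these made-up samples the range ends at f_min = 100 and f_max = 900. The update only reads f and g%calls and writes f_min and f_max, which is why it looked like an unlikely source of changed integration results.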

comment:2 Changed 8 years ago by Juergen Reuter

This is a Heisenbug. Tested on Jenkins with the 4.7.4 installed there: it works. I removed the changes in the code snippet above, and it still fails on the DESY computers. What is this?

comment:3 Changed 8 years ago by Juergen Reuter

Oha, I noticed one interesting difference: the extended-precision build directory on the DESY machine (64bit) is configured with --enable-fc-openmp. I am still testing, but most probably that is the issue.

comment:4 Changed 8 years ago by Juergen Reuter

Next update: with OpenMP it works on Jenkins with 4.7.4, while it still fails without OpenMP on the DESY computers. I have the feeling that something is mis-compiled on the DESY computers. Or is it the strange behaviour of the SL linkers?

comment:6 Changed 8 years ago by Juergen Reuter

Traced it down on the DESY computers: it clearly is commit r7359, as expected. Any comments or ideas? I will probably try to recompile gfortran 4.7.4 on the SL6.7 DESY computers to check whether that could be the problem. I will also try to compile gcc 4.7.4 on the Mac to see what happens there.

comment:7 Changed 8 years ago by Juergen Reuter

I cannot test gcc 4.7.4 on my Mac, as that compiler no longer builds with Xcode 7.1 and clang 7.1.

comment:8 Changed 8 years ago by Juergen Reuter

I fear that if there is no resolution I will revert commit r7359. Unfortunately, this is so deep in the WHIZARD architecture that every tiny change takes 20-30 minutes to re-verify.

comment:9 Changed 8 years ago by kilian

I was just inspecting the record of this ticket - any hint what is not working on those machines? I can't check 4.7 anymore, so if I had to give any input, it would have to be the test logs.

If there is no resolution, we can choose to either drop 4.7.4 support, or drop NLO support. In the current state, event generation with negative weights #259 is broken since unweighting is not supported. I've essentially fixed this (test results yesterday are encouraging), but the commit relies, of course, on r7359.

comment:10 Changed 8 years ago by Juergen Reuter

I cannot add anything at the moment, it has been compiling for 25 minutes. The unit test suite is again a pain in the ass if something is broken: one still has to wait for ages until everything works. A separate binary for each unit test would probably be better than these whizard_ut and libwhizard_ut monsters.

Last edited 8 years ago by Juergen Reuter

comment:11 in reply to:  10 Changed 8 years ago by kilian

Replying to jr_reuter:

I cannot add anything at the moment, it has been compiling for 25 minutes. The unit test suite is again a pain in the ass if something is broken: one still has to wait for ages until everything works. A separate binary for each unit test would probably be better than these whizard_ut and libwhizard_ut monsters.

No, that would not improve the situation because the bottleneck is the inefficient .mod file reader of gfortran. The number of times those files have to be read would not decrease if the UT executable was split. With NAG, compilation times are an order of magnitude less - but that doesn't help in the present case, of course.

comment:12 Changed 8 years ago by Juergen Reuter

Here is the change in the results:

$ diff mci_vamp_7.out ../ref-output/mci_vamp_7.ref 
61,63c61,63
<    Integral             =  9.9954598787E-01
<    Error                =  3.7884632892E-03
<    Efficiency           =  6.9167224545E-01
---
>    Integral             =  9.9977448991E-01
>    Error                =  3.7774467176E-03
>    Efficiency           =  6.9313634831E-01
100,102c100,102
<      1 1000  1.0083422237E+00  3.4719525088E-03  7.2084874023E-01
<      2 1000  9.9954598787E-01  3.7884632892E-03  6.9167224545E-01
<  MD5 sum (including results) = '7006675E60D5DB2D2E6769024911B113'
---
>      1 1000  1.0088530962E+00  3.4375735070E-03  7.2121395531E-01
>      2 1000  9.9977448991E-01  3.7774467176E-03  6.9313634831E-01
>  MD5 sum (including results) = '987DD23306FA5A6C823E08880EEF263C'

and

$ diff mci_vamp_8.out ../ref-output/mci_vamp_8.ref 
14,16c14,16
<    Integral             =  9.9878276150E-01
<    Error                =  4.2051292906E-03
<    Efficiency           =  7.7816306414E-01
---
>    Integral             =  1.0001833288E+00
>    Error                =  4.1037070130E-03
>    Efficiency           =  7.7917077207E-01
38,41c38,41
<      1 1000  1.0170493564E+00  6.9020855522E-03  5.6576519941E-01
<      2 999  9.9652068811E-01  5.4233200344E-03  7.3669900121E-01
<      3 998  9.9878276150E-01  4.2051292906E-03  7.7816306414E-01
<  MD5 sum (including results) = '4010BE579A46EDD42D23D32BBF7EADAF'
---
>      1 1000  1.0177061925E+00  6.8751470140E-03  5.6613058481E-01
>      2 999  9.9648134683E-01  5.4172752641E-03  7.3921847436E-01
>      3 998  1.0001833288E+00  4.1037070130E-03  7.7917077207E-01
>  MD5 sum (including results) = 'FC4313BD22E0F2D97A2AD0F059622529'
45,46c45,46
<    Integrand =  2.122673440110E-01
<    Weight    =  5.882328784778E+00
---
>    Integrand =  2.122712286666E-01
>    Weight    =  5.882916885089E+00
60,62c60,62
<    integral =  9.9878276150E-01
<    error    =  4.2051292906E-03
<    eff.     =  7.7816306414E-01
---
>    integral =  1.0001833288E+00
>    error    =  4.1037070130E-03
>    eff.     =  7.7917077207E-01
64,65c64,65
<      1  7.33641E-01
<      2  2.66359E-01
---
>      1  7.33668E-01
>      2  2.66332E-01
70,72c70,72
<    Integral             =  9.9806498496E-01
<    Error                =  3.4123468849E-03
<    Efficiency           =  8.2282714049E-01
---
>    Integral             =  9.9807023470E-01
>    Error                =  3.4080042818E-03
>    Efficiency           =  8.2267223019E-01
94,96c94,96
<      1 1000  1.0170493564E+00  6.9020855522E-03  5.6576519941E-01
<      2 999  9.9652068811E-01  5.4233200344E-03  7.3669900121E-01
<      3 998  9.9878276150E-01  4.2051292906E-03  7.7816306414E-01
---
>      1 1000  1.0177061925E+00  6.8751470140E-03  5.6613058481E-01
>      2 999  9.9648134683E-01  5.4172752641E-03  7.3921847436E-01
>      3 998  1.0001833288E+00  4.1037070130E-03  7.7917077207E-01
104,107c104,107
<      1 998  1.0024229883E+00  3.7564726958E-03  8.1849531743E-01
<      2 998  1.0005764562E+00  3.4557099574E-03  7.8759124509E-01
<      3 998  9.9806498496E-01  3.4123468849E-03  8.2282714049E-01
<  MD5 sum (including results) = '72007572789C6F6CD603E503355E7330'
---
>      1 998  1.0032296438E+00  3.6593386136E-03  8.2530286641E-01
>      2 998  1.0006689706E+00  3.4577861120E-03  7.8730270750E-01
>      3 998  9.9807023470E-01  3.4080042818E-03  8.2267223019E-01
>  MD5 sum (including results) = '52DFD8BF6652BA798717D5474340F2C1'
111,112c111,112
<    Integrand =  3.017442719892E-01
<    Weight    =  1.246951474788E+00
---
>    Integrand =  3.017194415775E-01
>    Weight    =  1.247184511648E+00
126,128c126,128
<    integral =  9.9806498496E-01
<    error    =  3.4123468849E-03
<    eff.     =  8.2282714049E-01
---
>    integral =  9.9807023470E-01
>    error    =  3.4080042818E-03
>    eff.     =  8.2267223019E-01
130,131c130,131
<      1  7.33641E-01
<      2  2.66359E-01
---
>      1  7.33668E-01
>      2  2.66332E-01
136,138c136,138
<    Integral             =  1.0028272784E+00
<    Error                =  3.3553855716E-03
<    Efficiency           =  8.2416912193E-01
---
>    Integral             =  1.0038188157E+00
>    Error                =  3.2285965190E-03
>    Efficiency           =  8.2482448269E-01
160,162c160,162
<      1 1000  1.0170493564E+00  6.9020855522E-03  5.6576519941E-01
<      2 999  9.9652068811E-01  5.4233200344E-03  7.3669900121E-01
<      3 998  9.9878276150E-01  4.2051292906E-03  7.7816306414E-01
---
>      1 1000  1.0177061925E+00  6.8751470140E-03  5.6613058481E-01
>      2 999  9.9648134683E-01  5.4172752641E-03  7.3921847436E-01
>      3 998  1.0001833288E+00  4.1037070130E-03  7.7917077207E-01
170,172c170,172
<      1 998  1.0024229883E+00  3.7564726958E-03  8.1849531743E-01
<      2 998  1.0005764562E+00  3.4557099574E-03  7.8759124509E-01
<      3 998  9.9806498496E-01  3.4123468849E-03  8.2282714049E-01
---
>      1 998  1.0032296438E+00  3.6593386136E-03  8.2530286641E-01
>      2 998  1.0006689706E+00  3.4577861120E-03  7.8730270750E-01
>      3 998  9.9807023470E-01  3.4080042818E-03  8.2267223019E-01
180,183c180,183
<      1 998  1.0041722704E+00  3.3719659606E-03  8.2632373322E-01
<      2 998  1.0023268375E+00  3.1610397637E-03  8.2376051973E-01
<      3 998  1.0028272784E+00  3.3553855716E-03  8.2416912193E-01
<  MD5 sum (including results) = 'D13472F0E400812ED2FD51C2F4044929'
---
>      1 998  1.0042238250E+00  3.3543815528E-03  8.2620616000E-01
>      2 998  1.0022973122E+00  3.1550223134E-03  8.2357720873E-01
>      3 998  1.0038188157E+00  3.2285965190E-03  8.2482448269E-01
>  MD5 sum (including results) = 'D2AC1DEBDEACAC93ADF5A2F2158ECAD5'
187,188c187,188
<    Integrand =  3.466447742353E-01
<    Weight    =  1.083663080227E+00
---
>    Integrand =  3.466254182468E-01
>    Weight    =  1.083835666245E+00
202,204c202,204
<    integral =  1.0028272784E+00
<    error    =  3.3553855716E-03
<    eff.     =  8.2416912193E-01
---
>    integral =  1.0038188157E+00
>    error    =  3.2285965190E-03
>    eff.     =  8.2482448269E-01
206,207c206,207
<      1  7.33641E-01
<      2  2.66359E-01
---
>      1  7.33668E-01
>      2  2.66332E-01

and

$ diff mci_vamp_15.out ../ref-output/mci_vamp_15.ref 
20,22c20,22
<    Integral             =  1.0028272784E+00
<    Error                =  3.3553855716E-03
<    Efficiency           =  8.2416912193E-01
---
>    Integral             =  1.0038188157E+00
>    Error                =  3.2285965190E-03
>    Efficiency           =  8.2482448269E-01
44,46c44,46
<      1 1000  1.0170493564E+00  6.9020855522E-03  5.6576519941E-01
<      2 999  9.9652068811E-01  5.4233200344E-03  7.3669900121E-01
<      3 998  9.9878276150E-01  4.2051292906E-03  7.7816306414E-01
---
>      1 1000  1.0177061925E+00  6.8751470140E-03  5.6613058481E-01
>      2 999  9.9648134683E-01  5.4172752641E-03  7.3921847436E-01
>      3 998  1.0001833288E+00  4.1037070130E-03  7.7917077207E-01
54,56c54,56
<      1 998  1.0024229883E+00  3.7564726958E-03  8.1849531743E-01
<      2 998  1.0005764562E+00  3.4557099574E-03  7.8759124509E-01
<      3 998  9.9806498496E-01  3.4123468849E-03  8.2282714049E-01
---
>      1 998  1.0032296438E+00  3.6593386136E-03  8.2530286641E-01
>      2 998  1.0006689706E+00  3.4577861120E-03  7.8730270750E-01
>      3 998  9.9807023470E-01  3.4080042818E-03  8.2267223019E-01
64,67c64,67
<      1 998  1.0041722704E+00  3.3719659606E-03  8.2632373322E-01
<      2 998  1.0023268375E+00  3.1610397637E-03  8.2376051973E-01
<      3 998  1.0028272784E+00  3.3553855716E-03  8.2416912193E-01
<  MD5 sum (including results) = 'D13472F0E400812ED2FD51C2F4044929'
---
>      1 998  1.0042238250E+00  3.3543815528E-03  8.2620616000E-01
>      2 998  1.0022973122E+00  3.1550223134E-03  8.2357720873E-01
>      3 998  1.0038188157E+00  3.2285965190E-03  8.2482448269E-01
>  MD5 sum (including results) = 'D2AC1DEBDEACAC93ADF5A2F2158ECAD5'
72,74c72,74
<  [vamp]    1     1000   0.0000E+00(0.00E+00)  1.017049E+00(6.90E-03)  0.0 0.000
<  [vamp]    2      999   0.0000E+00(0.00E+00)  9.965207E-01(5.42E-03)  0.0 0.000
<  [vamp]    3      998   0.0000E+00(0.00E+00)  9.987828E-01(4.21E-03)  0.0 0.000
---
>  [vamp]    1     1000   0.0000E+00(0.00E+00)  1.017706E+00(6.88E-03)  0.0 0.000
>  [vamp]    2      999   0.0000E+00(0.00E+00)  9.964813E-01(5.42E-03)  0.0 0.000
>  [vamp]    3      998   0.0000E+00(0.00E+00)  1.000183E+00(4.10E-03)  0.0 0.000
80,82c80,82
<  chan#001  1      500   1.7529E+00(7.34E-03)  1.752880E+00(7.34E-03)  0.0 0.974
<  chan#001  2      681   1.2397E+00(5.97E-03)  1.239680E+00(5.97E-03)  0.0 0.913
<  chan#001  3      732   1.1362E+00(4.19E-03)  1.136206E+00(4.19E-03)  0.0 0.881
---
>  chan#001  1      500   1.7542E+00(7.24E-03)  1.754194E+00(7.24E-03)  0.0 0.975
>  chan#001  2      681   1.2396E+00(5.96E-03)  1.239574E+00(5.96E-03)  0.0 0.918
>  chan#001  3      732   1.1381E+00(4.00E-03)  1.138094E+00(4.00E-03)  0.0 0.882
87,88c87,88
<  chan#002  2      318   4.7579E-01(1.12E-02)  4.757926E-01(1.12E-02)  0.0 0.354
<  chan#002  3      266   6.2061E-01(1.08E-02)  6.206118E-01(1.08E-02)  0.0 0.490
---
>  chan#002  2      318   4.7590E-01(1.13E-02)  4.758967E-01(1.13E-02)  0.0 0.354
>  chan#002  3      266   6.2067E-01(1.08E-02)  6.206711E-01(1.08E-02)  0.0 0.490
94,96c94,96
<  [vamp]    1      998   0.0000E+00(0.00E+00)  1.002423E+00(3.76E-03)  0.0 0.000
<  [vamp]    2      998   0.0000E+00(0.00E+00)  1.000576E+00(3.46E-03)  0.0 0.000
<  [vamp]    3      998   0.0000E+00(0.00E+00)  9.980650E-01(3.41E-03)  0.0 0.000
---
>  [vamp]    1      998   0.0000E+00(0.00E+00)  1.003230E+00(3.66E-03)  0.0 0.000
>  [vamp]    2      998   0.0000E+00(0.00E+00)  1.000669E+00(3.46E-03)  0.0 0.000
>  [vamp]    3      998   0.0000E+00(0.00E+00)  9.980702E-01(3.41E-03)  0.0 0.000
102,104c102,104
<  chan#001  1      732   1.1229E+00(3.80E-03)  1.122894E+00(3.80E-03)  0.0 0.913
<  chan#001  2      732   1.0938E+00(2.75E-03)  1.093787E+00(2.75E-03)  0.0 0.854
<  chan#001  3      732   1.0773E+00(3.62E-03)  1.077316E+00(3.62E-03)  0.0 0.886
---
>  chan#001  1      732   1.1239E+00(3.61E-03)  1.123946E+00(3.61E-03)  0.0 0.923
>  chan#001  2      732   1.0938E+00(2.75E-03)  1.093848E+00(2.75E-03)  0.0 0.854
>  chan#001  3      732   1.0773E+00(3.61E-03)  1.077268E+00(3.61E-03)  0.0 0.886
108,110c108,110
<  chan#002  1      266   6.7090E-01(9.46E-03)  6.709005E-01(9.46E-03)  0.0 0.555
<  chan#002  2      266   7.4407E-01(1.05E-02)  7.440732E-01(1.05E-02)  0.0 0.599
<  chan#002  3      266   7.7998E-01(8.04E-03)  7.799765E-01(8.04E-03)  0.0 0.647
---
>  chan#002  1      266   6.7103E-01(9.47E-03)  6.710334E-01(9.47E-03)  0.0 0.555
>  chan#002  2      266   7.4425E-01(1.05E-02)  7.442503E-01(1.05E-02)  0.0 0.599
>  chan#002  3      266   7.8013E-01(8.04E-03)  7.801289E-01(8.04E-03)  0.0 0.647
116,118c116,118
<  [vamp]    1      998   0.0000E+00(0.00E+00)  1.004172E+00(3.37E-03)  0.0 0.000
<  [vamp]    2      998   0.0000E+00(0.00E+00)  1.002327E+00(3.16E-03)  0.0 0.000
<  [vamp]    3      998   0.0000E+00(0.00E+00)  1.002827E+00(3.36E-03)  0.0 0.000
---
>  [vamp]    1      998   0.0000E+00(0.00E+00)  1.004224E+00(3.35E-03)  0.0 0.000
>  [vamp]    2      998   0.0000E+00(0.00E+00)  1.002297E+00(3.16E-03)  0.0 0.000
>  [vamp]    3      998   0.0000E+00(0.00E+00)  1.003819E+00(3.23E-03)  0.0 0.000
124,126c124,126
<  chan#001  1      732   1.0832E+00(3.54E-03)  1.083220E+00(3.54E-03)  0.0 0.890
<  chan#001  2      732   1.0852E+00(3.30E-03)  1.085204E+00(3.30E-03)  0.0 0.892
<  chan#001  3      732   1.0807E+00(3.72E-03)  1.080739E+00(3.72E-03)  0.0 0.888
---
>  chan#001  1      732   1.0832E+00(3.51E-03)  1.083234E+00(3.51E-03)  0.0 0.890
>  chan#001  2      732   1.0851E+00(3.29E-03)  1.085110E+00(3.29E-03)  0.0 0.891
>  chan#001  3      732   1.0820E+00(3.51E-03)  1.082037E+00(3.51E-03)  0.0 0.889
130,132c130,132
<  chan#002  1      266   7.8664E-01(8.08E-03)  7.866428E-01(8.08E-03)  0.0 0.650
<  chan#002  2      266   7.7426E-01(7.61E-03)  7.742583E-01(7.61E-03)  0.0 0.637
<  chan#002  3      266   7.8842E-01(7.32E-03)  7.884231E-01(7.32E-03)  0.0 0.648
---
>  chan#002  1      266   7.8680E-01(8.08E-03)  7.867974E-01(8.08E-03)  0.0 0.650
>  chan#002  2      266   7.7441E-01(7.61E-03)  7.744080E-01(7.61E-03)  0.0 0.637
>  chan#002  3      266   7.8857E-01(7.32E-03)  7.885731E-01(7.32E-03)  0.0 0.648

comment:13 Changed 8 years ago by kilian

OK, so it's not a crash, it's numerical noise. This happens only with extended prec, right?

Maybe some register/RAM storage sequence is affected by those code changes. Maybe this causes a precision loss in 4.7 but not in later versions.

That would require tedious print-statement debugging. If the problem is tied just to that particular configuration, my recommendation would be to update the footnote to manual Sec. 2.2.4 (which already mentions gfortran 4.7 series) and close the ticket as wontfix.
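To put the size of the shift in perspective, one can compare it with the errors quoted in the test output. The sketch below does this for the final mci_vamp_7 result, using only the numbers from the diff in comment:12; the program and variable names are made up for this illustration. Both runs presumably use the same random-number sequence, so they are strongly correlated, and the ratio is meant only as a scale comparison, not as a significance test.

```fortran
program compare_integrals
  implicit none
  integer, parameter :: dp = selected_real_kind (15)

  ! Final integral and error of mci_vamp_7, copied from the diff in comment:12.
  ! "out" is the failing run, "ref" is the reference output.
  real(dp), parameter :: int_out = 9.9954598787E-01_dp
  real(dp), parameter :: err_out = 3.7884632892E-03_dp
  real(dp), parameter :: int_ref = 9.9977448991E-01_dp
  real(dp), parameter :: err_ref = 3.7774467176E-03_dp
  real(dp) :: shift, ratio

  shift = abs (int_out - int_ref)
  ! Quadrature sum of the two quoted errors, used only as a scale.
  ratio = shift / sqrt (err_out**2 + err_ref**2)

  print "(A,ES10.3)", "absolute shift        = ", shift
  print "(A,ES10.3)", "shift / quoted errors = ", ratio
end program compare_integrals
```

The shift is a few times 1E-04 and amounts to a few percent of the quoted integration error. That is far too large for a last-digit rounding difference in a single operation, yet small on the scale of the Monte Carlo uncertainty, which fits the picture of a small numerical difference propagating through the grid adaptation.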

comment:14 Changed 8 years ago by Juergen Reuter

Probably correct. And the reason why quadruple precision is not affected is that it works via the libquadmath implementation anyway. Unfortunately, it is not that easy, as I would lose my whole testing setup at DESY. Or it would be easy, as I would switch to Ubuntu machines and the preinstalled/precompiled 4.8.X. I will do a recompilation on the DESY system, but that old SL6.X system is a pain anyhow. Funnily enough, it works on the 32bit machine and on SL5.

comment:15 Changed 8 years ago by Juergen Reuter

AHA!!!! WK, you were right! It is not really the presence of the two components mu_plus and mu_minus in the type vamp_grid_t, but setting those components later on. Getting closer to the root of the problem.

comment:16 Changed 8 years ago by Juergen Reuter

Summary: mci_vamp unit tests fails with gfortran 4.7.4 → mci_vamp unit tests fails with gfortran 4.7.4/4.8.3 on SL6.7

Actually that one is the culprit:

g%mu_plus(2) = g%mu_plus(2) * g%dv2g
if (g%mu_plus(2) < eps * max (g%mu_plus(1)**2, 1._default)) then
   g%mu_plus(2) = eps * max (g%mu_plus(1)**2, 1._default)
end if
g%mu_minus(2) = g%mu_minus(2) * g%dv2g
if (g%mu_minus(2) < eps * max (g%mu_minus(1)**2, 1._default)) then
   g%mu_minus(2) = eps * max (g%mu_minus(1)**2, 1._default)
end if

Unfortunately, moving that to the end of the subroutine doesn't help.

comment:17 Changed 8 years ago by kilian

That means: if you comment out those lines, the vamp tests agree (except for the new ones, which wouldn't work then), and with the lines as given, they disagree.

There may be hidden temporaries involved. For instance, since g%dv2g is referenced several times (I think this is an array), it could be copied.

One suggestion: instead of using array notation, what about expanding in explicit loops, at least for the extra code? If the elements are accessed one at a time, there might be no temporary generated.

comment:18 Changed 8 years ago by kilian

... no, those are scalars. Forget the part about loops. It may be registers where temporaries are stored (or not).

comment:19 Changed 8 years ago by Juergen Reuter

Resolution: fixed
Status: new → closed

Solved by wrapping the culprit lines in a negative weights if-clause. Still have to recheck some of the files and all the different precisions. Closing for now.
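The committed change is not shown in this ticket. As a rough sketch of what such a guard could look like: the flag negative_weights, the type toy_grid_t, the initial values, and the choice of eps below are all hypothetical and chosen only to make the example self-contained. Only the mu_plus half is shown; mu_minus would be treated the same way.

```fortran
program guard_demo
  implicit none
  integer, parameter :: default = selected_real_kind (15)

  ! Hypothetical stand-in for vamp_grid_t with only the fields used here.
  type :: toy_grid_t
     real(default), dimension(2) :: mu_plus = [0.5_default, 1.e-20_default]
     real(default) :: dv2g = 2._default
  end type toy_grid_t

  real(default), parameter :: eps = epsilon (1._default)   ! assumed value
  type(toy_grid_t) :: g

  call rescale (g, .false.)
  print *, "guard off:", g%mu_plus(2)

  call rescale (g, .true.)
  print *, "guard on: ", g%mu_plus(2)

contains

  subroutine rescale (g, negative_weights)
    type(toy_grid_t), intent(inout) :: g
    logical, intent(in) :: negative_weights   ! hypothetical switch
    ! The lines from comment:16 run only when negative weights are requested.
    if (negative_weights) then
       g%mu_plus(2) = g%mu_plus(2) * g%dv2g
       if (g%mu_plus(2) < eps * max (g%mu_plus(1)**2, 1._default)) then
          g%mu_plus(2) = eps * max (g%mu_plus(1)**2, 1._default)
       end if
    end if
  end subroutine rescale

end program guard_demo
```

With the guard off, mu_plus(2) keeps its initial value, so runs that do not use negative weights never execute the rescaling. The guard sidesteps the discrepancy for the existing tests; it does not explain why those lines changed the results in the first place.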
