Opened 8 years ago
Closed 8 years ago
#763 closed defect (fixed)
mci_vamp unit tests fail with gfortran 4.7.4/4.8.3 on SL6.7
Reported by: | Juergen Reuter | Owned by: | kilian |
---|---|---|---|
Priority: | P0 | Milestone: | v2.2.8 |
Component: | core | Version: | 2.2.7 |
Severity: | critical | Keywords: | |
Cc: |
Description
mci_vamp_7/8/15 unit tests fail with gfortran 4.7.4 on SL6 64bit. gfortran 4.8 and 6.0 are working, and the gfortran 4.7.4 on Jenkins also seemed to work in the weekly test.
Change History (19)
comment:1 Changed 8 years ago by
comment:2 Changed 8 years ago by
This is a Heisenbug. Tested on Jenkins with the 4.7.4 that is installed there, working. I removed the changes in the code snippet above, and it is still not working on the DESY computers. What is this?
comment:3 Changed 8 years ago by
Oh, I noticed one interesting difference, namely that the extended directory on the DESY machine (64bit) is configured with --enable-fc-openmp. Doing some testing, but most probably that is the issue.
comment:4 Changed 8 years ago by
More information: with OpenMP it works on Jenkins with 4.7.4, while it still fails without OpenMP on the DESY computers. Somehow I have the feeling that something is mis-compiled on the DESY computers. Or is it the strange behaviour of the SL linkers?
comment:5 Changed 8 years ago by
More information: with OpenMP it works on Jenkins with 4.7.4, while it still fails without OpenMP on the DESY computers. Somehow I have the feeling that something is mis-compiled on the DESY computers. Or is it the strange behaviour of the SL linkers?
comment:6 Changed 8 years ago by
Traced it down on the DESY computers: it clearly is commit r7359, as expected. Any comments or ideas? I will probably try to recompile gfortran 4.7.4 on the SL6.7 DESY computers to check whether that could be the problem. I will also try to compile gcc 4.7.4 on the Mac to see what happens there.
comment:7 Changed 8 years ago by
I cannot test gcc 4.7.4 on my Mac as the compiler doesn't compile anymore with Xcode 7.1 and clang 7.1.
comment:8 Changed 8 years ago by
I fear that if there is no resolution I will revert commit r7359. Unfortunately, this is so deep in the WHIZARD architecture that every tiny change takes 20-30 minutes to reconfirm.
comment:9 Changed 8 years ago by
I was just inspecting the record of this ticket - any hint what is not working on those machines? I can't check 4.7 anymore, so if I had to give any input, it would have to be the test logs.
If there is no resolution, we can choose to either drop 4.7.4 support, or drop NLO support. In the current state, event generation with negative weights #259 is broken since unweighting is not supported. I've essentially fixed this (test results yesterday are encouraging), but the commit relies, of course, on r7359.
comment:10 follow-up: 11 Changed 8 years ago by
I cannot add anything at the moment, it has been compiling for 25 minutes. The unit test suite is again a pain in the ass if something is broken: one still has to wait for ages until everything works. Probably a separate binary for each unit test would be better than these whizard_ut and libwhizard_ut monsters.
comment:11 Changed 8 years ago by
Replying to jr_reuter:
> I cannot add anything at the moment, it has been compiling for 25 minutes. The unit test suite is again a pain in the ass if something is broken: one still has to wait for ages until everything works. Probably a separate binary for each unit test would be better than these whizard_ut and libwhizard_ut monsters.
No, that would not improve the situation because the bottleneck is the inefficient .mod file reader of gfortran. The number of times those files have to be read would not decrease if the UT executable was split. With NAG, compilation times are an order of magnitude less - but that doesn't help in the present case, of course.
comment:12 Changed 8 years ago by
Here is the change in the results:
    $ diff mci_vamp_7.out ../ref-output/mci_vamp_7.ref
    61,63c61,63
    < Integral = 9.9954598787E-01
    < Error = 3.7884632892E-03
    < Efficiency = 6.9167224545E-01
    ---
    > Integral = 9.9977448991E-01
    > Error = 3.7774467176E-03
    > Efficiency = 6.9313634831E-01
    100,102c100,102
    < 1 1000 1.0083422237E+00 3.4719525088E-03 7.2084874023E-01
    < 2 1000 9.9954598787E-01 3.7884632892E-03 6.9167224545E-01
    < MD5 sum (including results) = '7006675E60D5DB2D2E6769024911B113'
    ---
    > 1 1000 1.0088530962E+00 3.4375735070E-03 7.2121395531E-01
    > 2 1000 9.9977448991E-01 3.7774467176E-03 6.9313634831E-01
    > MD5 sum (including results) = '987DD23306FA5A6C823E08880EEF263C'
and
    $ diff mci_vamp_8.out ../ref-output/mci_vamp_8.ref
    14,16c14,16
    < Integral = 9.9878276150E-01
    < Error = 4.2051292906E-03
    < Efficiency = 7.7816306414E-01
    ---
    > Integral = 1.0001833288E+00
    > Error = 4.1037070130E-03
    > Efficiency = 7.7917077207E-01
    38,41c38,41
    < 1 1000 1.0170493564E+00 6.9020855522E-03 5.6576519941E-01
    < 2 999 9.9652068811E-01 5.4233200344E-03 7.3669900121E-01
    < 3 998 9.9878276150E-01 4.2051292906E-03 7.7816306414E-01
    < MD5 sum (including results) = '4010BE579A46EDD42D23D32BBF7EADAF'
    ---
    > 1 1000 1.0177061925E+00 6.8751470140E-03 5.6613058481E-01
    > 2 999 9.9648134683E-01 5.4172752641E-03 7.3921847436E-01
    > 3 998 1.0001833288E+00 4.1037070130E-03 7.7917077207E-01
    > MD5 sum (including results) = 'FC4313BD22E0F2D97A2AD0F059622529'
    45,46c45,46
    < Integrand = 2.122673440110E-01
    < Weight = 5.882328784778E+00
    ---
    > Integrand = 2.122712286666E-01
    > Weight = 5.882916885089E+00
    60,62c60,62
    < integral = 9.9878276150E-01
    < error = 4.2051292906E-03
    < eff. = 7.7816306414E-01
    ---
    > integral = 1.0001833288E+00
    > error = 4.1037070130E-03
    > eff. = 7.7917077207E-01
    64,65c64,65
    < 1 7.33641E-01
    < 2 2.66359E-01
    ---
    > 1 7.33668E-01
    > 2 2.66332E-01
    70,72c70,72
    < Integral = 9.9806498496E-01
    < Error = 3.4123468849E-03
    < Efficiency = 8.2282714049E-01
    ---
    > Integral = 9.9807023470E-01
    > Error = 3.4080042818E-03
    > Efficiency = 8.2267223019E-01
    94,96c94,96
    < 1 1000 1.0170493564E+00 6.9020855522E-03 5.6576519941E-01
    < 2 999 9.9652068811E-01 5.4233200344E-03 7.3669900121E-01
    < 3 998 9.9878276150E-01 4.2051292906E-03 7.7816306414E-01
    ---
    > 1 1000 1.0177061925E+00 6.8751470140E-03 5.6613058481E-01
    > 2 999 9.9648134683E-01 5.4172752641E-03 7.3921847436E-01
    > 3 998 1.0001833288E+00 4.1037070130E-03 7.7917077207E-01
    104,107c104,107
    < 1 998 1.0024229883E+00 3.7564726958E-03 8.1849531743E-01
    < 2 998 1.0005764562E+00 3.4557099574E-03 7.8759124509E-01
    < 3 998 9.9806498496E-01 3.4123468849E-03 8.2282714049E-01
    < MD5 sum (including results) = '72007572789C6F6CD603E503355E7330'
    ---
    > 1 998 1.0032296438E+00 3.6593386136E-03 8.2530286641E-01
    > 2 998 1.0006689706E+00 3.4577861120E-03 7.8730270750E-01
    > 3 998 9.9807023470E-01 3.4080042818E-03 8.2267223019E-01
    > MD5 sum (including results) = '52DFD8BF6652BA798717D5474340F2C1'
    111,112c111,112
    < Integrand = 3.017442719892E-01
    < Weight = 1.246951474788E+00
    ---
    > Integrand = 3.017194415775E-01
    > Weight = 1.247184511648E+00
    126,128c126,128
    < integral = 9.9806498496E-01
    < error = 3.4123468849E-03
    < eff. = 8.2282714049E-01
    ---
    > integral = 9.9807023470E-01
    > error = 3.4080042818E-03
    > eff. = 8.2267223019E-01
    130,131c130,131
    < 1 7.33641E-01
    < 2 2.66359E-01
    ---
    > 1 7.33668E-01
    > 2 2.66332E-01
    136,138c136,138
    < Integral = 1.0028272784E+00
    < Error = 3.3553855716E-03
    < Efficiency = 8.2416912193E-01
    ---
    > Integral = 1.0038188157E+00
    > Error = 3.2285965190E-03
    > Efficiency = 8.2482448269E-01
    160,162c160,162
    < 1 1000 1.0170493564E+00 6.9020855522E-03 5.6576519941E-01
    < 2 999 9.9652068811E-01 5.4233200344E-03 7.3669900121E-01
    < 3 998 9.9878276150E-01 4.2051292906E-03 7.7816306414E-01
    ---
    > 1 1000 1.0177061925E+00 6.8751470140E-03 5.6613058481E-01
    > 2 999 9.9648134683E-01 5.4172752641E-03 7.3921847436E-01
    > 3 998 1.0001833288E+00 4.1037070130E-03 7.7917077207E-01
    170,172c170,172
    < 1 998 1.0024229883E+00 3.7564726958E-03 8.1849531743E-01
    < 2 998 1.0005764562E+00 3.4557099574E-03 7.8759124509E-01
    < 3 998 9.9806498496E-01 3.4123468849E-03 8.2282714049E-01
    ---
    > 1 998 1.0032296438E+00 3.6593386136E-03 8.2530286641E-01
    > 2 998 1.0006689706E+00 3.4577861120E-03 7.8730270750E-01
    > 3 998 9.9807023470E-01 3.4080042818E-03 8.2267223019E-01
    180,183c180,183
    < 1 998 1.0041722704E+00 3.3719659606E-03 8.2632373322E-01
    < 2 998 1.0023268375E+00 3.1610397637E-03 8.2376051973E-01
    < 3 998 1.0028272784E+00 3.3553855716E-03 8.2416912193E-01
    < MD5 sum (including results) = 'D13472F0E400812ED2FD51C2F4044929'
    ---
    > 1 998 1.0042238250E+00 3.3543815528E-03 8.2620616000E-01
    > 2 998 1.0022973122E+00 3.1550223134E-03 8.2357720873E-01
    > 3 998 1.0038188157E+00 3.2285965190E-03 8.2482448269E-01
    > MD5 sum (including results) = 'D2AC1DEBDEACAC93ADF5A2F2158ECAD5'
    187,188c187,188
    < Integrand = 3.466447742353E-01
    < Weight = 1.083663080227E+00
    ---
    > Integrand = 3.466254182468E-01
    > Weight = 1.083835666245E+00
    202,204c202,204
    < integral = 1.0028272784E+00
    < error = 3.3553855716E-03
    < eff. = 8.2416912193E-01
    ---
    > integral = 1.0038188157E+00
    > error = 3.2285965190E-03
    > eff. = 8.2482448269E-01
    206,207c206,207
    < 1 7.33641E-01
    < 2 2.66359E-01
    ---
    > 1 7.33668E-01
    > 2 2.66332E-01
and
    $ diff mci_vamp_15.out ../ref-output/mci_vamp_15.ref
    20,22c20,22
    < Integral = 1.0028272784E+00
    < Error = 3.3553855716E-03
    < Efficiency = 8.2416912193E-01
    ---
    > Integral = 1.0038188157E+00
    > Error = 3.2285965190E-03
    > Efficiency = 8.2482448269E-01
    44,46c44,46
    < 1 1000 1.0170493564E+00 6.9020855522E-03 5.6576519941E-01
    < 2 999 9.9652068811E-01 5.4233200344E-03 7.3669900121E-01
    < 3 998 9.9878276150E-01 4.2051292906E-03 7.7816306414E-01
    ---
    > 1 1000 1.0177061925E+00 6.8751470140E-03 5.6613058481E-01
    > 2 999 9.9648134683E-01 5.4172752641E-03 7.3921847436E-01
    > 3 998 1.0001833288E+00 4.1037070130E-03 7.7917077207E-01
    54,56c54,56
    < 1 998 1.0024229883E+00 3.7564726958E-03 8.1849531743E-01
    < 2 998 1.0005764562E+00 3.4557099574E-03 7.8759124509E-01
    < 3 998 9.9806498496E-01 3.4123468849E-03 8.2282714049E-01
    ---
    > 1 998 1.0032296438E+00 3.6593386136E-03 8.2530286641E-01
    > 2 998 1.0006689706E+00 3.4577861120E-03 7.8730270750E-01
    > 3 998 9.9807023470E-01 3.4080042818E-03 8.2267223019E-01
    64,67c64,67
    < 1 998 1.0041722704E+00 3.3719659606E-03 8.2632373322E-01
    < 2 998 1.0023268375E+00 3.1610397637E-03 8.2376051973E-01
    < 3 998 1.0028272784E+00 3.3553855716E-03 8.2416912193E-01
    < MD5 sum (including results) = 'D13472F0E400812ED2FD51C2F4044929'
    ---
    > 1 998 1.0042238250E+00 3.3543815528E-03 8.2620616000E-01
    > 2 998 1.0022973122E+00 3.1550223134E-03 8.2357720873E-01
    > 3 998 1.0038188157E+00 3.2285965190E-03 8.2482448269E-01
    > MD5 sum (including results) = 'D2AC1DEBDEACAC93ADF5A2F2158ECAD5'
    72,74c72,74
    < [vamp] 1 1000 0.0000E+00(0.00E+00) 1.017049E+00(6.90E-03) 0.0 0.000
    < [vamp] 2 999 0.0000E+00(0.00E+00) 9.965207E-01(5.42E-03) 0.0 0.000
    < [vamp] 3 998 0.0000E+00(0.00E+00) 9.987828E-01(4.21E-03) 0.0 0.000
    ---
    > [vamp] 1 1000 0.0000E+00(0.00E+00) 1.017706E+00(6.88E-03) 0.0 0.000
    > [vamp] 2 999 0.0000E+00(0.00E+00) 9.964813E-01(5.42E-03) 0.0 0.000
    > [vamp] 3 998 0.0000E+00(0.00E+00) 1.000183E+00(4.10E-03) 0.0 0.000
    80,82c80,82
    < chan#001 1 500 1.7529E+00(7.34E-03) 1.752880E+00(7.34E-03) 0.0 0.974
    < chan#001 2 681 1.2397E+00(5.97E-03) 1.239680E+00(5.97E-03) 0.0 0.913
    < chan#001 3 732 1.1362E+00(4.19E-03) 1.136206E+00(4.19E-03) 0.0 0.881
    ---
    > chan#001 1 500 1.7542E+00(7.24E-03) 1.754194E+00(7.24E-03) 0.0 0.975
    > chan#001 2 681 1.2396E+00(5.96E-03) 1.239574E+00(5.96E-03) 0.0 0.918
    > chan#001 3 732 1.1381E+00(4.00E-03) 1.138094E+00(4.00E-03) 0.0 0.882
    87,88c87,88
    < chan#002 2 318 4.7579E-01(1.12E-02) 4.757926E-01(1.12E-02) 0.0 0.354
    < chan#002 3 266 6.2061E-01(1.08E-02) 6.206118E-01(1.08E-02) 0.0 0.490
    ---
    > chan#002 2 318 4.7590E-01(1.13E-02) 4.758967E-01(1.13E-02) 0.0 0.354
    > chan#002 3 266 6.2067E-01(1.08E-02) 6.206711E-01(1.08E-02) 0.0 0.490
    94,96c94,96
    < [vamp] 1 998 0.0000E+00(0.00E+00) 1.002423E+00(3.76E-03) 0.0 0.000
    < [vamp] 2 998 0.0000E+00(0.00E+00) 1.000576E+00(3.46E-03) 0.0 0.000
    < [vamp] 3 998 0.0000E+00(0.00E+00) 9.980650E-01(3.41E-03) 0.0 0.000
    ---
    > [vamp] 1 998 0.0000E+00(0.00E+00) 1.003230E+00(3.66E-03) 0.0 0.000
    > [vamp] 2 998 0.0000E+00(0.00E+00) 1.000669E+00(3.46E-03) 0.0 0.000
    > [vamp] 3 998 0.0000E+00(0.00E+00) 9.980702E-01(3.41E-03) 0.0 0.000
    102,104c102,104
    < chan#001 1 732 1.1229E+00(3.80E-03) 1.122894E+00(3.80E-03) 0.0 0.913
    < chan#001 2 732 1.0938E+00(2.75E-03) 1.093787E+00(2.75E-03) 0.0 0.854
    < chan#001 3 732 1.0773E+00(3.62E-03) 1.077316E+00(3.62E-03) 0.0 0.886
    ---
    > chan#001 1 732 1.1239E+00(3.61E-03) 1.123946E+00(3.61E-03) 0.0 0.923
    > chan#001 2 732 1.0938E+00(2.75E-03) 1.093848E+00(2.75E-03) 0.0 0.854
    > chan#001 3 732 1.0773E+00(3.61E-03) 1.077268E+00(3.61E-03) 0.0 0.886
    108,110c108,110
    < chan#002 1 266 6.7090E-01(9.46E-03) 6.709005E-01(9.46E-03) 0.0 0.555
    < chan#002 2 266 7.4407E-01(1.05E-02) 7.440732E-01(1.05E-02) 0.0 0.599
    < chan#002 3 266 7.7998E-01(8.04E-03) 7.799765E-01(8.04E-03) 0.0 0.647
    ---
    > chan#002 1 266 6.7103E-01(9.47E-03) 6.710334E-01(9.47E-03) 0.0 0.555
    > chan#002 2 266 7.4425E-01(1.05E-02) 7.442503E-01(1.05E-02) 0.0 0.599
    > chan#002 3 266 7.8013E-01(8.04E-03) 7.801289E-01(8.04E-03) 0.0 0.647
    116,118c116,118
    < [vamp] 1 998 0.0000E+00(0.00E+00) 1.004172E+00(3.37E-03) 0.0 0.000
    < [vamp] 2 998 0.0000E+00(0.00E+00) 1.002327E+00(3.16E-03) 0.0 0.000
    < [vamp] 3 998 0.0000E+00(0.00E+00) 1.002827E+00(3.36E-03) 0.0 0.000
    ---
    > [vamp] 1 998 0.0000E+00(0.00E+00) 1.004224E+00(3.35E-03) 0.0 0.000
    > [vamp] 2 998 0.0000E+00(0.00E+00) 1.002297E+00(3.16E-03) 0.0 0.000
    > [vamp] 3 998 0.0000E+00(0.00E+00) 1.003819E+00(3.23E-03) 0.0 0.000
    124,126c124,126
    < chan#001 1 732 1.0832E+00(3.54E-03) 1.083220E+00(3.54E-03) 0.0 0.890
    < chan#001 2 732 1.0852E+00(3.30E-03) 1.085204E+00(3.30E-03) 0.0 0.892
    < chan#001 3 732 1.0807E+00(3.72E-03) 1.080739E+00(3.72E-03) 0.0 0.888
    ---
    > chan#001 1 732 1.0832E+00(3.51E-03) 1.083234E+00(3.51E-03) 0.0 0.890
    > chan#001 2 732 1.0851E+00(3.29E-03) 1.085110E+00(3.29E-03) 0.0 0.891
    > chan#001 3 732 1.0820E+00(3.51E-03) 1.082037E+00(3.51E-03) 0.0 0.889
    130,132c130,132
    < chan#002 1 266 7.8664E-01(8.08E-03) 7.866428E-01(8.08E-03) 0.0 0.650
    < chan#002 2 266 7.7426E-01(7.61E-03) 7.742583E-01(7.61E-03) 0.0 0.637
    < chan#002 3 266 7.8842E-01(7.32E-03) 7.884231E-01(7.32E-03) 0.0 0.648
    ---
    > chan#002 1 266 7.8680E-01(8.08E-03) 7.867974E-01(8.08E-03) 0.0 0.650
    > chan#002 2 266 7.7441E-01(7.61E-03) 7.744080E-01(7.61E-03) 0.0 0.637
    > chan#002 3 266 7.8857E-01(7.32E-03) 7.885731E-01(7.32E-03) 0.0 0.648
comment:13 Changed 8 years ago by
OK, so it's not a crash, it's numerical noise. This happens only with extended prec, right?
Maybe some register/RAM storage sequence is affected by those code changes. Maybe this causes a precision loss in 4.7 but not in later versions.
That would require tedious print-statement debugging. If the problem is tied just to that particular configuration, my recommendation would be to update the footnote to manual Sec. 2.2.4 (which already mentions gfortran 4.7 series) and close the ticket as wontfix.
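To put the "numerical noise" reading in perspective, here is a minimal sketch that compares the shift in the final integral from the mci_vamp_7 diff with the Monte Carlo error quoted in the same output. The three numbers are copied from that diff; the program name, the kind parameter `dp` and the variable names are only illustrative.

```fortran
program pull_check
  implicit none
  ! Illustrative double-precision kind; not taken from WHIZARD
  integer, parameter :: dp = selected_real_kind (15)
  ! Values copied from the mci_vamp_7 diff in comment:12
  real(dp), parameter :: integral_out = 9.9954598787e-01_dp  ! failing build
  real(dp), parameter :: integral_ref = 9.9977448991e-01_dp  ! reference file
  real(dp), parameter :: error_ref    = 3.7774467176e-03_dp  ! reference MC error
  real(dp) :: shift, ratio

  ! Absolute difference between the two integrals
  shift = abs (integral_out - integral_ref)
  ! Same difference in units of the quoted Monte Carlo error
  ratio = shift / error_ref

  print "(A,ES12.4)", "absolute shift   = ", shift
  print "(A,F8.4)",   "shift / MC error = ", ratio
end program pull_check
```

The shift is a small fraction of the quoted error, so the integration result itself is not in question. The unit tests still fail because they compare the output text (including the MD5 sums) against the reference files exactly.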
comment:14 Changed 8 years ago by
Probably correct. And the reason why quadruple precision is not affected is that this is working via the libquadmath implementation anyways. Unfortunately, it is not that easy, as I would lose my whole testing setup at DESY. Or it would be easy, as I would switch to Ubuntu machines and the preinstalled/-compiled 4.8.X. I will do a recompilation on the DESY system, but that old SL6.X system is anyhow a pain. Funnily it works on the 32bit machine and SL5.
comment:15 Changed 8 years ago by
AHA!!!! WK, you were right! It is not really the presence of the two components mu_plus and mu_minus in the type vamp_grid_t, but setting those components later on. Getting closer to the true start of the problem.
comment:16 Changed 8 years ago by
Summary: | mci_vamp unit tests fail with gfortran 4.7.4 → mci_vamp unit tests fail with gfortran 4.7.4/4.8.3 on SL6.7 |
---|
Actually that one is the culprit:
    g%mu_plus(2) = g%mu_plus(2) * g%dv2g
    if (g%mu_plus(2) < eps * max (g%mu_plus(1)**2, 1._default)) then
       g%mu_plus(2) = eps * max (g%mu_plus(1)**2, 1._default)
    end if
    g%mu_minus(2) = g%mu_minus(2) * g%dv2g
    if (g%mu_minus(2) < eps * max (g%mu_minus(1)**2, 1._default)) then
       g%mu_minus(2) = eps * max (g%mu_minus(1)**2, 1._default)
    end if

Moving that to the end of the subroutine doesn't help, unfortunately.
comment:17 Changed 8 years ago by
That means, if you comment out those lines, the vamp tests agree (except for the new one, which wouldn't work then), and as given, they disagree.
There may be hidden temporaries involved. For instance, since g%dv2g is referenced several times (I think this is an array), it could be copied.
One suggestion: instead of using array notation, what about expanding in explicit loops, at least for the extra code? If the elements are accessed one at a time, there might be no temporary generated.
comment:18 Changed 8 years ago by
... no, those are scalars. Forget the part about loops. It may be registers where temporaries are stored (or not).
comment:19 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Solved by wrapping the culprit lines in a negative weights if-clause. Still have to recheck some of the files and all the different precisions. Closing for now.
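The fix described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual commit: the type `grid_t` is a hypothetical stand-in for vamp_grid_t that contains just the components named in this ticket, and the module name, the routine `rescale_mu`, the flag `negative_weights` and all numerical values in the driver are made up.

```fortran
module grid_sketch
  implicit none
  ! Assumed working precision, named as in the snippet of comment:16
  integer, parameter :: default = selected_real_kind (15)

  ! Hypothetical stand-in for vamp_grid_t (assumption: mu_plus and
  ! mu_minus have two elements each, dv2g is a scalar)
  type :: grid_t
     real(default), dimension(2) :: mu_plus = 0._default
     real(default), dimension(2) :: mu_minus = 0._default
     real(default) :: dv2g = 1._default
  end type grid_t

contains

  ! Rescale the second elements of mu_plus/mu_minus and clamp them from
  ! below, but only when negative weights are requested (hypothetical flag).
  subroutine rescale_mu (g, eps, negative_weights)
    type(grid_t), intent(inout) :: g
    real(default), intent(in) :: eps
    logical, intent(in) :: negative_weights
    if (negative_weights) then
       g%mu_plus(2) = g%mu_plus(2) * g%dv2g
       if (g%mu_plus(2) < eps * max (g%mu_plus(1)**2, 1._default)) then
          g%mu_plus(2) = eps * max (g%mu_plus(1)**2, 1._default)
       end if
       g%mu_minus(2) = g%mu_minus(2) * g%dv2g
       if (g%mu_minus(2) < eps * max (g%mu_minus(1)**2, 1._default)) then
          g%mu_minus(2) = eps * max (g%mu_minus(1)**2, 1._default)
       end if
    end if
  end subroutine rescale_mu

end module grid_sketch

program grid_sketch_demo
  use grid_sketch
  implicit none
  type(grid_t) :: g

  ! Made-up input values, for illustration only
  g%mu_plus  = [0.5_default, 2._default]
  g%mu_minus = [0.1_default, 4._default]
  g%dv2g = 0.25_default

  ! Flag off: the guarded statements are skipped, the grid is untouched
  call rescale_mu (g, 1.e-6_default, .false.)
  print *, "flag off:", g%mu_plus(2), g%mu_minus(2)

  ! Flag on: the rescaling and the lower bound are applied
  call rescale_mu (g, 1.e-6_default, .true.)
  print *, "flag on: ", g%mu_plus(2), g%mu_minus(2)
end program grid_sketch_demo
```

With the guard in place, runs that do not request negative weights never execute the extra statements, which matches the observation in comment:17 that the tests agree once those lines are inactive.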
This is so weird, it works on 32bit. Did a recompilation (make clean, not make distclean) on 64bit. Still the same result. mu_minus and mu_plus cannot affect the results of these tests; this here is the only code change by WK that could affect this test: