
Opened 11 years ago

Closed 11 years ago

#485 closed defect (fixed)

I/O error on parallel output

Reported by: Juergen Reuter
Owned by: kilian
Priority: P0
Milestone: v2.2.0
Component: core
Version: 2.1.1
Severity: blocker
Keywords:
Cc:

Description

The WHIZARD unit tests do not work with NAGFOR, when performing

make -j distcheck

Change History (28)

comment:1 Changed 11 years ago by Juergen Reuter

Actually, I cannot reproduce the strange error message on Jenkins, but when I run it explicitly I get:

Running script ./prclib_interfaces.run
Runtime Error: I/O error on unit 11: No such file or directory
Program terminated by fatal I/O error
whizard.f90, line 92: Error occurred in WHIZARD:WHIZARD_CHECK
main.f90, line 336: Called by MAIN
Aborted

The funny thing is that it works when running

make -j check

Could it be a clash of output units?

comment:2 Changed 11 years ago by Juergen Reuter

Happens also for make -j check, same error message.

comment:3 Changed 11 years ago by Juergen Reuter

I think I understand the problem now, but I'm tired of typing things in and never reaching anyone. Please write me a letter to Alpha Centauri.

comment:4 Changed 11 years ago by Juergen Reuter

Priority: P3 → P0
Severity: normal → blocker

Actually, this also affects using WHIZARD in batch mode.

comment:5 Changed 11 years ago by Juergen Reuter

Summary: Catch strange distcheck error with NAGFOR 5.3 → I/O error on parallel output

OK, no one is interested ... going home now

comment:6 Changed 11 years ago by kilian

Thanks for checking. This is NOT a critical issue as long as the program doesn't do anything useful anyway, but if it's indeed the I/O units, should be easy to fix.

comment:7 Changed 11 years ago by kilian

Turning to this problem:

(1) there was an unrelated race condition due to a hard-coded log filename. This is fixed in r4051.

(2) However, the crash on Jenkins is completely mysterious. Can anybody reproduce it anywhere? Also, Google doesn't yield anything useful on the Make error message "read jobs pipe: no such file or directory". Note: WHIZARD is calling Make here, with no options, while the WHIZARD job itself is run from make with the -j option. This is indirect recursion, but at the OS level. Problem?
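
For item (1), the thread does not say how r4051 removed the hard-coded log filename. One common way to avoid this kind of clash between parallel test jobs is to derive the log name from the test name and the process ID. A minimal Python sketch under that assumption (`log_filename` is a hypothetical helper, not WHIZARD code):

```python
import os
from typing import Optional


def log_filename(test_name: str, pid: Optional[int] = None) -> str:
    """Build a log filename that is unique per test and per process.

    Hypothetical helper: two jobs running at the same time under make -j
    get different files instead of sharing one hard-coded name.
    """
    if pid is None:
        pid = os.getpid()  # ID of the current process
    return f"{test_name}.{pid}.log"


if __name__ == "__main__":
    print(log_filename("prclib_interfaces"))
```

With a fixed name, two concurrent jobs can open, truncate, or delete each other's file; with a per-process name they cannot collide as long as they run on the same machine.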

comment:8 Changed 11 years ago by Juergen Reuter

The problem I reported in this ticket seems to be solved by WK's fix. Make check and make distcheck now seem to work. So it must be the make invoked by Jenkins that misbehaves.

comment:9 Changed 11 years ago by Juergen Reuter

Now it gets a bit further but runs into this error again. All I found on Google was that the -j option is always responsible. But the response from the make team was always that it is a race condition in the makefiles :(

comment:10 Changed 11 years ago by kilian

New result: removing the -j2 option allows the check with nagfor to run through.

Speculation: on the Jenkins machine (6 cores) there were five jobs running concurrently, each one running make -j2 (and each one calling make again by indirect recursion). The first one (gfortran 4.6.0) always crashes, so four parallel jobs were running. The last one, namely the nagfor job, had the problem. Related to the number of available cores?

Currently checking this by varying the number of concurrent jobs.

comment:11 Changed 11 years ago by kilian

No, it's not the number of cores.

comment:12 Changed 11 years ago by Juergen Reuter

Ok, I lost track here, what is the status now? What are we going to do?

comment:13 Changed 11 years ago by kilian

The problem occurs only with the nagfor build, if make -j is in effect. For no apparent reason. After all, this happens during a Make call, executed by a shell spawned by SYSTEM. No reference to the compiler ...

None of us has seen a problem on a system other than Jenkins. So a workaround is to switch off the -j option for the NAG build on Jenkins. Users wouldn't be affected, since this is an artefact of the test setup anyway. Unfortunately, we still don't have a clue ... Opinions?

comment:14 Changed 11 years ago by Juergen Reuter

Well, if that is configurable ... but we need the make -j test for the other jobs! I lost track of the different compilers somehow.

comment:15 Changed 11 years ago by kilian

With the other jobs, make -j seems to work. So we'll change the configuration accordingly.

comment:16 Changed 11 years ago by Juergen Reuter

Actually, the problem seems to be known and known to be difficult: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614701

comment:17 Changed 11 years ago by Juergen Reuter

Here is more debugging info, let's see whether it helps:

/var/lib/jenkins/racecond_test/trunk/build/src/omega/bin/omega_QED.opt -o omega2_a_i1.f90 -target:whizard -target:parameter_module parameters_QED -target:module omega2_a_i1 -target:md5sum '                                ' -fusion:progress -scatter 'e- e+ -> e- e+'
Putting child 0x01f42560 (omega2_a_i1.f90) PID 16408 on the chain.
      Commands of `omega2_a_i1.f90' are being run.
     Finished prerequisites of target file `omega2_a_i1.lo'.
    The prerequisites of `omega2_a_i1.lo' are being made.
    Considering target file `omega2_a_i2.lo'.
     File `omega2_a_i2.lo' does not exist.
      Considering target file `omega2_a_i2.f90'.
       File `omega2_a_i2.f90' does not exist.
       Finished prerequisites of target file `omega2_a_i2.f90'.
      Must remake target `omega2_a_i2.f90'.
Live child 0x01f42560 (omega2_a_i1.f90) PID 16408
Need a job token; we have children
Live child 0x01f42560 (omega2_a_i1.f90) PID 16408
make[4]: *** read jobs pipe: No such file or directory.  Stop.
make[4]: *** Waiting for unfinished jobs....
Live child 0x01f42560 (omega2_a_i1.f90) PID 16408
[1/1] e- e+ -> e- e+ ... allowed. [time: 0.00 secs, total: 0.00 secs, remaining: 0.00 secs]
all processes done. [total time: 0.00 secs]
SUMMARY: 6 fusions, 2 propagators, 2 diagrams
Reaping winning child 0x01f42560 PID 16408
Removing child 0x01f42560 PID 16408 from chain.
make[4]: Leaving directory `/var/lib/jenkins/racecond_test/trunk/build/test'
Running test: omega_interface_1 ... success.
Running test: omega_interface_2| command: make compile -f omega2.makefile
| Return code = 512
******************************************************************************
******************************************************************************
*** FATAL ERROR: System command returned with nonzero status code
******************************************************************************
******************************************************************************
WHIZARD run aborted.
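
The `Return code = 512` in the log above is consistent with a raw wait status as returned by system(3) on Linux, where the exit code sits in the high byte: 512 >> 8 = 2, and 2 is the exit status GNU make uses when it hits an error. A minimal Python sketch of that decoding, assuming the number WHIZARD prints is the unmodified wait status (the thread does not confirm this; `exit_code_from_wait_status` is a hypothetical helper):

```python
import os


def exit_code_from_wait_status(status: int) -> int:
    """Decode a POSIX wait status into an exit code.

    Returns the exit code for a normal exit, or the negated signal
    number if the child was killed by a signal.
    """
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    raise ValueError(f"unrecognized wait status: {status}")


if __name__ == "__main__":
    print(exit_code_from_wait_status(512))
```

Read this way, the sub-make itself terminated with an error, which matches the `read jobs pipe` message a few lines earlier in the same log.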

comment:18 Changed 11 years ago by Juergen Reuter

This is strange: why does it issue a double compile command, and why does it want to compile the code BEFORE it is being made by O'Mega:

make -n -j20 compile -f omega2.makefile
/var/lib/jenkins/racecond_test/trunk/build/libtool --mode=compile nagfor -c -I/var/lib/jenkins/racecond_test/trunk/build/src/models -I/var/lib/jenkins/racecond_test/trunk/build/src/omega/src -I/var/lib/jenkins/racecond_test/trunk/build/src/whizard-core -I/var/lib/jenkins/racecond_test/trunk/build/src/misc -C=all -nan -gline -w -PIC omega2_a_i1.f90
/var/lib/jenkins/racecond_test/trunk/build/src/omega/bin/omega_QED.opt -o omega2_a_i2.f90 -target:whizard -target:parameter_module parameters_QED -target:module omega2_a_i2 -target:md5sum '                                ' -target:openmp -scatter 'e- e+ -> e- e+' -cascade '3+4~A' -fusion:progress_file omega2.log
/var/lib/jenkins/racecond_test/trunk/build/libtool --mode=compile nagfor -c -I/var/lib/jenkins/racecond_test/trunk/build/src/models -I/var/lib/jenkins/racecond_test/trunk/build/src/omega/src -I/var/lib/jenkins/racecond_test/trunk/build/src/whizard-core -I/var/lib/jenkins/racecond_test/trunk/build/src/misc -C=all -nan -gline -w -PIC omega2_a_i2.f90
/var/lib/jenkins/racecond_test/trunk/build/libtool --mode=compile nagfor -c -I/var/lib/jenkins/racecond_test/trunk/build/src/models -I/var/lib/jenkins/racecond_test/trunk/build/src/omega/src -I/var/lib/jenkins/racecond_test/trunk/build/src/whizard-core -I/var/lib/jenkins/racecond_test/trunk/build/src/misc -C=all -nan -gline -w -PIC omega2.f90

comment:19 Changed 11 years ago by Juergen Reuter

Funny thing is: make -j works; make -j2, make -j3 and so on don't.

comment:20 Changed 11 years ago by Juergen Reuter

Wow, it seems it is some issue with the generation of the Fortran code. Using a statement:

.NOTPARALLEL: source

in the Makefile generated by WHIZARD solves the problem. There is another fishy issue with the way the sources are produced: since the rule is source: $(SOURCES) and $(SOURCES) is filled in only later, it seems that only make compile triggers the code generation:

 make -n source -f omega2.makefile
make: Nothing to be done for `source'.

This is clearly NOT what was intended! I cannot really prove it, but it looks like the generation of the Fortran code by O'Mega causes the race condition. If that's the case, could it be that the NAG compiler is faster in compilation and so demands the .f90 file being present earlier, and that's the reason why it occurs with nagfor but not gfortran?

If we don't solve the race condition, my suggestion would be to include this .NOTPARALLEL flag such that we can run all compilers with the same -jN flag.

Opinions?

comment:21 in reply to:  18 Changed 11 years ago by kilian

Replying to jr_reuter:

This is strange: why does it issue a double compile command, and why does it want to compile the code BEFORE it is being made by O'Mega:

Looks perfectly ok. In your test case, the f90 file generated for the process omega2_a_i1 was apparently still present, so no need to remake it. All three sources omega2_a_i1, omega2_a_i2, and omega2_a are compiled exactly once.

comment:22 in reply to:  19 Changed 11 years ago by kilian

Replying to jr_reuter:

Funny thing is: make -j works; make -j2, make -j3 and so on don't.

Important observation! Yes, I can reproduce this with -j2 on my PC. Previously, I only tried -j, no error.

comment:23 in reply to:  20 Changed 11 years ago by kilian

Replying to jr_reuter:

There is another fishy issue with the way the sources are produced: since the rule is source: $(SOURCES) and $(SOURCES) is filled in only later,

it seems that only make compile triggers the code generation:

 make -n source -f omega2.makefile
make: Nothing to be done for `source'.

This is clearly NOT what was intended!

Yes. Is there anybody who REALLY understands how to properly set up a Makefile? I'll fix this, although it probably doesn't solve the -j2 issue.

comment:24 Changed 11 years ago by kilian

OK, here is the culprit.

Running make -j check:

MAKEFLAGS='wj -- TEST_LOGS=prclib_interfaces.run.log\ process_libraries.run.log\ test_me.run.log\ processes.run.log\ omega_interface.run.log'
MAKELEVEL='4'
MAKEOVERRIDES='${-*-command-variables-*-}'
MFLAGS='-wj'

Running make -j2 check:

MAKEFLAGS='w --jobserver-fds=3,4 -j -- TEST_LOGS=prclib_interfaces.run.log\ process_libraries.run.log\ test_me.run.log\ processes.run.log\ omega_interface.run.log'
MAKELEVEL='4'
MAKEOVERRIDES='${-*-command-variables-*-}'
MFLAGS='-w --jobserver-fds=3,4 -j'

The --jobserver option has failed, obviously. Unfortunately, this option is not documented.

The Make manual has an interesting paragraph:

‘warning: -jN forced in submake: disabling jobserver mode.’

This warning and the next are generated if make detects error conditions related to parallel processing on systems where sub-makes can communicate (see Communicating Options to a Sub-make). This warning is generated if a recursive invocation of a make process is forced to have ‘-jN’ in its argument list (where N is greater than one). This could happen, for example, if you set the MAKE environment variable to ‘make -j2’. In this case, the sub-make doesn't communicate with other make processes and will simply pretend it has two jobs of its own.

‘warning: jobserver unavailable: using -j1. Add `+' to parent make rule.’

In order for make processes to communicate, the parent will pass information to the child. Since this could result in problems if the child process isn't actually a make, the parent will only do this if it thinks the child is a make. The parent uses the normal algorithms to determine this (see How the MAKE Variable Works). If the makefile is constructed such that the parent doesn't know the child is a make process, then the child will receive only part of the information necessary. In this case, the child will generate this warning message and proceed with its build in a sequential manner.

I guess that the sub-make should have acted according to the second warning, but failed to do so and crashed instead.
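
The two MAKEFLAGS dumps above differ only in the `--jobserver-fds=3,4` entry. One plausible reading, not confirmed in the thread, is that the sub-make inherits MAKEFLAGS through the environment but not the two pipe descriptors it names, since it is started by a shell spawned from WHIZARD rather than directly by the parent make. The Python sketch below shows how such a situation could be diagnosed: it parses the option exactly as spelled in the dump and checks whether the named descriptors are open in the current process. `jobserver_fds` and `fds_are_open` are hypothetical helpers.

```python
import os
import re
from typing import Iterable, Optional, Tuple


def jobserver_fds(makeflags: str) -> Optional[Tuple[int, int]]:
    """Return (read_fd, write_fd) if MAKEFLAGS carries --jobserver-fds=R,W."""
    match = re.search(r"--jobserver-fds=(\d+),(\d+)", makeflags)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def fds_are_open(fds: Iterable[int]) -> bool:
    """Return True only if every descriptor is open in this process."""
    for fd in fds:
        try:
            os.fstat(fd)  # raises OSError (EBADF) for a closed descriptor
        except OSError:
            return False
    return True


if __name__ == "__main__":
    fds = jobserver_fds(os.environ.get("MAKEFLAGS", ""))
    if fds is None:
        print("no jobserver requested in MAKEFLAGS")
    elif fds_are_open(fds):
        print(f"jobserver descriptors {fds} are open")
    else:
        print(f"MAKEFLAGS names descriptors {fds}, but they are not open")
```

Note that an open descriptor is not proof that it is the jobserver pipe; the check only distinguishes "definitely missing" from "possibly present".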

comment:25 Changed 11 years ago by Juergen Reuter

OK, cool. Does this mean we will have an obvious solution to this soon-ish?

comment:26 Changed 11 years ago by kilian

The issue should be fixed in r4063. For the critical tests (and probably any further tests that call make), I set the "-j1" option explicitly.

However, the whole problem is probably specific to GNU make. A make that doesn't recognize -j1 may fail. I suggest adding a configure check and using the result in the test cases.

comment:27 Changed 11 years ago by Juergen Reuter

There is an autoconf macro which apparently checks whether make is GNU make:

http://www.gnu.org/software/autoconf-archive/ax_check_gnu_make.html#ax_check_gnu_make

Question is whether this is sufficient.
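
The idea from the last two comments (probe for GNU make, then pass `-j1` to the sub-make only if it is GNU make) can be sketched in Python as follows. This is an illustration of the logic only, not the autoconf macro and not the code committed in r4063; `is_gnu_make` and `submake_command` are hypothetical helpers, and the probe assumes that GNU make prints "GNU Make" in its `--version` output.

```python
import shutil
import subprocess
from typing import List


def is_gnu_make(make: str = "make") -> bool:
    """Return True if `make --version` identifies itself as GNU Make."""
    if shutil.which(make) is None:
        return False  # no such program on PATH
    try:
        result = subprocess.run(
            [make, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return "GNU Make" in result.stdout


def submake_command(target: str, makefile: str, gnu_make: bool) -> List[str]:
    """Build the sub-make command line, forcing -j1 only for GNU make."""
    cmd = ["make"]
    if gnu_make:
        cmd.append("-j1")  # sequential sub-make, as set explicitly in the fix
    cmd += [target, "-f", makefile]
    return cmd


if __name__ == "__main__":
    print(submake_command("compile", "omega2.makefile", is_gnu_make()))
```

Keeping the probe separate from the command construction means the check runs once (at configure time in the real build) and the result is simply consumed by every test that calls make.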

comment:28 Changed 11 years ago by Juergen Reuter

Resolution: fixed
Status: new → closed

This should be settled finally in r4064. There may still be room for improvement (e.g. running only the test with the -j1 flag), but now everything is triggered by the check for GNU make. Closing.
