whizard is hosted by Hepforge, IPPP Durham

Opened 10 years ago

Closed 10 years ago

#688 closed defect (worksforme)

Probable race condition in WHIZARD core build

Reported by: Juergen Reuter
Owned by: kilian
Priority: P0
Milestone: v2.2.3
Component: core
Version: 2.2.2
Severity: blocker
Keywords:
Cc:

Description

On both AFS file systems and old scratch areas, fresh WHIZARD builds fail:

for src in particle_specifiers.f90 analysis.f90 pdg_arrays.f90 jets.f90 subevents.f90 variables.f90 expr_base.f90; do \
	  /afs/desy.de/group/theorie/software/ELF64/bin/notangle -R[[$src]] ../../../src/noweb-frame/whizard-prelude.nw ../../../src/types/types.nw ../../../src/noweb-frame/whizard-postlude.nw | /afs/desy.de/group/theorie/software/ELF64/bin/cpif $src; \
        done
for src in particle_specifiers.f90 analysis.f90 pdg_arrays.f90 jets.f90 subevents.f90 variables.f90 expr_base.f90; do \
	  /afs/desy.de/group/theorie/software/ELF64/bin/notangle -R[[$src]] ../../../src/noweb-frame/whizard-prelude.nw ../../../src/types/types.nw ../../../src/noweb-frame/whizard-postlude.nw | /afs/desy.de/group/theorie/software/ELF64/bin/cpif $src; \
        done
undefined chunk name: <<Expr base: procedures>>
undefined chunk name: <<Expr base: procedures>>

mv: cannot stat `types.tmp': No such file or directory
make[3]: *** [types.stamp] Error 1
make[3]: Leaving directory `/afs/desy.de/group/theorie/software/packages/whizard_extended/build/src/types'
make[2]: *** [expr_base.f90] Error 2
make[2]: *** Waiting for unfinished jobs....

This is most probably a race condition.
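
One interleaving that would produce the mv error above can be replayed in plain shell. The sketch assumes that two instances of the step that creates types.tmp and later renames it to types.stamp run at the same time and share the same types.tmp; whichever finishes second then finds the file already renamed. This is an illustration under that assumption, not a confirmed diagnosis. The throwaway directory and the names a_status, b_status and mv_error.log are made up for the demo, and the notangle loop is left out.

```shell
#!/bin/sh
# Deterministic replay of one possible interleaving of two instances
# (A and B) of the stamp step. Both use the same types.tmp.
workdir=$(mktemp -d)   # throwaway directory (hypothetical)
cd "$workdir" || exit 1

rm -f types.tmp; touch types.tmp    # instance A starts
rm -f types.tmp; touch types.tmp    # instance B starts on the same file

mv -f types.tmp types.stamp         # A finishes: types.tmp is renamed away
a_status=$?

mv -f types.tmp types.stamp 2>mv_error.log   # B finishes: nothing left to rename
b_status=$?

echo "A exit status: $a_status"
echo "B exit status: $b_status"
cat mv_error.log
```

The second mv fails because its source no longer exists, which matches the "cannot stat" message in the log. The sketch does not try to explain the "undefined chunk name" lines.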

Change History (10)

comment:1 Changed 10 years ago by kilian

The notangle command should not be executed twice. But ... why didn't this happen before?

This was with make -j, right?

comment:2 Changed 10 years ago by Juergen Reuter

Yep, exactly. But AFAIK Jenkins runs make -j2, whereas I ran make -j. Rerunning make after the error then works without problems.

comment:3 Changed 10 years ago by kilian

Does it always occur in that subdir? AFAICS there is no difference in logic compared to the other subdirs.

comment:4 Changed 10 years ago by kilian

Looking at this,

types.stamp: $(PRELUDE) $(srcdir)/types.nw $(POSTLUDE)
	@rm -f types.tmp
	@touch types.tmp
	for src in $(libtypes_la_SOURCES); do \
	  $(NOTANGLE) -R[[$$src]] $^ | $(CPIF) $$src; \
        done
	@mv -f types.tmp types.stamp

$(libtypes_la_SOURCES): types.stamp
## Recover from the removal of $@
	@if test -f $@; then :; else \
	  rm -f types.stamp; \
	  $(MAKE) $(AM_MAKEFLAGS) types.stamp; \
	fi

the code was not designed with parallel make in mind. Just strange that it didn't bite us before. We have had this in whizard-core/Makefile for ages. I don't remember where we got it from, but it wasn't our design.

Any idea? Is there a possibility to mark Makefile sections as critical, so they are executed serially?
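
One common way to get a critical section out of plain shell is a lock directory: mkdir either creates the directory or fails, so it can serve as a test-and-set. The Automake manual describes a lock-directory variant of the stamp idiom along these lines. The following is a minimal standalone sketch of the idea, not the WHIZARD Makefile and not a tested fix for this ticket. The names types.lock, generated.log, regenerate and guarded_regenerate are hypothetical, and the notangle loop is replaced by a stand-in.

```shell
#!/bin/sh
# Sketch of a lock-directory critical section around a stamp step.
workdir=$(mktemp -d)   # throwaway directory (hypothetical)
cd "$workdir" || exit 1

regenerate() {
    # Stand-in for the stamp recipe: work on a temp file, then rename it.
    rm -f types.tmp
    touch types.tmp
    echo "regenerated by caller $1" >> generated.log
    mv -f types.tmp types.stamp
}

guarded_regenerate() {
    # Only one caller at a time can create the lock directory.
    if mkdir types.lock 2>/dev/null; then
        status=0
        # Skip the work if an earlier lock holder already made the stamp.
        test -f types.stamp || regenerate "$1" || status=$?
        rmdir types.lock
        return "$status"
    else
        # Followers wait for the lock to disappear, then succeed only
        # if the stamp file exists.
        while test -d types.lock; do sleep 1; done
        test -f types.stamp
    fi
}

# Four concurrent callers, roughly what a parallel make might trigger.
for caller in 1 2 3 4; do
    guarded_regenerate "$caller" &
done
wait

runs=$(wc -l < generated.log | tr -d ' ')
echo "regenerate ran $runs time(s)"
```

In a Makefile, the same logic would wrap the recovery call to $(MAKE) $(AM_MAKEFLAGS) types.stamp. The sketch omits signal handling, so an interrupted run could leave a stale types.lock behind.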

comment:5 Changed 10 years ago by Juergen Reuter

Could this part of the GNU make manual help us?

.NOTPARALLEL

    If .NOTPARALLEL is mentioned as a target, then this invocation of make will be run serially, even if the ‘-j’ option is given. Any recursively invoked make command will still run recipes in parallel (unless its makefile also contains this target). Any prerequisites on this target are ignored. 

This seems to be only GNU make, tho. But maybe that would be an option: Put the stamp stuff into a separate Makefile Makefile.<web>.stamp and then do something like:

generate_<web>_stamp:
	$(MAKE) -j1 -f Makefile.<web>.stamp

comment:6 Changed 10 years ago by kilian

Not convinced, unfortunately.

The .NOTPARALLEL target applies to the whole Makefile, not just to a section, if I understand the description. And I don't think the 'stamp' idiom would do its job if it appears in a sub-make.

comment:7 Changed 10 years ago by Juergen Reuter

Funnily, after the instances this morning this never happened again, despite several attempts to do complete recompilations. If this is a race condition, why did it never happen before? So what do we do about this ticket?

comment:8 Changed 10 years ago by Juergen Reuter

Couldn't we just replace this line

$(MAKE) $(AM_MAKEFLAGS) types.stamp;

by

$(MAKE) $(AM_MAKEFLAGS) -j1 types.stamp;

?

Last edited 10 years ago by Juergen Reuter

comment:9 in reply to:  7 Changed 10 years ago by kilian

Replying to jr_reuter:

> Funnily, after the instances this morning this never happened again, despite several attempts to do complete recompilations. If this is a race condition, why did it never happen before?

Exactly the question I had. It may be particularly bad timing, e.g. checking types.stamp just between its deletion and re-creation.

Or, just by chance, could it be a wallclock timing mismatch between the build machine and the afs or file server? I had this some time ago, also with strange and unpredictable results.

> So what do we do about this ticket?

If it can't be reproduced, I'd ignore it for the moment and proceed. But watch out.

comment:10 Changed 10 years ago by Juergen Reuter

Resolution: worksforme
Status: new → closed

As this doesn't bite at the moment, we are closing it (to be reopened if it reappears).
