Opened 11 years ago
Closed 11 years ago
#688 closed defect (worksforme)
Probably race condition in WHIZARD core build
Reported by: | Juergen Reuter | Owned by: | kilian |
---|---|---|---|
Priority: | P0 | Milestone: | v2.2.3 |
Component: | core | Version: | 2.2.2 |
Severity: | blocker | Keywords: | |
Cc: |
Description
On both AFS file systems and old scratches I get problems with fresh WHIZARD builds:
for src in particle_specifiers.f90 analysis.f90 pdg_arrays.f90 jets.f90 subevents.f90 variables.f90 expr_base.f90; do \
  /afs/desy.de/group/theorie/software/ELF64/bin/notangle -R[[$src]] ../../../src/noweb-frame/whizard-prelude.nw ../../../src/types/types.nw ../../../src/noweb-frame/whizard-postlude.nw | /afs/desy.de/group/theorie/software/ELF64/bin/cpif $src; \
done
for src in particle_specifiers.f90 analysis.f90 pdg_arrays.f90 jets.f90 subevents.f90 variables.f90 expr_base.f90; do \
  /afs/desy.de/group/theorie/software/ELF64/bin/notangle -R[[$src]] ../../../src/noweb-frame/whizard-prelude.nw ../../../src/types/types.nw ../../../src/noweb-frame/whizard-postlude.nw | /afs/desy.de/group/theorie/software/ELF64/bin/cpif $src; \
done
undefined chunk name: <<Expr base: procedures>>
undefined chunk name: <<Expr base: procedures>>
mv: cannot stat `types.tmp': No such file or directory
make[3]: *** [types.stamp] Error 1
make[3]: Leaving directory `/afs/desy.de/group/theorie/software/packages/whizard_extended/build/src/types'
make[2]: *** [expr_base.f90] Error 2
make[2]: *** Waiting for unfinished jobs....
This is most probably a race condition.
Change History (10)
comment:1 Changed 11 years ago by
comment:2 Changed 11 years ago by
Yep, exactly. But AFAIK Jenkins does make -j2; I did make -j. Running make again after the error then works without problems.
comment:3 Changed 11 years ago by
Does it always occur in that subdir? AFAICS there is no difference in logic compared to the other subdirs.
comment:4 Changed 11 years ago by
Looking at this,
types.stamp: $(PRELUDE) $(srcdir)/types.nw $(POSTLUDE)
	@rm -f types.tmp
	@touch types.tmp
	for src in $(libtypes_la_SOURCES); do \
	  $(NOTANGLE) -R[[$$src]] $^ | $(CPIF) $$src; \
	done
	@mv -f types.tmp types.stamp
$(libtypes_la_SOURCES): types.stamp
## Recover from the removal of $@
	@if test -f $@; then :; else \
	  rm -f types.stamp; \
	  $(MAKE) $(AM_MAKEFLAGS) types.stamp; \
	fi
the code was not designed with parallel make in mind. Just strange that it didn't bite us before. We have had this in whizard-core/Makefile for ages. I don't remember where we got it from, but it wasn't our design.
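One interleaving that would be consistent with the log in the description (the notangle loop printed twice, then `mv` failing on `types.tmp`) is two concurrent runs of the types.stamp recipe sharing the same temporary file. The following shell sketch replays that interleaving step by step in a scratch directory. It is only an illustration of the hypothesis, not a confirmed diagnosis; the labels "run A" and "run B" and the scratch directory are assumptions made for the example.

```shell
#!/bin/sh
# Replay one possible interleaving of two concurrent runs (A and B) of the
# types.stamp recipe. Hypothetical scenario; nothing here calls make.
set -u
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Run A: start of the recipe
rm -f types.tmp
touch types.tmp

# Run B: start of the recipe (removes and recreates the same temp file)
rm -f types.tmp
touch types.tmp

# ... both runs would execute the notangle loop here ...

# Run A: final step succeeds and moves the shared temp file away
mv -f types.tmp types.stamp
status_a=$?

# Run B: final step fails, because types.tmp no longer exists
mv -f types.tmp types.stamp 2>/dev/null
status_b=$?

echo "run A exit status: $status_a"
echo "run B exit status: $status_b"

cd / && rm -rf "$workdir"
```

In this replay run A finishes cleanly while run B's `mv` returns a nonzero status, which is the kind of failure make would report as an error in the types.stamp rule.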
Any idea? Is there a possibility to mark Makefile sections as critical, so they are executed serially?
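On the question of a critical section: one approach, which to my knowledge the Automake manual also describes for rules with several outputs, is to use `mkdir` as an atomic lock so that only the first process regenerates the files while the others wait. The sketch below shows the idea in plain shell, outside of any Makefile. The names `demo.lock`, `demo.stamp`, `out.f90`, `runs.log`, `regenerate` and `worker` are hypothetical and chosen only for this example.

```shell
#!/bin/sh
# Sketch: serialize a critical section across concurrent processes by using
# mkdir as an atomic lock. All file and function names are hypothetical.
set -u
workdir=$(mktemp -d)
cd "$workdir" || exit 1

regenerate() {
    # Stand-in for the notangle loop: log one generator run, take some time,
    # then produce the output file and the stamp.
    echo run >> runs.log
    sleep 1
    touch out.f90
    touch demo.stamp
}

worker() {
    # Nothing to do if the output already exists.
    test -f out.f90 && return 0
    if mkdir demo.lock 2>/dev/null; then
        # First process: it alone regenerates, then releases the lock.
        rm -f demo.stamp
        regenerate
        rmdir demo.lock
    else
        # Followers: wait until the first process is done,
        # then succeed only if it produced the stamp.
        while test -d demo.lock; do sleep 1; done
        test -f demo.stamp
    fi
}

# Start three workers at once, as a parallel make might.
worker & pid1=$!
worker & pid2=$!
worker & pid3=$!
wait "$pid1"; s1=$?
wait "$pid2"; s2=$?
wait "$pid3"; s3=$?

run_count=$(wc -l < runs.log | tr -d ' ')
echo "generator runs: $run_count"

cd / && rm -rf "$workdir"
```

The design relies on `mkdir` either creating the directory or failing, so at most one worker enters the regenerating branch at a time. A stale lock directory left behind by an interrupted build would have to be removed by hand, which is a known drawback of this scheme.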
comment:5 Changed 11 years ago by
Can this part of the Makefile manual help us:
.NOTPARALLEL
    If .NOTPARALLEL is mentioned as a target, then this invocation of make will be run serially, even if the ‘-j’ option is given. Any recursively invoked make command will still run recipes in parallel (unless its makefile also contains this target). Any prerequisites on this target are ignored.
This seems to be GNU make only, though. But maybe that would be an option: put the stamp stuff into a separate Makefile, Makefile.<web>.stamp, and then do something like:
generate_<web>_stamp:
	$(MAKE) -j1 -f Makefile.<web>.stamp
comment:6 Changed 11 years ago by
Not convinced, unfortunately.
The .NOTPARALLEL target applies to the whole Makefile, not just to a section, if I understand the description. And I don't think the 'stamp' idiom would do its job if it appears in a sub-make.
comment:7 follow-up: 9 Changed 11 years ago by
Funnily, after the instances this morning this never happened again, despite several attempts at complete recompilation. If this is a race condition, why did it never happen before? So what do we do about this ticket?
comment:8 Changed 11 years ago by
Couldn't we just replace this line
$(MAKE) $(AM_MAKEFLAGS) types.stamp;
by
$(MAKE) $(AM_MAKEFLAGS) -j1 types.stamp;
?
comment:9 Changed 11 years ago by
Replying to jr_reuter:
Funnily, after the instances this morning this never happened again, despite several attempts at complete recompilation. If this is a race condition, why did it never happen before?
Exactly the question I had. It may be particularly bad timing, e.g. checking types.stamp just between its deletion and re-creation.
Or, just by chance, could it be a wall-clock timing mismatch between the build machine and the AFS or file server? I had this some time ago, also with strange and unpredictable results.
So what do we do about this ticket?
If it can't be reproduced, I'd ignore it for the moment and proceed. But watch out.
comment:10 Changed 11 years ago by
Resolution: | → worksforme |
---|---|
Status: | new → closed |
As this doesn't bite at the moment, we are closing it (to be reopened if it reappears).
The notangle command should not be executed twice. But ... why didn't this happen before?
This was with make -j, right?