
Opened 14 years ago

Closed 10 years ago

#423 closed enhancement (wontfix)

Improve poor scaling of the parallelized helicity loop

Reported by: kilian
Owned by: speckner
Priority: P3
Milestone: v2.3.0
Component: core
Version: 2.0.5
Severity: normal
Keywords:
Cc:

Description

There are reportedly some processes for which the helicity loop does not scale at all with the number of threads (ask MT). Maybe there is still room for improvement.

Change History (12)

comment:1 Changed 14 years ago by Christian Speckner

Could MT provide an example?

comment:2 Changed 14 years ago by kilian

Example:

alias parton = u:U:d:D:g
alias lepton = e1:E1:e2:E2
process badscaling = e1, E1 => lepton, lepton, parton, parton, A, A

This is CS's example, but the bad scaling is confirmed by MT. There is almost no speedup in the helicity loop, although it accounts for ~70% of the run time according to gprof. The rest (as recently parallelized by MT) scales well.
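
For reference, a rough scaling check could be done by timing the same run with different OpenMP thread counts. This is only a sketch, assuming WHIZARD was built with OpenMP support and the SINDARIN code above (completed with the usual beam energy and integrate command) is saved as badscaling.sin, a placeholder name:

export OMP_NUM_THREADS=1
time whizard badscaling.sin
export OMP_NUM_THREADS=4
time whizard badscaling.sin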

comment:3 Changed 14 years ago by Christian Speckner

Phew, I had hoped for something smaller. My suspicion is that this process is so big that the data of a single thread already fills the L3 cache. It would be interesting to estimate the size of the thread_local_data structure for this case. Cachegrind might also be able to tell us more, but I guess this is futile for processes of this size.
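
For the record, a cache profile could in principle be obtained with Cachegrind, along these lines (badscaling.sin is only a placeholder; as noted above, the slowdown under Valgrind may well make this impractical for a process of this size):

valgrind --tool=cachegrind whizard badscaling.sin
cg_annotate cachegrind.out.<pid>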

comment:4 Changed 14 years ago by Juergen Reuter

So how shall we proceed here? Is it worth looking into it?

comment:5 Changed 14 years ago by kilian

Yes, definitely. It looks as if the OpenMP parallelization is pointless whenever a process does not fit into the cache. This would affect ALL large processes. However, this is just speculation right now.

comment:6 Changed 14 years ago by Christian Speckner

I agree. If the cores share a single L3 cache (which, to my knowledge, is usually the case), then any process which comes close to exhausting it will definitely not scale; the performance of the threaded loop will be bounded by memory transfers. However, I am also not sure that we are already hitting this limit. Does anyone have access to a true multiprocessor SMP system to test this hypothesis? If two threads run on physically different CPUs, then each gets its own L3 cache, and scaling should be substantially better (taskset can be used on Linux to pin the threads to different CPUs).
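
A sketch of such a test (core numbers are machine-dependent; here cores 0 and 1 are assumed to sit on the first CPU and core 8 on the second, and badscaling.sin is again a placeholder):

# both threads on the same CPU, sharing one L3 cache
OMP_NUM_THREADS=2 taskset -c 0,1 whizard badscaling.sin
# one thread per CPU, each with its own L3 cache
OMP_NUM_THREADS=2 taskset -c 0,8 whizard badscaling.sin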

comment:7 Changed 14 years ago by ohl

Owner: changed from trudewind to speckner

comment:8 Changed 14 years ago by Juergen Reuter

Milestone: changed from v2.0.6 to v2.0.7

comment:9 Changed 13 years ago by Juergen Reuter

What did we actually agree on for how to proceed with this? Further investigation? Who will do it? When?

comment:10 Changed 13 years ago by kilian

MT told me yesterday that he wants to run more tests on that, once MPI is working.

comment:11 Changed 13 years ago by Juergen Reuter

Milestone: changed from v2.1.0 to v2.2.0

I believe there has been no follow-up on this one either...

comment:12 Changed 10 years ago by Juergen Reuter

Resolution: wontfix
Status: changed from new to closed

At the moment this is not relevant any more.
