Opened 14 years ago
Closed 10 years ago
#423 closed enhancement (wontfix)
Improve poor scaling of the parallelized helicity loop
Reported by: | kilian | Owned by: | speckner
---|---|---|---
Priority: | P3 | Milestone: | v2.3.0
Component: | core | Version: | 2.0.5
Severity: | normal | Keywords: |
Cc: | | |
Description
There are reportedly some processes for which the helicity loop does not scale at all with the number of threads (ask MT). Maybe there is still room for improvement.
Change History (12)
comment:1 Changed 14 years ago by
Could MT provide an example?
comment:2 Changed 14 years ago by
Example:
alias parton = u:U:d:D:g
alias lepton = e1:E1:e2:E2
process badscaling = e1, E1 => lepton, lepton, parton, parton, A, A
This is CS's example, but the bad scaling is confirmed by MT. There is almost no speedup in the helicity loop, although it accounts for ~70% of the run time according to gprof. The rest (as recently parallelized by MT) scales well.
comment:3 Changed 14 years ago by
Phew, I had hoped for something smaller. My suspicion is that this process is so big that the data of a single thread already fills the L3 cache. It would be interesting to estimate the size of the thread_local_data structure for this case. Cachegrind might also be able to tell us more, but I guess this is futile for processes of this size.
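As a back-of-the-envelope check, something like the following sketch can bound the per-thread amplitude storage. All counts here are assumptions for illustration (two helicity states per external leg, the flavor combinations implied by the aliases above, a guessed number of color flows, double-precision complex entries); the actual thread_local_data layout may differ considerably.

```python
# Rough, hypothetical estimate of per-thread amplitude storage for the
# 2 -> 6 process badscaling; not the actual thread_local_data layout.
n_legs = 8                 # e1, E1 -> lepton, lepton, parton, parton, A, A
n_hel = 2 ** n_legs        # two helicity states assumed per external leg
n_flv = 4 * 4 * 5 * 5      # lepton x lepton x parton x parton alias entries
n_col = 2                  # assumed number of color flows
bytes_per_amp = 16         # one double-precision complex number

per_thread = n_hel * n_flv * n_col * bytes_per_amp
print(f"~{per_thread / 2**20:.1f} MiB per thread")  # ~3.1 MiB per thread
```

With numbers of this order, two or three threads would indeed come close to filling a typical shared L3 cache of a few MiB, which is consistent with the suspicion above.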
comment:5 Changed 14 years ago by
Yes, definitely. It looks as if the OpenMP parallelization is pointless once a process no longer fits into the cache. This would affect ALL large processes. However, this is just speculation right now.
comment:6 Changed 14 years ago by
I agree. If the cores share a single L3 cache (which, to my knowledge, is usually the case), then any process which comes close to exhausting it will definitely not scale; the performance of the threaded loop will be bounded by memory transfers. However, I am also not sure that we are already hitting this limit. Does anyone have access to a true multiprocessor SMP system to test this hypothesis? If two threads run on physically different CPUs, then each gets its own L3 cache, and scaling should be substantially better (on Linux, taskset can be used to pin the threads to different CPUs).
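The experiment could look roughly like this (illustrative commands only; the binary name and input file are hypothetical, and the core numbering is machine-specific, so check lscpu or /proc/cpuinfo first to see which core IDs sit on which physical package):

```shell
# Two threads pinned to the same physical CPU (shared L3):
OMP_NUM_THREADS=2 taskset -c 0,1 ./whizard badscaling.sin

# Two threads pinned to cores on different physical CPUs (separate L3 caches):
OMP_NUM_THREADS=2 taskset -c 0,8 ./whizard badscaling.sin
```

If the cache hypothesis holds, the second run should show noticeably better speedup than the first.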
comment:7 Changed 14 years ago by
Owner: changed from trudewind to speckner
comment:8 Changed 14 years ago by
Milestone: v2.0.6 → v2.0.7
comment:9 Changed 13 years ago by
What did we actually agree on regarding how to proceed with this? Further investigation? Who will do it, and when?
comment:10 Changed 13 years ago by
MT told me yesterday that he wants to run more tests on that, once MPI is working.
comment:11 Changed 13 years ago by
Milestone: v2.1.0 → v2.2.0
I believe there has been no follow-up on this one either...
comment:12 Changed 10 years ago by
Resolution: → wontfix
Status: new → closed
At the moment, this is no longer relevant.