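Avoid atomic operations wherever you possibly can: their impact on performance is severe. In one case, for example, throughput dropped from 800 MBps to 110 MBps.
On a forum I came across a string of remarks that someone had relayed; I am recording them here: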
honestly, if you're trying to do this you're probably going down the wrong path, but general rules of thumb are:
- don't have multiple threads within a warp contending for a lock; that leads to all sorts of confusing issues for most people because inter-warp branches are not the same as intra-warp branches
- avoid global memory contention as much as possible (e.g., if you need to have a critical section among all warps in all CTAs, do per-CTA shared memory locks then a global lock)
- traditional threading primitives implemented with atomics are a pretty terrible idea; if you can avoid atomics as much as possible (or entirely) you can get a big perf win (and there are very interesting ways you can do this, and when I say big perf win, I mean on the order of 5-10x)
("well," you think, "it sounds like tim is speaking from experience!" oh yes, I am)
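
As a rough illustration of the second and third rules of thumb, here is a minimal CUDA sketch of one sum computed three ways: one global atomic per element, one global atomic per CTA after a shared-memory reduction, and no atomics at all. The kernel and helper names (`sum_atomic_per_element`, `sum_atomic_per_block`, `sum_no_atomics`, `block_reduce`, `run_sum`) are hypothetical and do not come from the quoted post. The sketch only checks that the three variants agree; it does not measure the speedups mentioned above, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per CTA; assumed to be a power of two

// Tree reduction inside one CTA using shared memory.
// After it returns, partial[0] holds the sum of the CTA's elements.
__device__ unsigned int block_reduce(const unsigned int* in, int n, unsigned int* partial) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0u;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    return partial[0];
}

// Variant 0: every thread hits the same global address with an atomic.
__global__ void sum_atomic_per_element(const unsigned int* in, int n, unsigned int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Variant 1: reduce per CTA in shared memory, then one global atomic per CTA.
__global__ void sum_atomic_per_block(const unsigned int* in, int n, unsigned int* out) {
    __shared__ unsigned int partial[kThreads];
    unsigned int s = block_reduce(in, n, partial);
    if (threadIdx.x == 0) atomicAdd(out, s);
}

// Variant 2: no atomics; each CTA writes its partial sum to its own slot.
__global__ void sum_no_atomics(const unsigned int* in, int n, unsigned int* block_sums) {
    __shared__ unsigned int partial[kThreads];
    unsigned int s = block_reduce(in, n, partial);
    if (threadIdx.x == 0) block_sums[blockIdx.x] = s;
}

// Hypothetical host helper: runs one variant on a non-empty input and returns the total.
unsigned int run_sum(int variant, const std::vector<unsigned int>& h) {
    int n = static_cast<int>(h.size());
    int blocks = (n + kThreads - 1) / kThreads;
    int out_count = (variant == 2) ? blocks : 1;  // variant 2 needs one slot per CTA

    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned int));
    cudaMalloc(&d_out, out_count * sizeof(unsigned int));
    cudaMemcpy(d_in, h.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, out_count * sizeof(unsigned int));

    if (variant == 0)      sum_atomic_per_element<<<blocks, kThreads>>>(d_in, n, d_out);
    else if (variant == 1) sum_atomic_per_block<<<blocks, kThreads>>>(d_in, n, d_out);
    else                   sum_no_atomics<<<blocks, kThreads>>>(d_in, n, d_out);

    // This blocking copy also waits for the kernel on the default stream.
    std::vector<unsigned int> result(out_count);
    cudaMemcpy(result.data(), d_out, out_count * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // For variant 2 the final pass over the per-CTA sums is done on the host.
    unsigned int total = 0;
    for (unsigned int v : result) total += v;
    return total;
}

int main() {
    std::vector<unsigned int> data(1 << 20, 1u);
    for (int variant = 0; variant < 3; ++variant)
        std::printf("variant %d: sum = %u\n", variant, run_sum(variant, data));
    return 0;
}
```

Unsigned integers are used so that all three variants give bit-identical results regardless of the order of additions. Timing each kernel (for example with CUDA events) on your own hardware is the only reliable way to see how large the gap between the variants is.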