Exophase wrote:
dmdm's further post here:
http://www.gp32x.com/board/index.php?sh ... 77&st=113#
Suggests 1 operation per cycle, not 4 (which is what you guys are saying, and of course what the referenced materials imply, but they're quite vague on details). Hence my questioning of what "concurrent" really means. Also, I'm pretty skeptical that "SIMD" means anything more than 4-way 32bit operations that execute on a single unit as a single operation. Although I realize the term has been misused already by other GPUs I really doubt that it means more than this.
He also says that even more than one FP op can be done per cycle, but then goes on to use an FMAC as an example of this, so I doubt that this has any deep meaning other than the usual fused mul/add unit.
ok, frankly, i'm not sure what i'm saying anymore (and for sure never claimed it can perform 4 32bit ops/clock ; )
the whole 'scatter/gather SIMD over 4 threads' point i've been promoting so far is clearly a no-go for floats (as maciek has been trying to hammer down my cortex : ), i.e. something like that may still be used for the lower-width data types, but apparently does not help for floats. so far everything indicates that the fp/clock powers of the SGX are not great - namely, 1 op per clock per shader unit. apropos, something has been bothering me since the beginning in that SCH datasheet, namely (9.1.1 3D Core Key Features): '— Vertex Rate: One Triangle 15 clocks (Transform Only)'.
so here the 'one triangle' is the typical PR speak used to imply a 1 vertex/tri ratio, which is ok. what is bothering me is the quote of 15 clocks/vertex (transform only). now, 'transform only' clearly implies a sole MVP (model-view-projection) transform operation. now, if we assume MAD (multiply-add) ops would be used to do that (and the shader units are entirely devoted to vertex work), that is 4x 4-wide MADs per vertex transformed, swizzles notwithstanding (then again, we have 1 float/register, there's nothing to swizzle there). or IOW, these are 16 scalar ops (ok, most/all of them fused, but that's not the point), carried out in.. 15 clocks!? does anybody else find the ratio of ~1 op/clock rather bothersome? i do. let me explain why: because that means that only one shader unit can work on one thread at a time.
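to put numbers on that, here is a rough back-of-envelope sketch (python, purely for illustration). the assumptions are mine, not the datasheet's: 'transform only' is a single 4x4 MVP matrix times a vec4, done as 4x 4-wide MADs with 1 float/register, and the first MUL is counted as a MAD. `transform_vertex` and `CLOCKS_PER_VERTEX` are made-up names, and nothing here models how the SGX actually schedules work.

```python
# back-of-envelope check of the datasheet figure; a sketch, not a model of SGX.
# assumption: one 'transform only' vertex = one 4x4 MVP matrix times a vec4,
# issued as 4 MADs that are each 4 lanes wide (the first MUL counted as a MAD).

def transform_vertex(m, v):
    """Return (m * v, number of scalar MAD ops used)."""
    out = [0.0, 0.0, 0.0, 0.0]
    scalar_ops = 0
    for j in range(4):          # one 4-wide MAD per matrix column
        for i in range(4):      # 4 scalar lanes, 1 float/register
            out[i] = m[i][j] * v[j] + out[i]
            scalar_ops += 1
    return out, scalar_ops

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

result, ops = transform_vertex(IDENTITY, [1.0, 2.0, 3.0, 1.0])

CLOCKS_PER_VERTEX = 15          # from the SCH datasheet quote (transform only)
ops_per_clock = ops / CLOCKS_PER_VERTEX

print(ops, round(ops_per_clock, 2))   # prints: 16 1.07
```

so under those assumptions it comes out at 16 scalar MADs in 15 clocks, i.e. roughly one scalar op per clock, which is the ratio i'm complaining about.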
what follows from that? well, one of the following two things:
a) either SCH's SGX has one (1) shader unit in total (in which case i cannot see how threads can execute concurrently, at all), or
b) there's no 'scattering SIMD' in SGX, the way i've been expecting it, so multiple registers from one context cannot be dispatched to multiple shader units at a time (in the example: there is only one shader working on the MVP transform at any time)
either way, there's something rotten in datasheetland. oh, well.