ok, i realize i may have got slightly carried away in the argument with Exophase - i see now he was arguing from the POV of 'general general' computing, while i was talking of the 'GP GPU' computing (where tasks are still GPU-friendly, just don't produce fancy pictures on screen). of course a cpu SIMD/DSP would be better when used for its strengths (higher clocks, coupled with good cache-ability/predictable access patterns, etc), and the GPU - for its massive-multi-threading strengths, and efficiency will always be high, everybody will be happy, etc*
There's a serious question as to what "concurrency" means here, and I think that ImgTech has been using creative language to try to misrepresent what this means. Because I think that what it really means is that four threads exist simultaneously and can be switched between with zero overhead, not that they EXECUTE simultaneously. By having two execution pipelines that clearly suggests that it can do 1 operation per cycle per USSE, not 4. Meaning 2 FMACs per cycle, and "SIMD" only refers to 4-way 8bit or 2-way 16bit, which is nice for color effects but useless for floating point operations (ie, no single cycle dot product).
well, a SIMD does not need to be fed from the same context - it just needs to perform the same op over the same data types, that's all. where the data comes from is of no concern to the SIMD unit, and in this case i believe SGX feeds its two shader units from (up to) 4 thread contexts, from a pool of 16. if that's the case, it'd be fair to say that you have (up to) '4 concurrent threads in flight' at each and every moment - it's not just PR talk.
* though, it'd be curious to note that in practice, the OMAP3 seems to present one 'conflict of interests' between DSP and GPU - apparently TI are relying on their DSP for their video codec needs, which the SGX is, at least on paper, capable of handling too, but is shunned by TI. anyway, that's just a musing on the topic at hand.
Let's find out if I understand you correctly. There are 16 contexts, and the scheduler picks 4 and passes them to the 2 shader units as 2 pairs.
If that were so, each shader unit should be able to process 2 floating point variables at a time, so there should be 64-bit SIMD registers...
Or am I misunderstanding something?
well, it depends on how you interpret the '4 concurrent threads' - as 'constantly 4' or as 'up to 4'. i'd assume it depends on the data types a SIMD unit is crunching on - if it's 8bit scalars then it can do up to 4 of those per shader unit, 8 altogether. if it's sp floats then, yes - that's as many as 1 per unit, 2 in total. apparently i'm trying to make sense of intel's datasheet here.
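to put numbers on that reading, here's a quick sketch of the arithmetic. it assumes each shader unit has a 32-bit wide SIMD datapath, which is only my guess from the '4-way 8bit or 2-way 16bit' description above, not something taken from a datasheet; the names (DATAPATH_BITS, lanes_per_unit, etc) are made up for illustration.

```python
# ASSUMPTION: 32-bit wide SIMD datapath per shader unit (a guess, not a datasheet figure)
DATAPATH_BITS = 32
# two shader units, as discussed in the thread
SHADER_UNITS = 2


def lanes_per_unit(element_bits):
    """How many elements of the given width fit in one unit's datapath."""
    return DATAPATH_BITS // element_bits


def lanes_total(element_bits):
    """Elements processed per op across both shader units."""
    return SHADER_UNITS * lanes_per_unit(element_bits)


for name, bits in (("8bit scalar", 8), ("16bit scalar", 16), ("sp float", 32)):
    print(f"{name}: {lanes_per_unit(bits)} per unit, {lanes_total(bits)} in total")
```

under that assumption the 'up to 4' only shows up for 8bit data, and for sp floats you're down to 1 per unit, so no 64-bit registers are needed for this reading to hold.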