Direct (close-to-the-metal) open-source SGX driver

All times are UTC


Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 207 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6 ... 11  Next
Author Message
 PostPosted: Tue Oct 07, 2008 3:59 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
Exophase wrote:
Yes it is a lie, it's ImgTech's lie. A lie that you weren't aware of, apparently.

You seem rather quick to jump to conclusions... maybe they calculate it for mono textures? :) Remember, we're talking marketing figures, and as I wrote before, they should be taken with a grain of salt.
But it doesn't mean that you have to be rude... but maybe it's just my good intentions coming out. Apparently not all of us have those...

Exophase wrote:
http://www.gp32x.com/board/index.php?showtopic=42277&st=75&p=617065&#entry617065

That's a great link, thx. If you had posted this earlier, I wouldn't have made fun of you. :)

Exophase wrote:
so please drop this nonsense about me building "guesswork on guesswork", you're being all kinds of cocky.

Yes I am. And proud of it. :D

If somebody wants to use the DSP for doing graphics - it's their right. But if it were so optimal - there would be no need to put an SGX into OMAP, right?
Let's use everything to the maximum of its capabilities, meaning both the DSP and the SGX, and then apps should shine. If it turns out that the DSP is better for graphics - let's use it.

But for heaven's sake - let's be positive about the whole process. It's not life-or-death, it's grown-ups playing with their new toy. An in-your-face attitude and rude behavior are what deter everybody from doing anything.

And if somebody is wondering what the hell I'm doing all the time on the Pandora Forum - I'm sick, in bed, antibiotics in my veins, so... :wacko:


 PostPosted: Tue Oct 07, 2008 4:05 pm   
Guru

Joined: Sat Oct 04, 2008 9:20 pm
Posts: 170
There's a serious question as to what "concurrency" means here, and I think that ImgTech has been using creative language to misrepresent it. I think what it really means is that four threads exist simultaneously and can be switched between with zero overhead, not that they EXECUTE simultaneously. Having two execution pipelines clearly suggests that it can do 1 operation per cycle per USSE, not 4. That means 2 FMACs per cycle, and "SIMD" only refers to 4-way 8bit or 2-way 16bit, which is nice for color effects but useless for floating point operations (i.e., no single-cycle dot product).
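
To make the "4-way 8bit" reading concrete, here is a minimal Python sketch of a packed add on one 32-bit word. It only illustrates the general packed-lane idea: the function names are made up for this post, and nothing here is taken from SGX documentation.

```python
MASK32 = 0xFFFFFFFF

def pack4x8(lanes):
    """Pack four 8-bit values (lane 0 = lowest byte) into one 32-bit word."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word

def unpack4x8(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def add4x8(a, b):
    """Lane-wise wrapping add of four 8-bit lanes in a single 32-bit operation.

    The low 7 bits of every lane are added first, so no carry can cross a
    lane boundary; the top bit of each lane is then patched in with XOR.
    """
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return (low ^ ((a ^ b) & 0x80808080)) & MASK32

a = pack4x8([200, 100, 255, 1])
b = pack4x8([100, 100, 1, 1])
result = unpack4x8(add4x8(a, b))  # each lane wraps around independently
```

One 32-bit operation handles four color channels at once, which is why this kind of SIMD helps color math but does nothing for a float that already fills the whole word.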


 PostPosted: Tue Oct 07, 2008 4:11 pm   

Joined: Fri Oct 03, 2008 8:43 pm
Posts: 235
I hope your project goes well, the community will be grateful. Get well soon, too :)


 PostPosted: Tue Oct 07, 2008 4:13 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
blu wrote:
well, my version is a bit different.

the document mentions 'two pipelines' which i believe constitute:

2 shader units, 2 tmu's, 2 ROPs, configured as 1x shader + 1x tmu + 1x ROP per 'pipeline'.

those 2 shader units are SIMD, and can be fed from 4 thread contexts, ergo the mentioning of '4 concurrent threads' (9.1.2 Shading Engine Key Features), which threads are picked from a queue of 16 threads by the scheduler, ergo the 16 hw contexts maintained, each with 128 (32bit) registers, amounting to 2048 registers in total.

the 1.2 gpixel/s is, as already mentioned by Exophase, a result of factorization by 3 to account for TBDR's efficiency at overdraw (3 is a reasonable overdraw to consider), but that brings nothing to the discussion at hand.


Let's find out if I understand you correctly. There are 16 contexts, and the scheduler picks 4 and passes them to the 2 shader units as 2 pairs.

If that were so, each shader unit should be able to process 2 floating point values at a time, so there should be 64-bit SIMD registers...

Or am I misunderstanding something?
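
The numbers quoted above can be cross-checked with a few lines of Python. This is only a sketch: the 16 contexts, 128 registers, overdraw factor of 3 and 1.2 gpixel/s come from blu's post, the 200MHz clock is the upper figure mentioned elsewhere in this thread, and one pixel per pipeline per clock is my assumption.

```python
# Register file size implied by the quoted figures.
contexts = 16
regs_per_context = 128
total_regs = contexts * regs_per_context

# Marketing fill rate with the assumed overdraw factor removed.
marketing_fill = 1_200_000_000   # pixels per second, as quoted
overdraw_factor = 3
raw_fill = marketing_fill // overdraw_factor

# Assumption: 200 MHz clock, taken from the upper figure quoted in this thread.
clock_hz = 200_000_000
pixels_per_clock = raw_fill // clock_hz

print(total_regs, raw_fill, pixels_per_clock)
```

Under those assumptions the raw rate works out to two pixels per clock, which would fit the 'two pipelines' reading, but it proves nothing about how the threads are mapped onto them.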


 PostPosted: Tue Oct 07, 2008 4:13 pm   

Joined: Mon Oct 06, 2008 2:19 pm
Posts: 67
Exophase wrote:
Orders of magnitude? Would you like to tell me what a 32bit SIMD scalar processor can do in one cycle that's ORDERS OF MAGNITUDE more efficient than 2-way FPU SIMD that NEON can do?

what does it matter what it can do in one cycle? it cannot do zilch if it cannot feed those ALUs. for tens or hundreds of cycles, at a time.

Quote:
Yes, the hardware threads hides latency but there are other mechanisms to hide latency even on CPUs, such as prefetching. They just take more work.

'just more work'? please, be reasonable. also, feel free to show me an equally-clocked, equally configured cpu SIMD unit and a GPU, that don't show an order of magnitude discrepancy in performance, even before considering GPU-specific hw like tmu's (though according to you, those should do fine with manual 'prefetching', right?)

Quote:
On the other hand, the CPU has 256KB of L2 cache that is quite fast, although we don't know how much cache the GPU has.

true, we don't. we just know the GPU has its own caching (9.1.6 Multi Level Cache)

Quote:
It's much more flexible. Go read the documentation. Can issue 8 execution units per cycle, which basically addresses 4 units with 2x redundancy each, but can do many similar ALU operations over nearly all of them. And of course they also have prefetching and decent caches. I'd like to know why you think the USSEs have such an amazing instruction set. 16-way threads are nice, but the USSEs were obviously made to be scaled, with the SGX 530/535 being roll out parts. The newer SGX's already have more USSEs.

*takes note to read the documentation of an amazing DSP capable of beating a GPU at its own turf*

Quote:
OR it could be that the highest end PowerVR chip available has shaders anyway, or it could be that you can't do pixel shading using a DSP... take your pick? Your sentence suggests that the SGX is only good for its shaders. I entirely expect that some people will use NEON for transformation and lighting, especially if they want to maximize pixel shader computational throughput.

sorry, a DSP can't do pixel shaders now? why not?

ok, i've got to go to lunch now, will be back in an hour to continue with this amusing discussion.


 PostPosted: Tue Oct 07, 2008 4:29 pm   
Guru

Joined: Sat Oct 04, 2008 9:20 pm
Posts: 170
... okay, this is getting frustrating. Will people please stop misconstruing every last thing I say here?

I am NOT trying to say that the DSP is better than the SGX. I'm saying that the DSP is better at doing GENERAL PURPOSE operations than the SGX is. When I say that a DSP can't do pixel shaders I meant that it can't do them while still having the SGX rasterize said pixels, this is for obvious reasons. Once again, of course I am not suggesting to use the DSP to do 3D graphics instead of the SGX. I am suggesting to do it instead of using the USSEs on the SGX to perform general purpose operations. Do you get it now? No, I'm not going to show how an equally clocked CPU + SIMD unit can beat a GPU at 3D graphics, but that's not what we're talking about. We're talking about NEON @ up to 900MHz which is 64bit FPU SIMD (with 128bit registers, btw) vs USSEs x2 at up to 200MHz which can do 32bit SIMD or scalar FPU. Or an 8-way VLIW DSP at 430MHz. Please tell me that you understand that I'm not going to so easily believe that symmetric multi-threading alone is going to push the former over the latter, especially given the NEON unit's provisions for prefetching, streaming and zero load-use penalty.

Yes, I understand that hiding latency is important, but I think it's being heavily exaggerated here. Furthermore, so is the notion that an L2 cache miss will necessarily take thousands of cycles on a device with these clocks.


 PostPosted: Tue Oct 07, 2008 5:03 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
Exophase wrote:
I am NOT trying to say that the DSP is better than the SGX. I'm saying that the DSP is better at doing GENERAL PURPOSE operations than the SGX is.

I agree. :)

Exophase wrote:
I am suggesting to do it instead of using the USSEs on the SGX to perform general purpose operations.

If somebody thought I was suggesting this - I'm sorry. This was not the point - I want to use the SGX for graphics-related tasks only. On the other hand - it can be done, but it probably would not be as efficient as the DSP.

Exophase wrote:
Yes, I understand that hiding latency is important, but I think it's being heavily exaggerated here. Furthermore, so is the notion that an L2 cache miss will necessarily take thousands of cycles on a device with these clocks.

Hiding latency is the sole purpose of threading in GFX HW.
For example - on the GMA945 (drivers freely available for Linux) a cache miss on a texture access means a penalty of 600+ clocks. If, because of excessive register usage, the number of spawned threads is halved (each execution unit is busy, but it's waiting on samplers, memory accesses etc. and cannot switch to the next thread) - the performance is almost halved. And yes - I'm sure of it, I've seen it, and fixed it on some occasions.
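
A back-of-the-envelope model shows why halving the thread count can halve performance. This is a deliberately crude sketch, not a description of any real scheduler: it assumes zero-cost thread switching and that every thread alternates a fixed burst of ALU work with a fixed stall. The 600-cycle stall is the figure above; the 50-cycle burst and the thread counts are made up.

```python
def alu_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the ALU is busy in this toy model.

    Each thread keeps the ALU busy for compute_cycles, then waits
    stall_cycles; other threads fill the gap until the ALU saturates.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)

full = alu_utilization(threads=8, compute_cycles=50, stall_cycles=600)
halved = alu_utilization(threads=4, compute_cycles=50, stall_cycles=600)
saturated = alu_utilization(threads=16, compute_cycles=50, stall_cycles=600)

print(f"8 threads: {full:.2f}, 4 threads: {halved:.2f}, 16 threads: {saturated:.2f}")
```

As long as the ALU is not saturated, utilization scales linearly with the number of resident threads, so losing half of them to register pressure costs about half the throughput.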

So if everything is explained, the water is calm again, and this is a thread about reverse-engineering SGX - let's get back to the point.

...in other words - let the reverse-engineering commence. :wink:


 PostPosted: Tue Oct 07, 2008 5:39 pm   

Joined: Fri Oct 03, 2008 2:09 am
Posts: 54
in light of your last sentence there... do you have a beagle board?

@exophase well, someone had to fully understand the hardware, and maybe even more, to write all these emulators ... and surely the documentation is closed on many if not nearly all of them


 PostPosted: Tue Oct 07, 2008 5:49 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
cb88 wrote:
in light of your last sentence there... do you have a beagle board?


Nope, I don't. :( But buying one now would be an overkill - I found no b.b. dealer in Poland and I'd have to import it... I think I'll wait for my Pandora to arrive.

BTW - Is there an OGL ES driver in the Linux distro for the beagle board?


 PostPosted: Tue Oct 07, 2008 7:38 pm   

Joined: Mon Oct 06, 2008 11:26 pm
Posts: 2
Quote:
BTW - Is there an OGL ES driver in the Linux distro for the beagle board


What Linux distro ??? there is just the Ångström... and that doesn't have any OGL driver :(


 PostPosted: Tue Oct 07, 2008 7:44 pm   
Guru

Joined: Sat Oct 04, 2008 9:20 pm
Posts: 170
cb88 wrote:
@exophase well, someone had to fully understand the hardware, and maybe even more, to write all these emulators ... and surely the documentation is closed on many if not nearly all of them


Actually, most console information is derived from leaked documents. And for anything older than PS2 (and GBA and DS as well) games interfaced directly with the hardware. This has the consequences of: a) hardware has simpler interfaces b) many many games exist which you can reverse engineer instead of having to reverse engineer the hardware alone c) good documentation had to exist.

Besides, highly in-depth console documentation that does involve reverse engineering takes years to write and involves many people, and then people write emulators based on this information. And this is still after a great wealth of information existed due to developers having had that information given to them by the company who made the console. Sometimes the two groups overlap, but most emulator authors are not finding out about a console themselves.

It's really nothing like reverse engineering a modern video card.

But you might have missed the post where I said that I wasn't discouraging him and hope that he succeeds.


 PostPosted: Tue Oct 07, 2008 8:05 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
Alphacore wrote:
What Linux distro ??? there is just the Ångström... and that doesn't have any OGL driver :(


I've heard that Ångström will be pre-installed on Pandora... Does this mean that there will be no OpenGL ES SGX driver (not even a closed-source one) in Pandora?


 PostPosted: Tue Oct 07, 2008 8:22 pm   

Joined: Mon Oct 06, 2008 2:19 pm
Posts: 67
ok, i realize i may have got slightly carried away in the argument with Exophase - i see now he was arguing from the POV of 'general general' computing, while i was talking of the 'GP GPU' computing (where tasks are still GPU-friendly, just don't produce fancy pictures on screen). of course a cpu SIMD/DSP would be better when used for its strengths (higher clocks, coupled with good cache-ability/predictable access patterns, etc), and the GPU - for its massive-multi-threading strengths, and efficiency will always be high, everybody will be happy, etc*

Exophase wrote:
There's a serious question as to what "concurrency" means here, and I think that ImgTech has been using creative language to misrepresent it. I think what it really means is that four threads exist simultaneously and can be switched between with zero overhead, not that they EXECUTE simultaneously. Having two execution pipelines clearly suggests that it can do 1 operation per cycle per USSE, not 4. That means 2 FMACs per cycle, and "SIMD" only refers to 4-way 8bit or 2-way 16bit, which is nice for color effects but useless for floating point operations (i.e., no single-cycle dot product).

well, a SIMD does not need to be fed from the same context - it just needs to perform the same op over the same data types, that's all. where the data comes from is of no concern to the SIMD unit, and in this case i believe SGX feeds its two shader units from (up to) 4 thread contexts, from a pool of 16. if that's the case, it'd be fair to say that you have (up to) '4 concurrent threads in flight' at each and every moment - it's not just PR talk.

* though, it'd be curious to note that in practice, the OMAP3 seems to present one 'conflict of interests' between DSP and GPU - apparently TI are relying on their DSP for their video codec needs, which the SGX is, at least on paper, capable of handling too, but is shunned by TI. anyway, that's just an aside to the topic at hand.

maciek_urbanski wrote:
Let's find out if I understand you correctly. There are 16 contexts, and the scheduler picks 4 and passes them to the 2 shader units as 2 pairs.

If that were so, each shader unit should be able to process 2 floating point values at a time, so there should be 64-bit SIMD registers...

Or am I misunderstanding something?

well, it depends how you interpret the '4 concurrent threads' - as 'constantly 4' or as 'up to 4'. i'd assume it depends on the data types a SIMD unit is crunching on - if it's 8bit scalars then it can do up to 4 of those per shader unit, 8 altogether. if it's sp floats then, yes - that's as many as 1 per unit, 2 in total. apparently i'm trying to make sense of intel's datasheet here.


Last edited by blu on Wed Oct 08, 2008 3:29 am, edited 1 time in total.

 PostPosted: Tue Oct 07, 2008 8:40 pm   

Joined: Fri Oct 03, 2008 2:09 am
Posts: 54
although i don't think the beagle board has a released driver there is one... see the tech demos

also the pandora devs have said that the pandora will ship with ogl 2.0 es at least


 PostPosted: Tue Oct 07, 2008 9:00 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
blu wrote:
maciek_urbanski wrote:
Let's find out if I understand you correctly. There are 16 contexts, and the scheduler picks 4 and passes them to the 2 shader units as 2 pairs.

If that were so, each shader unit should be able to process 2 floating point values at a time, so there should be 64-bit SIMD registers...

Or am I misunderstanding something?

well, it depends how you interpret the '4 concurrent threads' - as 'constantly 4' or as 'up to 4'. i'd assume it depends on the data types a SIMD unit is crunching on - if it's 8bit scalars then it can do up to 4 of those per shader unit. if it's sp floats then, yes - that's as many as 2. apparently i'm trying to make sense of intel's datasheet here.


If you find any hole in my logic - it would be great.

So - here it goes:

Fill rate is limited by computational performance, parallelism, but primarily by interface. So my guess is that 'pixel coprocessor' here(link) has only 64-bit bus to shader units - hence the two-pixel limitation (I'm still betting that everything that shader outputs has a float type).

The SCH doc states explicitly that the SIMD registers are 32 bits wide, and that there are 128 of them per thread.

There is a problem with 'threads', because every GPU company defines them differently. In most Intel docs I've read (about G945) a thread had the CPU-world meaning (a program with its context).

So if they write that 4 concurrent threads can be executed, I understand that 4 threads (from a pool of 16) are executing at any given time, each being a code block using a separate pool of 128 32-bit registers. Each of those registers can be treated either as a float, two 16-bit words, or four bytes.
If any of those 4 HW units executes an instruction that takes a long time (memory access, texture sampling, synchronization), it will switch its internal context to one of the 12 waiting threads. This way the HW will be busy most of the time.

But maybe I'm channeling G945 design here, because I've studied docs Intel provided for quite some time...
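
To show what I mean by one 32-bit register being treated as a float, two 16-bit words, or four bytes, here is a small Python sketch using the standard struct module. It is only an illustration: the little-endian layout and the register_views name are my assumptions, and it says nothing about how SGX actually orders its lanes.

```python
import struct

def register_views(raw):
    """Reinterpret the same 4 raw bytes as one float, two words, or four bytes."""
    if len(raw) != 4:
        raise ValueError("a 32-bit register holds exactly 4 bytes")
    return {
        "float32": struct.unpack("<f", raw)[0],   # one single-precision float
        "u16x2": struct.unpack("<2H", raw),       # two unsigned 16-bit words
        "u8x4": struct.unpack("<4B", raw),        # four unsigned bytes
    }

raw = struct.pack("<f", 1.0)   # bit pattern 0x3F800000
views = register_views(raw)
print(views)
```

The bits never change, only the way the ALU slices them, so a float occupies the whole register while bytes give four lanes per operation.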


 PostPosted: Tue Oct 07, 2008 9:03 pm   

Joined: Thu Oct 02, 2008 9:02 pm
Posts: 53
Location: Los Angeles
less time arguing, more time coding. chop chop!! :P


 PostPosted: Tue Oct 07, 2008 9:04 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
cb88 wrote:
although i don't think the beagle board has a released driver there is one... see the tech demos

Yup, I've seen those. It is said that this driver will have an open-source kernel module and a closed-source binary user-space module... (mixed-mode driver... exactly like Vista :lol:)
But I cannot find a download anywhere. There's no sign of this driver in the -omap1 branch of the Linux kernel...

If someone knows where to get it from - please let me know.


 PostPosted: Tue Oct 07, 2008 10:01 pm   

Joined: Mon Oct 06, 2008 2:19 pm
Posts: 67
maciek_urbanski wrote:
There is a problem with 'threads', because every GPU company defines them differently. In most Intel docs I've read (about G945) a thread had the CPU-world meaning (a program with its context).

So if they write that 4 concurrent threads can be executed, I understand that 4 threads (from a pool of 16) are executing at any given time, each being a code block using a separate pool of 128 32-bit registers. Each of those registers can be treated either as a float, two 16-bit words, or four bytes.
If any of those 4 HW units executes an instruction that takes a long time (memory access, texture sampling, synchronization), it will switch its internal context to one of the 12 waiting threads. This way the HW will be busy most of the time.

well, my understanding does not differ from yours, with one remark - SIMD units don't need to be bound each to a single context - that would be sheer under-utilization. as all those 16 contexts are actually present 'on-board', they are all 'zero-switch' ones - i.e. whatever combination of them the scheduler decides to feed to the SIMDs - it's all good - there are no penalties. generally, all it takes for a combination of contexts to make it together for execution is:
a) they all have the same ALU (u)op pending
b) the operands of that op are all of the same data type

if the above can be dubbed as 'gathering SIMD ops', the same holds true for 'scattering SIMD ops', where the ALU op from one context gets spread across multiple shader units.
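
as a toy illustration of the 'gathering' rule above (same pending op, same operand type), here is a python sketch that bundles ready contexts for issue. everything in it is hypothetical - the Context type, the lane count of 4 and the grouping policy are made up for this post and are not a description of the real SGX scheduler.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Context:
    ident: int   # which of the resident hw contexts this is
    op: str      # pending ALU op, e.g. "fmad"
    dtype: str   # operand type of that op, e.g. "f32" or "u8"

def gather(contexts, lanes=4):
    """Group contexts by (op, dtype), then cut each group into bundles of up to `lanes`."""
    groups = defaultdict(list)
    for ctx in contexts:
        groups[(ctx.op, ctx.dtype)].append(ctx.ident)
    bundles = []
    for key, idents in groups.items():
        for start in range(0, len(idents), lanes):
            bundles.append((key, idents[start:start + lanes]))
    return bundles

ready = [
    Context(0, "add", "u8"),
    Context(1, "add", "u8"),
    Context(2, "fmad", "f32"),
    Context(3, "add", "u8"),
    Context(4, "add", "f32"),
]
bundles = gather(ready)
for key, idents in bundles:
    print(key, idents)
```

contexts 0, 1 and 3 can share one issue slot because both conditions hold, while the two f32 contexts cannot be merged with them or with each other since their ops differ.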

phrosty wrote:
less time arguing, more time coding. chop chop!! :P

i'd gladly get down and dirty if you help me get my pandora ahead of schedule :P


 PostPosted: Tue Oct 07, 2008 10:27 pm   
Guru

Joined: Sat Oct 04, 2008 9:20 pm
Posts: 170
dmdm's further post here:

http://www.gp32x.com/board/index.php?sh ... 77&st=113#

Suggests 1 operation per cycle, not 4 (which is what you guys are saying, and of course what the referenced materials imply, but they're quite vague on details). Hence my questioning of what "concurrent" really means. Also, I'm pretty skeptical that "SIMD" means anything more than 4-way 32bit operations that execute on a single unit as a single operation. Although I realize the term has been misused already by other GPUs I really doubt that it means more than this.

He also says that even more than one FP op can be done per cycle, but then goes on to use an FMAC as an example of this, so I doubt that this has any deep meaning other than the usual fused mul/add unit.


 PostPosted: Tue Oct 07, 2008 11:23 pm   
Guru

Joined: Fri Oct 03, 2008 5:54 pm
Posts: 85
blu wrote:
well, my understanding does not differ from yours, with one remark - SIMD units don't need to be bound each to a single context - that would be sheer under-utilization. as all those 16 contexts are actually present 'on-board', they are all 'zero-switch' ones - i.e. whatever combination of them the scheduler decides to feed to the SIMDs - it's all good - there are no penalties.

Yup, I thought the same, maybe I wasn't clear on that.

blu wrote:
generally, all it takes for a combination of contexts to make it together for execution is:
a) they all have the same ALU (u)op pending
b) the operands of that op are all of the same data type

...but I disagree on this part. If an execution unit/shader unit processes only 32 bits at a time (one float, for example), how can we bundle more than one thread onto it?
I think you're suggesting that there is a 'wide SIMD' with more than one float per register (like 3DNow!, SSE, etc.), so each register can process multiple threads in one op-cycle.
I suggest that there are 4 independent (executing different code) 'thin SIMD' units (and calling them SIMD is an overstatement) that can process only 1 float at a time. The SIMD part of their name comes from the fact that they can process 2 words or 4 bytes at a time (which rarely happens in 'modern' shaders). This approach would enable executing completely different shaders concurrently (VS, PS, GS) without such strong requirements as you cited in a) & b). But again - I might be channeling another architecture: the G945 works as I described, so I might be projecting it onto SGX...

blu wrote:
phrosty wrote:
less time arguing, more time coding. chop chop!! :P

i'd gladly get down and dirty if you help me get my pandora ahead of schedule :P

I second that! :D

