Thursday, May 7. 2009
Do you need 128 1s?
Comments
On AMD, "pcmpeqb/d" uses an extra instruction and blocks the fadd pipe for 1 cycle, so a load from cached memory will be faster. Address generation on AMD is free. Intel Core has 3 ports that can handle "pcmpeqb/d", so it is less likely to cause a delay, although one could probably write (likely nonsense) code that executes faster with a load. Use a recently used scratch register to avoid a register read stall.

But I would assume the pcmpeqb/d to still be faster than an L2 load, no? An L1 load should be faster than pcmpeqb/d on both Intel and AMD, AFAIU. Regarding the register read stall: the way the code above is written, gcc decides which register it uses. Though perhaps gcc gets fooled by r not being in the read list of the asm call?

I think you'll get the fastest code by declaring a global register variable, which would also be architecture oblivious. If you are lucky, GCC will put it in e.g. xmm15 once and call it a day. Can you post what you want the function to do? I love doing this kind of puzzle.

    static inline VectorType cmpneq(VectorType a, VectorType b) { return _mm_andnot_si128(cmpeq(a, b), _my_setallone_si128()); }
    static inline VectorType cmpnlt(VectorType a, VectorType b) { return _mm_andnot_si128(cmplt(a, b), _my_setallone_si128()); }
    static inline VectorType cmple (VectorType a, VectorType b) { return _mm_andnot_si128(cmpgt(a, b), _my_setallone_si128()); }

And in the end these are used to implement operator==, !=, >, >=, <, and <= for an SSE::Vector<int> class.

Declaring a constant __m128 poses an interesting problem, though. So far the only working approaches I know of are to either put an array of 4 ints aligned to a 16-byte boundary and call _mm_load every time you want to use it, or use a union like

    static const union { unsigned int m[4]; __m128 v; } ALLONE = { { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF } };

Both are ugly, but both work alright.
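To make the two options from the comments above concrete, here is a minimal sketch of generating the all-ones vector with a self-compare versus loading it from an aligned constant, plus a cmpneq built on top of it. The names allone_by_compare, allone_by_load, ALLONE_MEM, cmpneq_epi32 and is_all_ones are made up for this illustration and are not from the original post. It assumes SSE2 and uses __m128i rather than __m128, since the comparisons here are on integers. Which variant is faster depends on the CPU and the surrounding code, as discussed above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* All-ones via a self-compare: every lane equals itself, so all 128 bits
   come out set. Compilers typically turn this into a single pcmpeqd. */
static inline __m128i allone_by_compare(void) {
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* All-ones via a constant in memory, following the union approach.
   The __m128i member gives the union 16-byte alignment. */
static const union { uint32_t m[4]; __m128i v; } ALLONE_MEM = {
    { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu }
};
static inline __m128i allone_by_load(void) {
    return ALLONE_MEM.v;
}

/* a != b per 32-bit lane: pandn computes (~first) & second,
   so this is NOT(a == b) AND all-ones. */
static inline __m128i cmpneq_epi32(__m128i a, __m128i b) {
    return _mm_andnot_si128(_mm_cmpeq_epi32(a, b), allone_by_compare());
}

/* Helper for checking results: returns 1 if all 128 bits of v are set. */
static int is_all_ones(__m128i v) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_set1_epi8(-1))) == 0xFFFF;
}
```

With optimization enabled, a compiler is free to replace either variant with whatever it considers cheapest, so checking the generated assembly is the only way to know which instruction you actually get.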
    static inline VectorType cmple (VectorType a, VectorType b) { return _mm_andnot_si128(cmpgt(b, a); }

In the worst case this generates an extra movdqa (if the input needs to be preserved). I haven't thought about it too much, but you can probably eliminate every cmpneq, simply by replacing following pandn with pand and vice versa.

You should be able to declare a global register variable like this:

    register __m128i ones asm("xmm15")=_mm_set1_epi32(0xffffffff);

But it's probably better to make it a static member.

    static inline VectorType cmple (VectorType a, VectorType b) { return _mm_cmpgt(b,a); }
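The pand/pandn swap suggested above can be sketched as follows. This is only an illustration for the case where the != mask is immediately used to select lanes from another vector. The function names keep_where_neq_naive and keep_where_neq are hypothetical, and the code assumes SSE2 with 32-bit integer lanes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Keep the lanes of x where a != b, the direct way:
   build the != mask with an all-ones constant, then pand it with x. */
static inline __m128i keep_where_neq_naive(__m128i a, __m128i b, __m128i x) {
    __m128i ones = _mm_cmpeq_epi32(x, x);                       /* all bits set */
    __m128i neq  = _mm_andnot_si128(_mm_cmpeq_epi32(a, b), ones);
    return _mm_and_si128(neq, x);
}

/* Same result without the constant: apply pandn to the == mask directly,
   since (~eq) & x equals ((~eq) & ones) & x. */
static inline __m128i keep_where_neq(__m128i a, __m128i b, __m128i x) {
    return _mm_andnot_si128(_mm_cmpeq_epi32(a, b), x);
}
```

The second form only helps when the inverted mask feeds straight into a pand or pandn. If the mask itself has to be returned, for example from an operator!=, the all-ones constant is still needed.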