
When programming with SSE, there are cases where you need a double quadword (a full 128-bit vector) with every bit set to 1, for example for a simple bitwise not:
_mm_andnot_ps(x, allone); /* allone: a vector with all 128 bits set */
(This computes ~x & ~0, which of course is equal to ~x, the bitwise not we were looking for.) The bitwise not is needed to implement some of the missing comparisons. SSE2 only has integer comparisons for == and > (and < by swapping the operands of >). To implement !=, <= and >= you need a bitwise not.
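As a minimal sketch of that idea, assuming the SSE2 intrinsics from <emmintrin.h>, the missing comparisons on 32-bit integers could be built roughly like this. The helper names my_not_si128, my_cmpneq_epi32 and my_cmple_epi32 are made up for illustration, and _mm_set1_epi32(-1) is only a stand-in for the all-ones vector.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: bitwise NOT, computed as (~x) & all-ones. */
static inline __m128i my_not_si128(__m128i x) {
    return _mm_andnot_si128(x, _mm_set1_epi32(-1));
}

/* a != b is ~(a == b), per 32-bit lane. */
static inline __m128i my_cmpneq_epi32(__m128i a, __m128i b) {
    return my_not_si128(_mm_cmpeq_epi32(a, b));
}

/* a <= b is ~(a > b), per signed 32-bit lane. */
static inline __m128i my_cmple_epi32(__m128i a, __m128i b) {
    return my_not_si128(_mm_cmpgt_epi32(a, b));
}
```

Each result lane is either all 1s (true) or all 0s (false), just like the results of the native comparisons, so the masks can be fed straight into and/andnot/or selects.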
So now that I have motivated your need for a double quadword with all 1s, where do you get it from? Well, easy, you say. Put a constant there. OK, it's 128 bits big, but what does that matter? And you're right, unless the constant is not in the L1 cache, because then loading it from the L2 cache (or from even further away) introduces unnecessary latency. So yes, it's a solution, but not the nicest one.
Here's a better one. Remember that a comparison in SSE gives you a full double quadword: every element for which the comparison holds is filled with 1s. So if all entries of the vector compare equal (for cmpeq, that is), you get a double quadword filled with 1s. Great... let's do it:
static inline __m128 _my_setallone() {
    __m128 r;
    return _mm_cmpeq_ps(r, r);
}
Works. Nice... except gcc warns about an uninitialized variable r being used. Looking at the generated code, we see that gcc added an extra xor instruction to initialize the register to 0. Ugh.
I can do better than two instructions. I want one! So inline assembly it is:
static inline __m128 _my_setallone() {
    __m128 r;
    __asm__ ("cmpeqps %0,%0" : "=x"(r) ::);
    return r;
}
You'd expect that to work? I sure did. And sometimes it did. Sometimes it didn't. Huh? Guessing... put a __volatile__ there: fewer failures, but not cured. So I remove the __volatile__ again and objdump -d the binary while stepping through the instructions in gdb in a split view in Konsole (nice feature there!). And what do I see? After cmpeqps has executed on a register, that register is all 0s! Why oh why? How can the same thing not be equal to itself? So I look at the value of the register before the instruction: 4 NaNs (at least when interpreted as 4 floats). I slap my head, remember that NaNs are never equal to anything, not even to themselves, and look for the integer comparison instruction.
The result:
static inline __m128i _my_setallone() {
    __m128i r;
    __asm__ ("pcmpeqb %0,%0" : "=x"(r) ::);
    return r;
}
Perfect. Now it works. I am able to do a bitwise not in two instructions (gcc is able to keep the xmm register around for another pandn, so this is also as good as it can get) and I don't have to worry about the cache at all.
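The NaN pitfall can be reproduced without inline assembly. The following is a small sketch using SSE2 intrinsics; the helper names my_self_equal_ps and my_self_equal_epi32 are invented for this illustration. Both compare a register with itself, once as four floats and once as four 32-bit integers, and return one mask bit per lane.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */

/* Hypothetical helper: compare a register with itself as four floats.
   Returns a 4-bit mask, one bit per lane (1 = compared equal). */
static inline int my_self_equal_ps(__m128 v) {
    return _mm_movemask_ps(_mm_cmpeq_ps(v, v));
}

/* Hypothetical helper: compare the very same bits as four integers. */
static inline int my_self_equal_epi32(__m128 v) {
    __m128i bits = _mm_castps_si128(v);
    __m128i eq = _mm_cmpeq_epi32(bits, bits);
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}
```

For a register holding ordinary floats both helpers return 0xF. For a register whose lanes happen to contain NaN bit patterns (all bits set is one such pattern), the float version returns 0 while the integer version still returns 0xF.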
Lesson learned: don't mess with floating point instructions when trying to do bit magic.
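For anyone who would rather stay away from inline assembly, here is a hedged, intrinsic-only sketch. The name my_setallone_intrin is hypothetical. It relies on an assumption that is worth verifying with objdump -d on your own build: GCC and Clang typically materialize an all-ones vector constant with a single pcmpeqd on the register instead of loading it from memory when optimization is enabled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical intrinsic-only variant of _my_setallone().
   Assumption: the compiler emits a register-only pcmpeqd for this
   constant rather than a load; check the disassembly to be sure. */
static inline __m128i my_setallone_intrin(void) {
    return _mm_set1_epi32(-1);  /* every bit of every lane is 1 */
}
```

Since this is a pure integer constant, no floating point comparison is involved and the NaN problem cannot occur.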