
Short status update from me for all who still don't know:
I'm working on my diploma thesis this year. My day-to-day work is making the TPC track finder software for the ALICE detector at the LHC as fast as possible. (TPC Images) Target hardware is multicore x86(_64) or GPUs. It's a challenging task, and I enjoy gaining more experience in the HPC sector.
What I wanted to point everybody to, though, is that Intel released the intrinsics it will support on Larrabee at the last gaming conference. What I didn't notice until today is that they also released a complete header that lets you program with those intrinsics right now. At home...
The header provides a scalar and an SSE implementation of the full set of Larrabee intrinsics. Once Larrabee and its development tools are available, your program will then run on LRB and be able to use the vector instructions in no time.
If you don't know why LRB vector instructions are so cool, let me tell you: SSE is nice. You can do four floating-point operations in one instruction (or four ints, or two doubles...). With LRB the vector width is four times bigger: 16 floats/ints, 8 doubles. But the LRB instructions are also a lot nicer than SSE. You have a 16- or 8-bit mask available to select the entries of the vector that the instruction should write. You have all arithmetic instructions available for float, int and double (SSE 4.1 finally brought the packed 32-bit multiply instruction for int). You have gather/scatter instructions that make it easy to access data that is stored in structs. You have free conversions and swizzles in the loads and stores (e.g. you can store data as half-floats and compute as floats, halving the I/O bandwidth your code needs).
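To make the write mask and the gather a bit more concrete, here is a minimal scalar sketch of what the two operations do per lane. It only models the semantics described above. The names (`Vec16f`, `Vec16i`, `Mask16`, `masked_add`, `gather`) are made up for this illustration and are not the identifiers from Intel's header.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 16-wide vector; NOT the types from Intel's header.
constexpr std::size_t kWidth = 16;
using Vec16f = std::array<float, kWidth>;
using Vec16i = std::array<std::int32_t, kWidth>;
using Mask16 = std::uint16_t;  // bit i decides whether lane i gets written

// Masked add: lanes with their mask bit set receive a[i] + b[i],
// all other lanes keep the value they already had in dst.
Vec16f masked_add(Vec16f dst, Mask16 mask, const Vec16f& a, const Vec16f& b) {
    for (std::size_t i = 0; i < kWidth; ++i) {
        if (mask & (1u << i)) {
            dst[i] = a[i] + b[i];
        }
    }
    return dst;
}

// Gather: lane i loads base[index[i]]. With indices 0, 3, 6, ... this pulls
// one member out of an array of three-float structs in a single operation.
Vec16f gather(const float* base, const Vec16i& index) {
    Vec16f result{};
    for (std::size_t i = 0; i < kWidth; ++i) {
        result[i] = base[index[i]];
    }
    return result;
}
```

On real vector hardware each of these loops would be a single instruction. The scalar version is only meant to show which lanes get touched.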
Now, intrinsics are already a lot nicer than writing inline assembly. But in the end you want to have a C++ class for this, right? If you do, let me know.
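Just to show the kind of interface I have in mind, here is a rough sketch of such a class. It is only an assumption about what a wrapper could look like: `FloatVec` is a hypothetical name, and the storage is a plain array with scalar loops where a real implementation would hold a vector register and call the intrinsics.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper class, scalar stand-in for a 16-wide float vector.
// A real version would replace the loops with the corresponding intrinsics.
class FloatVec {
public:
    static constexpr std::size_t Size = 16;

    FloatVec() : d_{} {}                        // all lanes zero
    explicit FloatVec(float x) { d_.fill(x); }  // broadcast one value to all lanes

    float operator[](std::size_t i) const { return d_[i]; }
    float& operator[](std::size_t i) { return d_[i]; }

    // Lane-wise addition.
    friend FloatVec operator+(const FloatVec& a, const FloatVec& b) {
        FloatVec r;
        for (std::size_t i = 0; i < Size; ++i) r.d_[i] = a.d_[i] + b.d_[i];
        return r;
    }

    // Lane-wise multiplication.
    friend FloatVec operator*(const FloatVec& a, const FloatVec& b) {
        FloatVec r;
        for (std::size_t i = 0; i < Size; ++i) r.d_[i] = a.d_[i] * b.d_[i];
        return r;
    }

private:
    std::array<float, Size> d_;
};
```

With something like this you could write `FloatVec c = a * b + a;` and have it look like ordinary scalar code while working on 16 values at once.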