<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   >
<channel>
    <title>/Kretz/blog/ - Programming</title>
    <link>http://vir.homelinux.org/blog/</link>
    <description>Things we want to tell the world...</description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.2.2 - http://www.s9y.org/</generator>
    <pubDate>Sat, 09 May 2009 11:32:03 GMT</pubDate>

    <image>
        <url>http://vir.homelinux.org/blog/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: /Kretz/blog/ - Programming - Things we want to tell the world...</title>
        <link>http://vir.homelinux.org/blog/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Do you need 128 1s?</title>
    <link>http://vir.homelinux.org/blog/archives/131-Do-you-need-128-1s.html</link>
            <category>Programming</category>
    
    <comments>http://vir.homelinux.org/blog/archives/131-Do-you-need-128-1s.html#comments</comments>
    <wfw:comment>http://vir.homelinux.org/blog/wfwcomment.php?cid=131</wfw:comment>

    <slash:comments>8</slash:comments>
    <wfw:commentRss>http://vir.homelinux.org/blog/rss.php?version=2.0&amp;type=comments&amp;cid=131</wfw:commentRss>
    

    <author>nospam@example.com (Matthias Kretz)</author>
    <content:encoded>
    &lt;img class=&quot;serendipity_authorpic&quot; src=&quot;http://vir.homelinux.org/blog/templates/default/img/Matthias_Kretz.png&quot; alt=&quot;Author&quot; title=&quot;Matthias Kretz&quot; /&gt;&lt;p&gt;When programming with SSE there are some cases where you need a double-quad vector filled with all 1s. E.g. for a simple bitwise not:&lt;div class=&quot;c&quot; style=&quot;text-align: left&quot;&gt;_mm_andnot_ps&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;x, &lt;span style=&quot;color: #cc66cc;&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;;&lt;br /&gt;&amp;#160;&lt;/div&gt;
(This does a (~x &amp;amp; ~0) which of course is equal to ~x, the bitwise not we were looking for.) The bitwise not is necessary to implement some of the missing comparisons. SSE2 only has integer comparisons for ==, &lt; and &gt;. To implement !=, &lt;= and &gt;= you need a bitwise not.&lt;/p&gt;
&lt;p&gt;So now that I have motivated your need for a double quad with all 1s, where do you get it from? Well, easy, you say. Put a constant there. OK, it&#039;s 128 bits big, but what does that matter? And you&#039;re right, unless the constant is not in the L1 cache, because then loading it from the L2 cache introduces unnecessary latency. So yes, it&#039;s a solution, but not the nicest one.&lt;/p&gt;
&lt;p&gt;Here&#039;s a better one. Remember that a comparison in SSE gives you a full double-quad. And if the comparison of the entries in the vector says they were all equal (for cmpeq, that is) you get a double-quad filled with 1s. Great... let&#039;s do it:&lt;div class=&quot;c&quot; style=&quot;text-align: left&quot;&gt;&lt;span style=&quot;color: #993333;&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: #000000; font-weight: bold;&quot;&gt;inline&lt;/span&gt; __m128 _my_setallone&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt; &lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#123;&lt;/span&gt;&lt;br /&gt;&amp;#160; __m128 r;&lt;br /&gt;&amp;#160; &lt;span style=&quot;color: #b1b100;&quot;&gt;return&lt;/span&gt; _mm_cmpeq_ps&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;r, r&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;;&lt;br /&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#125;&lt;/span&gt;&lt;br /&gt;&amp;#160;&lt;/div&gt;
Works. Nice... except gcc warns about an uninitialized variable r being used. Looking at the generated code we see it added another xor instruction to initialize the register to 0. Ugh.&lt;/p&gt;
&lt;p&gt;I can do better than two instructions. I want one! So inline assembly it is:&lt;div class=&quot;c&quot; style=&quot;text-align: left&quot;&gt;&lt;span style=&quot;color: #993333;&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: #000000; font-weight: bold;&quot;&gt;inline&lt;/span&gt; __m128 _my_setallone&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt; &lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#123;&lt;/span&gt;&lt;br /&gt;&amp;#160; __m128 r;&lt;br /&gt;&amp;#160; __asm__ &lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;&lt;span style=&quot;color: #ff0000;&quot;&gt;&quot;cmpeqps %0,%0&quot;&lt;/span&gt;:&lt;span style=&quot;color: #ff0000;&quot;&gt;&quot;=x&quot;&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;r&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;::&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;;&lt;br /&gt;&amp;#160; &lt;span style=&quot;color: #b1b100;&quot;&gt;return&lt;/span&gt; r;&lt;br /&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#125;&lt;/span&gt;&lt;br /&gt;&amp;#160;&lt;/div&gt;
You&#039;d expect that to work? I sure did. And sometimes it did. Sometimes it didn&#039;t. Huh? Guessing... put a __volatile__ there: fewer failures. But not cured. So remove the __volatile__ again and objdump -d the binary while doing instruction steps in gdb in a split view in Konsole (nice feature there!). And what do I see? After cmpeqps was executed on a register, that register is all 0s! Why oh why? How can the same thing not be equal? So I look at the value of the register before the instruction: 4 NaNs (at least when interpreted as 4 floats). I slap my head, remember that NaNs are never equal to anything, not even themselves, and look for the integer comparison instruction.&lt;/p&gt;
&lt;p&gt;The result:&lt;div class=&quot;c&quot; style=&quot;text-align: left&quot;&gt;&lt;span style=&quot;color: #993333;&quot;&gt;static&lt;/span&gt; &lt;span style=&quot;color: #000000; font-weight: bold;&quot;&gt;inline&lt;/span&gt; __m128i _my_setallone&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt; &lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#123;&lt;/span&gt;&lt;br /&gt;&amp;#160; __m128i r;&lt;br /&gt;&amp;#160; __asm__ &lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;&lt;span style=&quot;color: #ff0000;&quot;&gt;&quot;pcmpeqb %0,%0&quot;&lt;/span&gt;:&lt;span style=&quot;color: #ff0000;&quot;&gt;&quot;=x&quot;&lt;/span&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#40;&lt;/span&gt;r&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;::&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#41;&lt;/span&gt;;&lt;br /&gt;&amp;#160; &lt;span style=&quot;color: #b1b100;&quot;&gt;return&lt;/span&gt; r;&lt;br /&gt;&lt;span style=&quot;color: #66cc66;&quot;&gt;&amp;#125;&lt;/span&gt;&lt;br /&gt;&amp;#160;&lt;/div&gt;
Perfect. Now it works. I am able to do a bitwise not in two instructions (gcc is able to keep the xmm register around for another pandn, so this is also as good as it can get) and I don&#039;t have to worry about the cache at all.&lt;/p&gt;
&lt;p&gt;Lesson learned: don&#039;t mess with floating point instructions when trying to do bit magic.&lt;/p&gt; 
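&lt;p&gt;The NaN pitfall behind this lesson can be reproduced without SSE. What follows is a minimal scalar sketch, not code from the functions above: the helper names float_from_bits, lane_equal_float and lane_equal_int are made up for illustration, and it assumes a 32-bit unsigned int and IEEE 754 floats. Comparing the all-1s pattern as a float fails, since it is a NaN, while the integer comparison of the same bits succeeds:&lt;/p&gt;
&lt;pre&gt;
```c
/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
   Assumes unsigned int is 32 bits wide and float is IEEE 754 single precision. */
static float float_from_bits(unsigned int bits) {
    union { unsigned int u; float f; } v;
    v.u = bits;
    return v.f;
}

/* Floating point compare, the scalar analogue of one cmpeqps lane.
   Returns 1 if the two values compare equal, 0 otherwise (always 0 for NaN). */
static int lane_equal_float(unsigned int a, unsigned int b) {
    return float_from_bits(a) == float_from_bits(b);
}

/* Integer compare, the scalar analogue of a pcmpeq lane.
   Returns 1 if the bit patterns match, 0 otherwise. */
static int lane_equal_int(unsigned int a, unsigned int b) {
    return a == b;
}
```
&lt;/pre&gt;
&lt;p&gt;With the pattern 0xFFFFFFFF in both arguments, lane_equal_float returns 0 and lane_equal_int returns 1, mirroring what cmpeqps and pcmpeqb did to the uninitialized register.&lt;/p&gt;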
    </content:encoded>

    <pubDate>Thu, 07 May 2009 12:58:00 +0200</pubDate>
    <guid isPermaLink="false">http://vir.homelinux.org/blog/archives/131-guid.html</guid>
    
</item>
<item>
    <title>Programming for Larrabee</title>
    <link>http://vir.homelinux.org/blog/archives/130-Programming-for-Larrabee.html</link>
            <category>Programming</category>
    
    <comments>http://vir.homelinux.org/blog/archives/130-Programming-for-Larrabee.html#comments</comments>
    <wfw:comment>http://vir.homelinux.org/blog/wfwcomment.php?cid=130</wfw:comment>

    <slash:comments>10</slash:comments>
    <wfw:commentRss>http://vir.homelinux.org/blog/rss.php?version=2.0&amp;type=comments&amp;cid=130</wfw:commentRss>
    

    <author>nospam@example.com (Matthias Kretz)</author>
    <content:encoded>
    &lt;img class=&quot;serendipity_authorpic&quot; src=&quot;http://vir.homelinux.org/blog/templates/default/img/Matthias_Kretz.png&quot; alt=&quot;Author&quot; title=&quot;Matthias Kretz&quot; /&gt;&lt;p&gt;Short status update from me for all who still don&#039;t know:&lt;br/&gt;
I&#039;m working on my diploma thesis this year. My day-to-day work is making the TPC Trackfinder software for the &lt;a href=&quot;http://aliceinfo.cern.ch/Collaboration/&quot;&gt;Alice detector of the LHC&lt;/a&gt; as fast as possible. (&lt;a href=&quot;http://images.google.com/images?q=alice+tpc&quot;&gt;TPC Images&lt;/a&gt;) Target hardware is multicore x86(_64) or GPUs. It&#039;s a challenging task, and I enjoy getting more experience in the HPC sector.&lt;/p&gt;
&lt;p&gt;What I wanted to point everybody at, though, is that &lt;a href=&quot;http://feedproxy.google.com/~r/ISNMulticore/~3/yTysBCD2Hrs/prototype-primitives-guide&quot;&gt;Intel released the intrinsics&lt;/a&gt; it will support on the Larrabee at the last gaming conference. What I didn&#039;t notice until today is that they also released a &lt;a href=&quot;http://software.intel.com/file/15165&quot;&gt;complete header&lt;/a&gt; that allows you to program with those intrinsics now. At home... &lt;img src=&quot;http://vir.homelinux.org/blog/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt; The header provides a scalar and an SSE implementation of the full set of Larrabee intrinsics. Once the Larrabee and its development tools are available, your program will run on LRB and be able to use the vector instructions in no time.&lt;/p&gt;
&lt;p&gt;If you don&#039;t know why LRB vector instructions are so cool, let me tell you: SSE is nice. You can do four floating point operations in one instruction (or four int, or two double...). With LRB the vector width is four times bigger: 16 floats/ints, 8 doubles. But the LRB instructions are &lt;strong&gt;a lot&lt;/strong&gt; nicer than SSE. You have a 16- or 8-bit mask available to select the entries of the vector the instruction should write. You have all arithmetic instructions for float, int and double available (SSE 4.1 finally brought the multiply instruction for int). You have gather/scatter instructions that make it easy to access data that is stored in structs. You have free conversions and swizzles in the loads and stores (e.g. you can store data as half-floats and compute as floats, halving the I/O bandwidth your code needs).&lt;/p&gt;
&lt;p&gt;Now, intrinsics are already a lot nicer than writing inline assembly. But in the end you want to have a C++ class for this, right? If you do, let me know. &lt;img src=&quot;http://vir.homelinux.org/blog/templates/default/img/emoticons/smile.png&quot; alt=&quot;:-)&quot; style=&quot;display: inline; vertical-align: bottom;&quot; class=&quot;emoticon&quot; /&gt;&lt;/p&gt; 
    </content:encoded>

    <pubDate>Wed, 22 Apr 2009 12:46:00 +0200</pubDate>
    <guid isPermaLink="false">http://vir.homelinux.org/blog/archives/130-guid.html</guid>
    
</item>

</channel>
</rss>