
Rico Mariani's Performance Tidbits
Implying no warranties and conferring no rights: "AS IS" since 1988

Non-Properties of floating point numbers

Thu, 10/16/2014 - 05:46

I recently received a customer question that boiled down to the oft-encountered binary floating point inexact representation issue.  They were rather troubled that basic identities normal to numbers did not apply to floating point arithmetic. 

Learning that floating point is kind of messed up is kind of like finding out there is no Santa Claus.  Sorry if I just spoiled that for you :)

It made me think about some of my undergraduate math, and so I decided to illustrate just how messed up floating point numbers are.

What are some of the properties of floating point math compared to ordinary numbers?  Do they form a "Group" under addition?

http://en.wikipedia.org/wiki/Group_(mathematics)#Definition

Let's try:

---------------
Closure

For all a, b in G, the result of the operation, a + b, is also in G.

Well, no, we lose already because floating point numbers can overflow.

From my JavaScript console:

>1e308
1e+308

>1e308+1e308
Infinity

Oops.  OK, but never mind that, it's mostly closed, right?  I mean, overflow hardly ever happens.  Let's trudge on bravely.

---------------
Associativity

For all a, b and c in G, (a + b) + c = a + (b + c).

That looks harmless enough, right?  I can just regroup the parens, what could possibly go wrong?

>(1+1e200)-1e200
0

>1+(1e200-1e200)
1

Well... drat.

------------------
Identity element

There exists an element e in G, such that for every element a in G, the equation e + a = a + e = a holds. Such an element is unique, and thus one speaks of the identity element.

OK this one is really easy right?

All I need is a unique identity element, that's the "zero" and I can add zero and get the same thing back. 

>1+0
1

Great!  But wait...

>1+1e-100
1

And 1e-100 isn't another name for zero...

>1e-100
1e-100

Oops...

------------------
Inverse element

For each a in G, there exists an element b in G such that a + b = b + a = 0

OK on this one I think we're good.  If I started with a number I can represent in floating point, x, then there is exactly one number I can add to x which will give me zero and that is -x.  Now this may seem a little like I'm cheating because

>1.1111111111111111 - 1.111111111111111111111111111111
0

Seems to indicate that there is another number that I could subtract to get zero.

However...

>1.111111111111111111111111111111
1.1111111111111111

As we can see 1.111111111111111111111111111111 isn't a valid floating point number on my system so it's not fair to try to use it.  It's automatically converted to the best approximation which is 1.1111111111111111.

Now, anyone want to tell me why addition of floating point numbers is necessarily commutative?
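
If you'd rather poke at these properties outside a JavaScript console, here's a small C++ sketch (mine, not from the original post) that reproduces the same observations with IEEE doubles:

// Minimal demo of the non-properties above using IEEE doubles.
#include <cstdio>

int main()
{
    // Closure: the sum of two representable numbers can overflow to infinity.
    printf("1e308 + 1e308       = %g\n", 1e308 + 1e308);          // inf

    // Associativity: regrouping changes the answer.
    printf("(1 + 1e200) - 1e200 = %g\n", (1.0 + 1e200) - 1e200);  // 0
    printf("1 + (1e200 - 1e200) = %g\n", 1.0 + (1e200 - 1e200));  // 1

    // Identity: adding a nonzero number can still hand you the same value back.
    printf("1 + 1e-100 == 1     : %d\n", 1.0 + 1e-100 == 1.0);    // 1 (true)

    return 0;
}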

 


Career advice for anyone who cares to listen :)

Fri, 10/10/2014 - 03:06

At the Grace Hopper Celebration of Women in Computing there are many attendees offering and looking for good advice.  Dear friends among them.  I read some “inarticulate” responses earlier today and they prompted me to think of what advice I might offer.  As I considered this I found myself thinking the same thoughts I always think when counselling people on how to have a good career. And so I offer these tidbits, trite as they may appear, in good faith, to all genders, religions, orientations, ages, and other demographics equally. I do this partly because it seems timely but also partly because I promised Emma Watson I would do something in the spirit of #heforshe even if she didn’t exactly hear me make that promise.

1. Don’t compromise

It is, in my opinion, impossible to have a good career if you aren’t investing in a good life and so you must never sacrifice the things that most matter to you in the name of career progress.  That kind of sacrifice will ultimately backfire without fail.  It doesn’t matter if what matters to you is your husband, your wife, your children, your church, your fire department, your animal friends, your theater, your poetry, your dancing, or anything else.  It is these things that we do when we leave the office that nourish us and put us in the frame of mind to feel valued, to do our best work, to not be spiteful, but instead be as successful as we can be.  If you come to work feeling like what you really should be doing is something else, or that you failed to do that something else this past weekend, you cannot give matters at hand the attention they require and you will then fail at all the things.

It is my belief and experience that you will have the best career you can possibly have by taking care of all your needs.  It only superficially feels like you’re compromising your career to do "that other thing", in the end, you aren’t.

2. Bring all your experience to the job every day

Do not compartmentalize who you are.  You are a whole person. All those other things I mentioned up there, whatever they may be, they enrich you and make you better.  They are skills and knowledge that you can and should tap every day.  If you try to turn some of them off, or fail to consider the utility of some of the others, you are less than all you can be, and it will show in your work.  I’m unable to think of any endeavors a person could be passionate about that bring no material value to a professional context.

3. Find mentorship and sponsorship

Professional waters are difficult to navigate.  You put yourself at a huge disadvantage if you do not have colleagues that can help you avoid common mistakes and give you important perspective.  Mentors are out there for the asking, and while you may not succeed immediately I can promise that, as the saying goes, “When the student is ready, the master WILL appear.”  Sponsorship is as important but in a different way, and it can’t always be the same person.  I think of sponsorship as having a person or persons “out there” that have a good sense of who you are, what your desires are, where you are in your career stage, and so forth.  These people need to be in play so that when opportunities arise, they will be able to speak for you.  It is truly unfortunate if “Jane” doesn’t get that choice assignment and promo opportunity simply because nobody knew that she always wanted to work on “ratafrat development” and had experience from her hobby.

4. Keep investing in yourself

Ongoing professional and personal development is essential to be promoted in a corporate environment, but I think it’s also essential for your happiness.  If you end the year a [slightly?] better engineer, parent, husband, wife, poet, singer, dancer, or whatever matters to you… you’ve had a good year.  That’s worth celebrating.  Making yourself a better person will pay in the end.  I’ve personally been astonished how many times my little side things turned out to provide just the thing I needed down the line.

5. Self-advocacy is great, but do it in the way that fits best in your environment

As far as promotions go, the best way to get promoted is to be doing the work already.  The best way to get raises is to be extraordinarily competent.  But those things alone might not be enough.  Some corporations give raises through a fairly formalized process where asking for a raise really isn’t even an option.  But in those cases the questions to be asking are more like, “what could I be doing to earn my next promotion, I want to be a strong candidate” – from a career management position that’s still a great take-charge way to proceed.  And it shouldn’t be off-putting.  On the other hand, there are plenty of situations where getting raises is a personal experience that practically requires a specific request.  These are great things to discuss with a mentor.

Even if you’re not getting your promotions as often as you want, if you are constantly a candidate for promotion you can hardly be considered among the weaker team members.  You will command a good share of the budget for raises.

If you feel you are being treated unfairly, I can’t recommend silence, but sadly I can’t make a specific recommendation that’s good in every organization. 

6. Self-worth goes a long way toward projecting and inspiring confidence

In the end, getting assignments and promotions may come down to whether or not your management feels you can handle those situations.  If you are always cautious and appear to doubt your own self-worth, or need to be reminded of it constantly, you are unlikely to inspire the confidence needed to get the big jobs.  You are also much more likely to accept substandard pay because you think yourself unworthy of a fair wage. 

Trust your skills and yourself, do not accept the unacceptable, and conduct yourself with integrity and professionalism; these things will inspire your organization to give you the rewards you deserve.  And if they don’t, it may be time to consider another organization.  A skilled worker knows their value.

Conclusion

I’m sorry if all those thoughts seem like so much trite advice; it seems like I said nothing noteworthy at all, and yet, in my experience, those are the things that people seem to need to hear most, so there they are.

Best wishes and thanks for reading.


Performance Quiz #14: Memory Locality etc. Bonus Round!

Mon, 09/29/2014 - 21:54

[Note: I accidentally pasted an x64 result in place of an x86 result. As it happens, the point I was trying to make was that they were very similar, which they are, but they aren't identical... corrected.]

Thanks to some careful readers I discovered why the shuffling factors discussed in my previous blog entry were so strange.  The rand() function I was using only returns numbers between 0 and RAND_MAX, which is 32767 on my system.  That meant that shuffling became a non-factor as the number of elements in the array increased.

I've since switched the code to use this instead:

#include <random> // at the top of the file

    void Shuffle(int m)
    {
        std::mt19937 gen(0);  // repeatable results desired
        std::uniform_int_distribution<> dist(0, _n - 1);

        // swap m randomly chosen pairs of slots, skipping when the indices coincide
        for (int nn = 0; nn < m; nn++)
        {
            int i = dist(gen);
            int j = dist(gen);

            if (i == j) continue;

            int tmp = _perm[i];
            _perm[i] = _perm[j];
            _perm[j] = tmp;
        }
    }

With this change shuffling becomes generally effective again and the result is that non-locality dominates the effects, which is really a lot more like what we would expect.  I thought that the effect might top out at very large array sizes but it does not.  The normal prefetching of in-order memory continues to give us benefits even at very large array sizes.  Let's look at the architectural differences once again.

Pointer implementation with no changes
sizeof(int*)=4, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     4000,   1.99,   1.99,   2.20,   2.49,   2.92,   3.20
     8000,   1.85,   1.96,   2.38,   3.13,   3.88,   4.34
    16000,   1.88,   1.96,   2.56,   3.40,   4.59,   4.94
    32000,   1.88,   2.17,   3.52,   6.20,   8.16,  11.24
    64000,   1.88,   2.33,   4.45,   8.13,  11.87,  15.96
   128000,   2.06,   2.27,   5.15,   9.32,  14.81,  18.29
   256000,   1.90,   2.35,   5.90,  10.99,  16.26,  21.39
   512000,   1.92,   2.77,   6.42,  11.59,  17.55,  23.71
  1024000,   2.22,   3.42,   7.96,  14.58,  22.97,  30.00
  2048000,   2.54,   4.68,  17.47,  38.54,  62.76,  90.17
  4096000,   2.50,   5.70,  25.17,  53.21,  89.79, 121.06
  8192000,   2.50,   6.31,  30.36,  63.07, 106.04, 142.09
 16384000,   2.50,   6.68,  32.94,  68.79, 115.24, 156.78
 32768000,   2.55,   7.22,  34.64,  73.71, 123.98, 165.47
 65536000,   2.55,   7.96,  36.49,  73.39, 126.17, 168.88

 OK here we see just how profound the non-locality effects are.  The cost per item goes up to a whopping 168.88 at the bottom of the table for fully shuffled data.  And you can see that even a small amount (1%) of non-locality makes a crushing difference.  You can also clearly see that while the data fits none of this matters a whit.  But even at 4k elements we're already starting to see the effects.

Now let's compare the same program run on x64.

Pointer implementation with no changes
sizeof(int*)=8, sizeof(T)=20
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   1.99,   1.99,   1.99,   2.28,   1.99
     2000,   2.28,   2.42,   2.56,   2.84,   3.27,   3.27
     4000,   2.28,   5.12,   2.92,   3.84,   4.62,   5.05
     8000,   2.17,   2.28,   3.13,   4.34,   5.62,   5.90
    16000,   2.24,   2.49,   3.95,   7.02,   8.28,  10.97
    32000,   2.25,   2.54,   5.00,   8.98,  13.33,  16.75
    64000,   2.24,   2.91,   6.01,  10.22,  15.69,  19.93
   128000,   2.40,   2.93,   7.06,  11.51,  18.80,  22.25
   256000,   2.45,   2.96,   7.16,  13.97,  19.75,  24.89
   512000,   3.06,   3.49,   8.25,  19.20,  25.23,  28.36
  1024000,   3.59,   5.39,  16.21,  34.18,  60.90,  83.61
  2048000,   3.79,   7.18,  27.84,  58.19,  97.72, 130.46
  4096000,   3.79,   8.18,  34.17,  70.37, 118.09, 153.55
  8192000,   3.78,   8.72,  37.35,  76.35, 128.28, 166.00
 16384000,   3.76,   9.27,  39.17,  81.54, 137.48, 173.84
 32768000,   3.77,  10.06,  40.25,  83.91, 140.59, 178.86
 65536000,   3.74,  10.94,  43.60,  86.91, 142.89, 183.30

OK, you can see that the increase in the size of the data structure makes the unshuffled data significantly worse, just like before.  And the size difference actually dilutes the non-locality somewhat: fully shuffled is merely 49 times worse than unshuffled on x64, rather than 66 times worse as on x86.  But it's still hella bad.  Note that a 66% data growth in this case results in a 46% performance hit on the unshuffled data, so x64 is doing worse overall not just because of raw data growth; the larger pointers are causing other inefficiencies.

Now I was going to skip the unaligned stuff thinking it would all be the same but I was wrong about that, these data end up being interesting too.  In the interest of saving space I'm just providing the first few rows of the table and the last few for these 4 experiments.

Pointer implementation with bogus byte in it to force unalignment
sizeof(int*)=4, sizeof(T)=13
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   1.99,   2.28,   2.28,   1.99,   2.28
     2000,   1.99,   2.13,   2.13,   2.13,   2.13,   2.13
     4000,   2.06,   2.13,   2.70,   3.06,   3.63,   4.05
 16384000,   2.75,   7.39,  35.18,  74.71, 124.48, 167.20
 32768000,   2.74,   7.83,  37.35,  77.54, 126.54, 170.70
 65536000,   2.69,   8.31,  37.16,  77.20, 129.02, 175.87

OK, naturally x86 is somewhat worse with unaligned pointers, but not that much worse.  Mostly we can attribute this to the fact that the data structure size is now one byte bigger.  The 2.69 in the leading column is likely a fluke, so I'm going to go with 2.74 as the final value for unshuffled data.  Note that the execution time degraded by about 7.4% and the data grew by about 8.3%, so really this is mostly about data growth and nothing else.  The shuffled cost grew only about 4%, so the caching effects seem to dilute the size growth somewhat.

Pointer implementation with bogus byte in it to force unalignment
sizeof(int*)=8, sizeof(T)=21
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   2.28,   2.28,   2.28,   2.28,   2.28
     2000,   2.42,   2.42,   2.84,   3.27,   3.70,   3.84
     4000,   2.42,   2.49,   3.34,   4.48,   5.48,   5.97
 16384000,   3.81,   9.58,  40.01,  83.92, 142.32, 181.21
 32768000,   3.96,  10.34,  41.85,  88.28, 144.39, 182.84
 65536000,   3.96,  11.37,  45.78,  92.34, 151.77, 192.66

Here's where things get interesting though.  On x64 the cost of unshuffled data grew to 3.96ns per item from 3.74.  That's a 5.8% growth.  And the shuffled data grew to 192 from 183, also about 5%.  That's pretty interesting because the data also grew 5%, exactly 5% as it turns out, from 20 to 21 bytes.

What happens if we grow it a little more to get perfect alignment?

Pointer implementation with extra padding to ensure alignment
sizeof(int*)=4, sizeof(T)=16
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   1.99,   1.99,   2.13,   2.13,   2.28,   2.13
     4000,   2.06,   1.99,   2.56,   3.34,   3.98,   4.05
 16384000,   3.04,   7.77,  35.52,  74.26, 125.54, 163.54
 32768000,   3.06,   8.44,  36.86,  77.08, 129.97, 168.43
 65536000,   3.08,   9.26,  38.16,  78.69, 129.42, 171.52

Well, on x86, growing it doesn't help at all.  The size hit brings us to 3.08 and 171.52 on the bottom row.  The first number is worse, increasing from 2.74, so padding certainly didn't help there.  The second number is somewhat better, ~171 vs. ~175: perhaps alignment is helping us to avoid cache splits, but it's only a 2% win and we paid a lot of space to get it (16 bytes instead of 13, or 23% growth).  What about on x64?

Here the situation is interesting.   

Pointer implementation with extra padding to ensure alignment
sizeof(int*)=8, sizeof(T)=24
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   2.28,   1.99,   1.99
     2000,   2.13,   2.28,   2.70,   2.99,   3.56,   3.70
     4000,   2.28,   2.28,   2.99,   3.91,   4.48,   4.98
 16384000,   4.30,   9.79,  38.55,  80.31, 133.95, 168.14
 32768000,   4.29,  10.81,  39.99,  83.10, 137.47, 168.60
 65536000,   4.29,  11.45,  44.03,  88.23, 143.56, 176.47

Look at that... things did get worse on unshuffled data due to the size growth; we're up to 4.29.  That's pretty much the result I expected, data size is king, especially in ordered data.  However, the last column is surprising.  With shuffled data we're now at 176.47.  Even though we grew the structure to 24 bytes it's actually running quite a lot faster than the ~193 we got in the worst (unaligned) case.  Again I blame the fact that the random access is subject to cache splits and those are not present if the data is aligned.  In fact even a small amount of randomness is enough to make the aligned data structures better.  At 10% shuffling we were already ahead despite the fact that the data is bigger.  So bottom line here, the unaligned pointer wasn't costing us much but creating cache splits definitely does, and we see those in the shuffled data.  [Note: this is a bit of a stretch because I didn't actually gather stats on cache splits.  I know that the odd-size packed data must have them and they seem sufficient to explain the behavior, whilst the in-order data will be largely immune, but that isn't full confirmation, so I could be wrong.]

Finally, looking at using indices instead of pointers.

Standard index based implementation
sizeof(int*)=4, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.70,   3.41,   3.41,   3.41,   3.70,   3.41
     2000,   3.41,   3.56,   3.41,   3.41,   3.41,   3.41
     4000,   3.63,   3.41,   3.63,   3.98,   4.34,   4.62
     8000,   3.45,   3.56,   3.84,   4.62,   5.19,   5.69
    16000,   3.52,   3.56,   4.00,   4.85,   5.67,   6.33
    32000,   3.48,   3.64,   5.12,   7.24,   9.83,  12.07
    64000,   3.48,   3.76,   6.10,   9.52,  13.78,  17.54
   128000,   3.48,   3.87,   6.70,  10.74,  15.90,  20.23
   256000,   3.48,   3.95,   7.34,  11.96,  17.48,  22.48
   512000,   3.46,   4.01,   7.75,  13.03,  19.69,  25.45
  1024000,   3.70,   4.89,  10.27,  16.68,  25.41,  32.78
  2048000,   3.80,   6.03,  19.54,  39.37,  65.81,  89.51
  4096000,   3.80,   7.12,  27.58,  55.90,  94.37, 125.29
  8192000,   3.78,   7.72,  32.42,  65.97, 110.21, 146.96
 16384000,   3.79,   8.07,  35.15,  71.50, 119.32, 159.43
 32768000,   3.63,   8.19,  35.50,  72.99, 121.11, 161.69
 65536000,   3.78,   9.18,  37.63,  76.34, 123.47, 164.85

As before, the index-based code is slower on x86... there is nothing but downside there since the pointers were already 32 bits.

Standard index based implementation
sizeof(int*)=8, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.41,   3.41,   3.41,   3.70,   3.41,   3.70
     2000,   3.41,   3.56,   3.41,   3.41,   3.41,   3.41
     4000,   3.56,   3.41,   3.63,   4.05,   4.34,   4.55
     8000,   3.41,   3.52,   3.88,   4.59,   5.19,   5.65
    16000,   3.43,   3.61,   4.00,   4.84,   5.76,   6.31
    32000,   3.47,   3.64,   5.08,   7.24,   9.78,  12.20
    64000,   3.48,   3.75,   6.11,   9.52,  13.84,  17.54
   128000,   3.49,   3.86,   6.71,  10.72,  15.86,  20.23
   256000,   3.48,   3.96,   7.34,  11.96,  17.83,  22.74
   512000,   3.45,   4.00,   7.75,  12.80,  19.16,  25.49
  1024000,   3.70,   5.27,  16.56,  16.10,  23.88,  32.46
  2048000,   3.80,   6.10,  19.48,  39.09,  66.05,  89.76
  4096000,   3.80,   6.94,  26.73,  54.67,  91.38, 125.91
  8192000,   3.79,   7.72,  32.45,  66.14, 110.40, 146.90
 16384000,   3.62,   7.98,  35.10,  71.72, 120.31, 159.13
 32768000,   3.77,   8.43,  36.51,  75.20, 124.58, 165.14
 65536000,   3.77,   9.15,  37.52,  76.90, 126.96, 168.40

And we see that the x64 run times are very similar.  x64 provides no benefits, but we can use the data reduction provided by 32-bit indices to save space and net a significant savings in execution time with even modestly (1%) shuffled data, an entirely normal situation.  And the space savings don't suck either.  The raw speed on fully ordered data is about a wash.

Just for grins I hacked up an implementation that uses 16 bit integers, here's what it looks like on x64, keeping in mind that you could only use this if your array sizes were known to be less than 64k (which they often are frankly). Or if you could flexibly swap it in.

Standard  short index based implementation
sizeof(int*)=8, sizeof(T)=6
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.98,   3.98,   3.98,   3.98,   3.98,   3.98
     2000,   3.84,   3.98,   3.84,   3.98,   3.84,   3.84
     4000,   3.91,   4.05,   3.91,   3.91,   3.91,   3.91
     8000,   3.95,   3.98,   4.12,   4.37,   4.69,   4.94
    16000,   4.00,   3.98,   4.34,   4.84,   5.49,   6.04
    32000,   3.95,   4.06,   4.46,   5.11,   6.11,   6.28
    64000,   3.93,   4.09,   5.55,   7.05,   9.77,  11.45

What's interesting here is that even though it's smaller, the short-word fetching on x64 seems inferior, and so at 64k elements the unshuffled cost is higher.  But if the data varies then the space savings kick in and you get a nice win.  It's sort of like moving yourself one notch down in the table at that point.
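
For reference, a 16-bit-index node would look something like this (my sketch, not the actual benchmark source; the reported sizeof(T)=6 implies the payload shrank to 16 bits as well):

#include <cstdint>

// Hypothetical 16-bit-index node: next and prev are slot numbers in the
// backing array rather than pointers, so the node is 6 bytes on both x86 and
// x64.  Only usable while the array has fewer than 65536 slots.
struct TShort
{
    std::uint16_t next;     // index of the next node in the backing array
    std::uint16_t prev;     // index of the previous node
    std::int16_t  payload;
};

static_assert(sizeof(TShort) == 6, "expected a 6-byte node");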


Performance Quiz #14: Memory Locality, x64 vs. x86, Alignment, and Density

Mon, 09/29/2014 - 05:26

[Note: The mystery of the shuffling is solved; the rand() function I was using returns only numbers between 0 and 32k, so shuffling was ineffective at large array sizes.  I will post an update.  Thank you Ryan!

See the new entry for the updated results.]

 It's been a very long time since I did a performance quiz and so it's only right that this one covers a lot of ground.  Before I take even one step forward I want you to know that I will be basing my conclusions on:

  1. A lot of personal experience
  2. A micro-benchmark that I made to illustrate it

Nobody should be confused: it would be possible to get other results, especially because this is a micro-benchmark.  However, these results do line up pretty nicely with my own experience, so I'm happy to report them.  Clearly the weight of the "other code" in your application would significantly change these results, and yet they illustrate some important points, as well as point out a mystery...  But I'm getting ahead of myself discussing the answers.  First, the questions:

 

Q1: Is x64 code really slower than x86 code if you compile basically the same program and don't change anything just like you said all those years ago? (wow, what a loaded question)

Q2: Does unaligned pointer access really make a lot of difference?

Q3: Is it important to have your data mostly sequential or not so much?

Q4: If x64 really is slower, how much of it relates to bigger pointers?

 

OK kiddies... my answers to these questions are below... but if you want to make your own guesses then stop reading now... and maybe write some code to try out a few theories.

 

 

 

 

 

Keep scrolling...

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Are you ready?

 

 

 

 

OK.

To answer these questions I wrote a benchmarking program (see link at the bottom of this posting) that creates a data structure and walks it.  The primary thing that it does is allocate an array with ever increasing size and then build a doubly-linked list in it.  Then it walks that list forwards then backwards.  The time it takes to walk the list is what is measured, not the construction.  And the times reported are divided by the number of items, so in each case you see the cost per item.  Each item is of course visited twice, so if you like the numbers are scaled by a factor of two.  And the number reported is in nanoseconds.
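
Here is a minimal sketch of the walking-and-timing idea (illustrative names of my own, not the actual source, which is linked at the bottom of the post):

// Walk a doubly-linked list that has been threaded through an array, forward
// and then backward, and report nanoseconds per visit.
#include <chrono>
#include <cstddef>
#include <cstdio>

struct T { T* next; T* prev; int payload; };   // baseline node layout

double NsPerItem(T* head, T* tail, std::size_t count)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    long long sum = 0;
    for (T* p = head; p != nullptr; p = p->next) sum += p->payload;  // forward pass
    for (T* p = tail; p != nullptr; p = p->prev) sum += p->payload;  // backward pass

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start).count();
    std::printf("checksum %lld\n", sum);    // keep the walk from being optimized away
    return (double)ns / (2.0 * count);      // each item is visited twice
}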

To make things more interesting, I also shuffle the items in the list so that they are not in their original order.  This adds some randomness to the memory access order.  To shuffle the data I simply exchange two slots a certain percentage of times, starting from 0% and then growing quickly to 100%.  A 100% shuffle means the number of exchanges is equal to the number of items in the list; that's pretty thoroughly mixed.

And as another data point I run the exact same code (always compiled for maximum speed) on x64 and then on x86.  It's the exact same machine, my own home desktop unit, which is a hexacore high-end workstation.

And then some additional test cases.  I do all that 4 different ways.  First with regular next and prev pointers and an int as the payload.  Then I add a bogus byte just to make the alignment totally horrible (and by the way it would be interesting to try this on another architecture where alignment hurts more than it does on Intel, but I don't happen to have such a machine handy).  To try to make things better I add a little padding so that things still line up pretty well and we see how this looks.  And finally I avoid all the pointer resizing by using fixed-size array indices instead of pointers so that the structure stays the same size when recompiled.  The four layouts are sketched below.
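
A rough sketch of the four node layouts (my reconstruction for illustration, not the author's source; the reported sizes, e.g. sizeof(T)=20 rather than 24 on x64 for the baseline, suggest the structures are tightly packed, so the sketch uses explicit packing and pad bytes to reproduce the same sizeof values):

#pragma pack(push, 1)    // assumed: tight packing, to reproduce the reported sizes

struct TPointer {        // baseline: 12 bytes on x86, 20 on x64
    TPointer* next;
    TPointer* prev;
    int       payload;
};

struct TUnaligned {      // leading bogus byte knocks the pointers off alignment: 13 / 21 bytes
    char        bogus;
    TUnaligned* next;
    TUnaligned* prev;
    int         payload;
};

struct TPadded {         // bogus byte plus explicit padding, pointers aligned again: 16 / 24 bytes
    TPadded* next;
    TPadded* prev;
    int      payload;
    char     bogus;
    char     pad[3];
};

struct TIndexed {        // array indices instead of pointers: 12 bytes on both architectures
    int next;            // slot number of the next node in the backing array
    int prev;
    int payload;
};

#pragma pack(pop)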

And without further ado, here are the results.  I've put some notes inline. 

 


Pointer implementation with no changes
sizeof(int*)=4  sizeof(T)=12
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   1.99,   1.85,   1.99,   1.99,   1.99,   1.99
     4000,   1.99,   2.28,   2.77,   2.92,   3.06,   3.34
     8000,   1.96,   2.03,   2.49,   3.27,   4.05,   4.59
    16000,   1.97,   2.04,   2.67,   3.57,   4.57,   5.16
    32000,   1.97,   2.18,   3.74,   5.93,   8.76,  10.64
    64000,   1.99,   2.24,   3.99,   5.99,   6.78,   7.35
   128000,   2.01,   2.13,   3.64,   4.44,   4.72,   4.80
   256000,   1.98,   2.27,   3.14,   3.35,   3.30,   3.31
   512000,   2.06,   2.21,   2.93,   2.74,   2.90,   2.99
  1024000,   2.27,   3.02,   2.92,   2.97,   2.95,   3.02
  2048000,   2.45,   2.91,   3.00,   3.10,   3.09,   3.10
  4096000,   2.56,   2.84,   2.83,   2.83,   2.84,   2.85
  8192000,   2.54,   2.68,   2.69,   2.69,   2.69,   2.68
 16384000,   2.55,   2.62,   2.63,   2.61,   2.62,   2.62
 32768000,   2.54,   2.58,   2.58,   2.58,   2.59,   2.60
 65536000,   2.55,   2.56,   2.58,   2.57,   2.56,   2.56

  Average,   2.20,   2.38,   2.86,   3.27,   3.62,   3.86
  Overall,   3.03

 

This is the baseline measurement.  You can see the structure is a nice round 12 bytes and it will align well on x86.  Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.

The results for shuffling are not exactly what I expected.  At small sizes, it makes no difference.  I expected this because basically the entire table is staying hot in the cache and so locality isn't mattering.  Then as the table grows you see that shuffling has a big impact at about 32000 elements.  That's 384k of data.  Likely because we've blown past a 256k limit.

Now the bizarre thing is this: after this the cost of shuffling actually goes down, to the point that later on it hardly matters at all.  Now I can understand that at some point shuffled or not shuffled really should make no difference because the array is so huge that runtime is largely gated by memory bandwidth regardless of order.  However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.

What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there.  What actually happens is that at middle sizes non-locality seems to cause things to go very very bad...  And I do not know why :)

But other than that one anomaly things are going pretty much as expected.

Now let's look at the exact same thing, only it's now on x64

 

Pointer implementation with no changes

sizeof(int*)=8  sizeof(T)=20
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   2.28,   2.28,   2.28,   1.99,   2.28,   1.99
     2000,   2.28,   2.28,   2.56,   2.99,   3.13,   3.27
     4000,   2.28,   2.35,   3.06,   3.91,   4.84,   5.26
     8000,   2.28,   2.38,   3.27,   4.48,   5.90,   6.15
    16000,   2.36,   2.63,   4.12,   6.28,   8.53,  10.20
    32000,   2.36,   2.68,   5.30,   9.24,  13.40,  16.76
    64000,   2.25,   2.90,   5.50,   8.28,  10.36,  10.62
   128000,   2.42,   2.92,   4.86,   6.31,   6.49,   6.34
   256000,   2.42,   2.74,   4.25,   4.52,   4.43,   4.61
   512000,   2.75,   3.86,   4.31,   4.42,   4.56,   4.48
  1024000,   3.56,   4.82,   5.42,   5.42,   5.28,   5.21
  2048000,   3.72,   4.36,   4.64,   4.64,   4.66,   4.67
  4096000,   3.79,   4.23,   4.20,   4.23,   4.20,   4.23
  8192000,   3.77,   3.99,   3.98,   4.00,   3.99,   3.99
 16384000,   3.75,   3.88,   3.87,   3.87,   3.89,   3.89
 32768000,   3.78,   3.86,   3.83,   3.80,   3.81,   3.83
 65536000,   3.74,   3.80,   3.79,   3.81,   3.83,   3.82

  Average,   2.93,   3.29,   4.07,   4.83,   5.50,   5.84
  Overall,   4.41
  X64/X86,   1.46

 

Well would you look at that... the increased data size has caused us to go quite a bit slower.  The average ratio shows that execution time is 1.46 times longer.  This result is only slightly larger than typical in my experience when analyzing data processing in pointer rich structures.

Note that it doesn't just get bad at the end, it's bad all along.  There are a few weird data points but this isn't an absolutely controlled experiment.  For instance the 1.99 result for 1000 items isn't really indicating that it was better with more shuffling.  The execution times are so small that timer granularity is a factor and I saw it switching between 1.99 and 2.28.  Things get a lot more stable as n increases.

Now let's look what happens when the data is unaligned.

 

Pointer implementation with bogus byte in it to force unalignment

sizeof(int*)=4  sizeof(T)=13
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   1.99,   1.99,   1.99,   1.99,   2.28,   1.99
     2000,   2.13,   2.13,   2.13,   2.13,   2.13,   2.13
     4000,   2.13,   2.13,   2.49,   3.06,   3.70,   3.91
     8000,   2.10,   2.17,   2.88,   3.88,   4.76,   5.33
    16000,   2.10,   2.20,   3.08,   4.21,   5.40,   6.17
    32000,   2.17,   2.39,   4.21,   6.92,  10.10,  12.83
    64000,   2.16,   2.46,   4.50,   6.74,   8.18,   8.62
   128000,   2.14,   2.45,   4.13,   5.19,   5.40,   5.41
   256000,   2.14,   2.41,   3.61,   3.78,   3.77,   3.77
   512000,   2.18,   2.51,   2.97,   3.12,   3.16,   3.11
  1024000,   2.45,   3.12,   3.44,   3.43,   3.46,   3.54
  2048000,   2.76,   3.30,   3.36,   3.35,   3.37,   3.36
  4096000,   2.75,   3.08,   3.05,   3.04,   3.07,   3.05
  8192000,   2.75,   2.90,   2.88,   2.90,   2.90,   2.90
 16384000,   2.75,   2.82,   2.82,   2.82,   2.82,   2.82
 32768000,   2.74,   2.78,   2.77,   2.79,   2.77,   2.78
 65536000,   2.74,   2.76,   2.75,   2.75,   2.76,   2.76

  Average,   2.36,   2.56,   3.12,   3.65,   4.12,   4.38
  Overall,   3.37

 

This data does show that things got somewhat slower.  But also the data size grew by about 8%.  In fact if you look at the first column and compare the bottom row there, you'll find that amortized execution at the limit grew by 7.4%, basically the same as the data growth.  On the other hand, changes due to shuffling were greater, so that the overall index grew by 8.3%.  But I think we could support the conclusion that most of the growth had to do with the fact that we read more memory and only a small amount of it had to do with any extra instruction cost.

Is the picture different on x64?

 

Pointer implementation with bogus byte in it to force unalignment

sizeof(int*)=8  sizeof(T)=21
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   2.28,   2.28,   2.28,   2.28,   2.28,   2.28
     2000,   2.42,   2.42,   2.84,   3.27,   3.70,   3.84
     4000,   2.42,   2.49,   3.34,   4.48,   5.55,   6.12
     8000,   2.56,   2.52,   3.70,   5.23,   6.40,   7.15
    16000,   2.61,   2.81,   4.85,   7.36,   9.96,  12.02
    32000,   2.53,   2.86,   5.80,  10.18,  15.25,  18.65
    64000,   2.53,   2.94,   5.88,   9.14,  11.33,  11.64
   128000,   2.53,   2.94,   5.41,   7.11,   7.09,   7.09
   256000,   2.57,   3.09,   5.14,   4.96,   5.07,   4.98
   512000,   3.21,   3.58,   5.29,   5.05,   5.14,   5.03
  1024000,   3.74,   5.03,   5.94,   5.79,   5.75,   5.94
  2048000,   4.01,   4.84,   4.96,   4.93,   4.92,   4.96
  4096000,   4.00,   4.47,   4.49,   4.46,   4.46,   4.46
  8192000,   3.99,   4.21,   4.21,   4.21,   4.06,   4.21
 16384000,   3.97,   4.08,   4.08,   4.07,   4.08,   4.08
 32768000,   3.96,   4.02,   4.02,   4.03,   4.03,   4.03
 65536000,   3.96,   3.99,   4.00,   3.99,   4.00,   3.99

  Average,   3.13,   3.45,   4.48,   5.33,   6.06,   6.50
  Overall,   4.83
  X64/X86,   1.43

 

The overall ratio was 1.43 vs. the previous ratio of 1.46.  That means the extra byte did not disproportionately affect the x64 build either.  And in this case the pointers are really crazily unaligned.  The same shuffling weirdness happens as before.

Unaligned pointers don't seem to be costing  us much.

What if we do another control, increasing the size and realigning the pointers?

 

Pointer implementation with extra padding to ensure alignment

sizeof(int*)=4  sizeof(T)=16
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   1.99,   1.99,   1.99,   1.71,   1.99,   1.99
     2000,   1.99,   1.99,   2.13,   2.13,   2.13,   2.13
     4000,   2.28,   1.99,   2.49,   3.34,   3.70,   4.05
     8000,   1.99,   2.06,   2.74,   3.66,   4.59,   5.08
    16000,   2.04,   2.26,   3.16,   4.18,   5.32,   6.06
    32000,   2.04,   2.35,   4.44,   7.43,  10.92,  14.20
    64000,   2.04,   2.38,   4.60,   7.03,   8.74,   9.11
   128000,   2.03,   2.37,   4.24,   5.42,   5.58,   5.59
   256000,   2.05,   2.36,   3.66,   3.84,   3.83,   4.07
   512000,   2.22,   2.59,   3.15,   3.37,   3.10,   3.39
  1024000,   2.76,   3.81,   4.10,   4.09,   4.26,   4.18
  2048000,   3.03,   3.66,   3.83,   3.82,   3.78,   3.78
  4096000,   3.04,   3.42,   3.40,   3.43,   3.41,   3.42
  8192000,   3.06,   3.23,   3.24,   3.23,   3.24,   3.24
 16384000,   3.05,   3.15,   3.14,   3.14,   3.13,   3.14
 32768000,   3.05,   3.10,   3.10,   3.09,   3.10,   3.09
 65536000,   3.07,   3.08,   3.07,   3.08,   3.07,   3.08

  Average,   2.45,   2.69,   3.32,   3.88,   4.35,   4.68
  Overall,   3.56

 

Well, in this result we converge at about 3.07, and our original code was at 2.55.  Certainly re-aligning the pointers did not help the situation.  We're actually about 20% worse than the original number and 12% worse than the unaligned version.

And let's look at x64...

 

Pointer implementation with extra padding to ensure alignment

sizeof(int*)=8  sizeof(T)=24
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   2.13,   2.28,   2.70,   2.99,   3.70,   3.70
     4000,   2.20,   2.28,   2.99,   3.84,   4.55,   4.84
     8000,   2.42,   2.38,   3.34,   4.37,   4.98,   5.37
    16000,   2.45,   2.68,   4.55,   7.04,   9.71,  11.88
    32000,   2.46,   2.80,   5.43,   9.25,  13.48,  17.16
    64000,   2.42,   2.80,   5.46,   8.46,  10.37,  10.70
   128000,   2.40,   2.80,   5.00,   6.43,   6.55,   6.56
   256000,   2.51,   3.18,   4.92,   5.34,   5.00,   4.89
   512000,   3.90,   4.70,   5.97,   6.50,   5.63,   5.59
  1024000,   4.15,   5.24,   6.34,   6.28,   6.24,   6.33
  2048000,   4.32,   5.13,   5.28,   5.33,   5.34,   5.27
  4096000,   4.32,   4.78,   4.77,   4.81,   4.78,   4.79
  8192000,   4.29,   4.55,   4.55,   4.56,   4.55,   4.54
 16384000,   4.28,   4.42,   4.42,   4.43,   4.42,   4.42
 32768000,   4.30,   4.36,   4.37,   4.37,   4.38,   4.37
 65536000,   4.23,   4.38,   4.35,   4.34,   4.34,   4.33

  Average,   3.22,   3.57,   4.50,   5.31,   5.88,   6.28
  Overall,   4.79
  X64/X86,   1.35

 

Now with the extra padding we have 8-byte-aligned pointers; that should be good, right?  Well, no, it's worse.  The top end is now about 4.3 nanoseconds per item compared with about 4 nanoseconds before, or about 7.5% worse having used more space.  We didn't pay the full 14% of data growth so there is some alignment savings, but not nearly enough to pay for the space.  This is pretty typical.

 

And last but not least, this final implementation uses indices for storage instead of pointers.  How does that fare?

 

Standard index based implementation

 

sizeof(int*)=4  sizeof(T)=12
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   3.41,   3.70,   3.41,   3.41,   3.41,   4.27
     2000,   3.41,   3.56,   3.41,   3.41,   3.41,   3.98
     4000,   3.41,   3.48,   3.63,   3.98,   4.41,   4.62
     8000,   3.41,   3.59,   3.88,   4.62,   5.23,   5.76
    16000,   3.43,   3.48,   4.02,   4.80,   5.76,   6.31
    32000,   3.50,   3.64,   5.10,   7.20,   9.80,  11.99
    64000,   3.48,   3.74,   5.41,   7.26,   8.52,   8.88
   128000,   3.49,   3.72,   5.10,   5.98,   6.17,   6.18
   256000,   3.48,   3.70,   4.66,   4.82,   4.83,   4.82
   512000,   3.52,   3.72,   4.13,   4.24,   4.14,   4.30
  1024000,   3.57,   4.25,   4.60,   4.59,   4.46,   4.43
  2048000,   3.79,   4.23,   4.37,   4.35,   4.36,   4.34
  4096000,   3.77,   4.05,   4.06,   4.06,   4.06,   4.07
  8192000,   3.77,   3.91,   3.93,   3.92,   3.91,   3.93
 16384000,   3.78,   3.84,   3.83,   3.83,   3.84,   3.84
 32768000,   3.78,   3.80,   3.80,   3.80,   3.80,   3.79
 65536000,   3.77,   3.78,   3.78,   3.78,   3.80,   3.78

  Average,   3.57,   3.78,   4.18,   4.59,   4.94,   5.25
  Overall,   4.39

 

Well, clearly the overhead of computing the base plus offset is a dead loss on x86 because there is no space savings for those indexes.  They are the same size as a pointer so messing with them is pure overhead.

However... let's look at this test case on x64...

 

Standard index based implementation

sizeof(int*)=8  sizeof(T)=12
  shuffle,     0%,     1%,    10%,    25%,    50%,   100%
     1000,   3.41,   3.41,   3.41,   3.98,   3.41,   3.41
     2000,   3.41,   3.41,   3.70,   3.41,   3.41,   3.41
     4000,   3.41,   3.48,   3.63,   3.98,   4.34,   4.76
     8000,   3.45,   3.45,   3.84,   4.48,   5.33,   5.69
    16000,   3.48,   3.57,   3.98,   4.78,   5.71,   6.28
    32000,   3.48,   3.64,   5.11,   7.16,   9.69,  11.99
    64000,   3.48,   3.73,   5.37,   7.20,   8.47,   8.84
   128000,   3.48,   3.72,   5.10,   5.96,   6.25,   6.14
   256000,   3.49,   3.69,   4.66,   4.83,   4.82,   4.88
   512000,   3.52,   3.72,   4.22,   4.22,   4.22,   4.24
  1024000,   3.59,   4.01,   4.31,   4.53,   4.45,   4.40
  2048000,   3.80,   4.27,   4.33,   4.25,   4.35,   4.38
  4096000,   3.80,   3.97,   4.06,   4.06,   4.07,   4.06
  8192000,   3.79,   3.92,   3.92,   3.93,   3.93,   3.91
 16384000,   3.77,   3.84,   3.83,   3.82,   3.85,   3.85
 32768000,   3.76,   3.81,   3.81,   3.80,   3.80,   3.81
 65536000,   3.76,   3.78,   3.78,   3.79,   3.78,   3.78

  Average,   3.58,   3.73,   4.18,   4.60,   4.93,   5.17
  Overall,   4.37
  X64/X86,   1.00

 

And now we reach our final conclusion... At 3.76 the top end is coming in at a dead heat with the x86 implementation.  The raw x64 benefit in this case is basically zip.  And actually this benchmark tops out at about the same cost per slot as the original pointer version but it uses quite a bit less space (40% space savings).  Sadly the index manipulation eats up a lot of that savings, so in the biggest cases we only come out about 6% ahead.

 

 

Now of course it's possible to create a benchmark that makes these numbers pretty much whatever you want them to be by simply manipulating how much pointer math there is, vs. how much reading, vs. how much "actual work".   

And of course I'm discounting all the other benefits you get from running on x64 entirely, this is just a memory cost example, so take this all with a grain of salt.  If there's a lesson here it's that you shouldn't assume things will be automatically faster with more bits and bigger registers, or even more registers.

The source code I used to create this output is available here

 

*The "Average" statistic is the average of the column above it

*The "Overall" statistic is the average of all the reported nanosecond numbers.

*The x64/x86 ratio is simply the ratio of the two "Overall" numbers


Wow I love git-tf!

Sat, 09/27/2014 - 00:10

I switched to git about 3 years ago because the portability was so great.  Moving work between computers couldn't be easier.  But when I did that I lost all my TFS history from the TFS Express instance I had been using up until then.  Four years of useful history.

Last week I saw this git-tf thing, so I restored my old TFS databases on a box, put Team Foundation Server Express on that puppy and did an in-place upgrade.  Bam, up it comes as fast as you can say "I wish I had a faster internet connection."

Five minutes later I had a git clone of my old stuff.

Then I just added a remote, did a fetch, and rebased my whole master branch on top of my old history.

And just like that it's all back!

OK, my commit IDs are all different now, but that's OK, it's just me anyway.

I pushed it all up to my visual studio account using VS Git integration and now all of it is backed up.

I feel like someone just returned a lost pet to my house :)


You don't have to write it (all) first...

Thu, 09/25/2014 - 05:06

It seems like I get pretty much the same questions all the time.  A common one is, "Rico can you tell me if it would be ok for me to use <technology> to solve this <problem>?  How much does <technology> cost anyway?"

The answer is (nearly) always the same: "How the hell should I know?"

This is frequently followed by lamentations that there isn't some giant book of costs in which you can look up <technology> and see how expensive it is in <units>.

The flip side of this situation is the person who decides that they can't possibly know what anything is going to cost until they build the system and measure it.  I suppose if you're being pedantic that's true, but as a practical matter nothing could be further from the truth.  Both situations have the same resolution.

If you want to get a sense of what something is going to cost, build a little gizmo that does kinda sorta what you have in mind and measure it to get a feel for the costs.  Don't do all the work of getting your business logic right and the specifics perfect, throw something together on the cheap that is going to give you a feel for the essential costs you are going to have. 

Here's a very specific example that came up not too long ago:  someone wanted an opinion on what it would cost to convert their data protocol from http to https.  How do you get that answer?  Well, getting a complete answer will require a big investment, but getting a sense of whether you can afford this and what to look for is remarkably easy.  Write a little thingy that fetches about the right amount of data with http then replace it with https -- this is like a 10 minute exercise.  Then look at both from a CPU perspective, and maybe network I/O, and DLLs loaded/code used too.  Try various request sizes and in particular some that are kinda like what you intend to do.  This can very quickly give you a sense of whether this is something you can afford and will give you excellent insights into what your final costs will actually be.  As your real code comes along, keep looking at it using your initial measurements as a touchstone.  If your costs start to be wildly different than your initial forecast, then you missed something.  If that happens you're going to know a lot sooner than the finish line that things are not going as expected.
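
As a concrete illustration, a throwaway probe along these lines is enough to get a first feel for the wall-clock difference (this sketch uses libcurl purely as one convenient option; the post doesn't prescribe any particular library, and the URLs are placeholders).  You'd watch CPU, network I/O, and modules loaded with your profiler of choice while it runs:

// Quick-and-dirty probe: time one http fetch vs. one https fetch of a
// similarly sized payload.
#include <chrono>
#include <cstdio>
#include <curl/curl.h>

static size_t Discard(char*, size_t size, size_t nmemb, void*)
{
    return size * nmemb;   // throw the body away; we only care about the cost
}

static double FetchMs(const char* url)
{
    CURL* h = curl_easy_init();
    if (!h) return -1.0;
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, Discard);

    auto start = std::chrono::steady_clock::now();
    CURLcode rc = curl_easy_perform(h);
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count();

    curl_easy_cleanup(h);
    return (rc == CURLE_OK) ? ms : -1.0;
}

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::printf("http : %.1f ms\n", FetchMs("http://example.com/payload"));
    std::printf("https: %.1f ms\n", FetchMs("https://example.com/payload"));
    curl_global_cleanup();
    return 0;
}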

This technique works in many domains -- database connection and query costs, UI frameworks, data processing, serialization, pretty much anything -- it's kind of like test driven development but for performance factors rather than correctness.

Whatever you do, don't tolerate a "we won't know until the end" attitude or, just as bad, "ask some expert, they're bound to know..."

Experiments are your friend.

 

 


To preload or not to preload...

Fri, 08/29/2014 - 21:25

Q:

My application starts slowly, I want to preload it to avoid that problem.   Should I be worried?

A:

Well, in short, there are lots of concerns.  Preloading things you may or may not need is a great way to waste a ton of memory and generally make the system less usable overall.

I’m often told that the answer to a performance problem is to simply preload the slow stuff… unfortunately that doesn’t work as a general solution if everyone does it.  It’s classic “improve the benchmark” thinking.

When developing for Windows you have to think about all kinds of scenarios, such as the case where there are several hundred users trying to share a server, each with their own user session.  Your application might also need to run in very memory-constrained environments like a small tablet or some such – you do not want to be loading extra stuff in those situations.
 
The way to make a system responsive is to KEEP IT SIMPLE.  If you don’t do that, then it won’t matter that you’ve preloaded it -- when the user actually gets around to starting the thing in a real world situation, you will find that it has already been swapped out to try to reclaim some of the memory that was consumed by preloading it.  So you will pay for all the page faults to bring it back, which is probably as slow as starting the thing in the first place.  In short, you will have accomplished nothing other than using a bunch of memory you didn’t really need.

Preloading in a general purpose environment is, pretty much, a terrible practice.  Instead, pay for what you need when you need it and keep your needs modest.  You only have to look at the tray at the bottom right of your screen, full of software that was so sure it was vitally important to you that it insisted on loading at boot time, to see how badly early loading scales up.

Adding fuel to this already bonfire-sized problem is this simple truth: any application preloading itself competes with the system trying to do the very same thing.  Windows has long included powerful features to detect the things you actually use and get them into the disk cache before you actually use them, whether they are code or data.  Forcing your code and data to be loaded is just as likely to create more work evicting the unnecessary bits from memory to make room for something immediately necessary, whereas doing nothing would have resulted in ready-to-go bits if the application is commonly used with no effort on your part.

See: http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies

Bottom line, preloading is often a cop out.  Better to un-bloat.


On adopting high end perf tools to study micro-architectural phenomena

Fri, 08/29/2014 - 20:05

Huge words of caution: you can bury yourself in this kind of stuff forever and for my money it is rarely the way to go.  It’s helpful to know where you stand on CPI (cycles per instruction) for instance, but it’s much more typical to get results by observing that you (e.g.) have a ton of cache misses and therefore should use less memory.  Using less memory is always a good thing.

You could do meaningful analysis for a very long time without resorting to micro-architectural phenomena simply by studying where your CPU goes.

It is not only the case that (e.g.) ARM does things differently than (e.g.) x86 products, it is also the case that every x86 processor family you have ever heard of does it differently than every other one you have ever heard of.  But that turns out to be not that important for the most part.  Because the chief observations like “we branch too much” are true universally.  Just as “we use too much memory” is basically universally true.

The stock observations that you should:

1. Use less memory
2. Use fewer pointers and denser data structures
3. Not jump around so much

Are essentially universally true.  The question really comes down to what you can get away with on any given processor because its systems will save the day for you.  But even that is a bit of a lie, because the next question is “what else could you be doing and your program would still run well?” because the fact is there is always other stuff going on and if you minimize your use of CPU resources generally you will be a better citizen overall.

In short, the top level metrics (CPU, Disk, Memory, Network) will get you very far indeed without resorting to mispredicts and the like.  If you want to use the tools effectively, with broad results, I strongly recommend that you target the most important metrics, like L2 cache misses, and reduce them.  That’s always good.  Pay much less attention to the specific wall-clock consequence in lab scenarios and instead focus on reducing your overall consumption.

And naturally this advice must be tempered with focus on your customers actual problems and forgive me for being only approximately correct in 400 words or less.

 
