One of my readers asked if I had any particular plans around memory density for Visual Studio to go with my last posting, and indeed I had one thought that I considered a very powerful way to create data structures that were more likely to be scalable. Like most of my ideas it’s a very simple thought, but I think it’s still effective, and I still give the same advice because I’m boring like that.
Basically, it’s this: rather than design your data structures the way they taught you in your CS2xx class, design them like you were going to store them in a SQL database. In fact, you are likely to be well served by using a schema designer. Then store them exactly like that in memory. RAM is the new disk. Cache is the new RAM.
In fact, I suggested at the time that we write a thing I called “Linq to Memory” – in stark contrast to “Linq to Objects” – to help facilitate this. Basically, Linq to Memory was a hypothetical thing that would be more like an in-memory database, with “tables” based on dense structures like B-trees and such, but with no query language other than Linq needed.
Because these choices give you an “automatic” way to think about isolation, locking, multiple readers, and even multiple writers if needed. And they do so comparatively economically. And if you include the compiled query type options that are present in Linq to SQL or Linq to Entities then you can get something very efficient for read access.
Note that this design actively discourages the forest of pointers that you normally get in OOP and yet you get a nice object model to work with (like Linq to SQL gives you) with the slices you are actually working with hydrated into objects and the other parts densely stored.
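To make the idea concrete, here is a minimal sketch in Python (the table, columns, and query helper are my own illustrative names; no real “Linq to Memory” library is assumed): values live in dense columns keyed by small integer ids instead of pointers, and objects are hydrated only for the rows you actually touch.

```python
from array import array

class EmployeeTable:
    """Dense, column-oriented storage: values live in compact arrays,
    and rows are referenced by small integer ids instead of pointers."""
    def __init__(self):
        self.names = []                  # column: name
        self.dept_ids = array('H', [])   # column: 16-bit department index, not a pointer
        self.salaries = array('L', [])   # column: 32-bit salary

    def insert(self, name, dept_id, salary):
        self.names.append(name)
        self.dept_ids.append(dept_id)
        self.salaries.append(salary)
        return len(self.names) - 1       # the "primary key"

    def hydrate(self, row_id):
        # Materialize an object only for the slice being worked on.
        return {"id": row_id, "name": self.names[row_id],
                "dept_id": self.dept_ids[row_id],
                "salary": self.salaries[row_id]}

    def query(self, predicate):
        # Stand-in for a compiled Linq-style query over the dense columns.
        return [self.hydrate(i) for i in range(len(self.names))
                if predicate(self.dept_ids[i], self.salaries[i])]

t = EmployeeTable()
t.insert("Ada", 1, 120000)
t.insert("Bob", 2, 90000)
rich = t.query(lambda dept, sal: sal > 100000)   # only "Ada" is hydrated
```

The scan stays cache-friendly because the predicate runs over flat arrays; only matching rows pay the object cost.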
This metaphor easily supports multiple ACID choices with as much or as little durability and isolation as you want.
It also facilitates things like 64-bit systems, because you do not have to pay for a 64-bit pointer (or five) when a smaller or variable-sized index would do the job. Yet you can have huge amounts of data if you need it.
Natural keys facilitate locality as do typical database style indices.
One-time overhead for creating your compiled queries to access the data is amortized over the life of the program, and savings due to compactness continue to yield benefits forever. The lack of pointers in the primary storage also reduces garbage collection costs massively.
MANY systems are well suited to this choice. And as primary storage it provides both clarity and economy.
I haven’t seen anything like this, but it’s totally buildable; I knew where to find the back-end for it in-house. It’s good not only for VS but for many applications with non-trivial data storage requirements.
I wrote this a long time ago. It's very interesting to me how applicable this is today. And how it is not very specific to Visual Studio at all...
Low Level Considerations for VS of the Future
Rico Mariani, Sept 12/2007
I’ve been giving much thought to what enabling steps we have to take to make our VS12+ experience something great. I will not be trying to talk about specific features at all; I’m not sure we are anywhere near the point where we could talk about such things. But I do consider the prototype we’ve seen as broadly representative of what the underlying needs would be for whatever experience we ultimately create.
So without further ado, here are some areas that will need consideration with information as to what I think we need to do, broadly, and why.
Memory Usage Reduction
Although we can expect tremendous gains in overall processor cycles available to us within the next decade, we aren’t expecting similar gains in memory availability – neither cache capacity nor total bandwidth. Project sizes, file sizes, and code complexity are generally going up, but our ability to store these structures is not increasing.
To get past this, and to enable the needed parallelism, we must dramatically reduce our in-memory footprint for all key data structures. Everything must be “virtualized” so that just what we need is resident. In an IDE world with huge solutions in many windows, vast amounts of the content must be unrealized. Closing is unnecessary, because things that are not displayed have very little cost.
Investment -> Ongoing memory cost reduction
In addition to having less overall memory usage, our system must also have very dense data structures. Our current system trends in the “forest of pointers” direction, with in-memory representations being not only larger than on-disk representations but also comparatively rich in pointers and light on values.
More value oriented data-structures with clear access semantics (see below) will give us the experience we need. The trend to expand data to “pre-computed” forms will be less favorable than highly normalized forms because extra computation will be comparatively cheaper than expanding the data and consuming memory. Memory is the new disk.
Investment -> Drive density into key data structures, use value-rich types
Key data structures, like documents, need the notion of transactions to support unit-of-work semantics. This is important for many cases where rollback is critical. There need not be anything “magic” about this: it doesn’t have to be transacted memory, it’s merely transaction support in the API. To help resolve parallelism concerns and disconnected-data concerns, the notion that reads and writes might fail is important, and at that point atomicity of operations is crucial. Transacted documents may or may not imply locking or versioning, but don’t have to. See the next topic.
Investment -> Key data-structures gain a transacted API
Investment -> Key data-structures use transactions to communicate “scope of change” after edits.
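Here is a hedged sketch, in Python, of what “transaction support in the API” (as opposed to transacted memory) might look like; the Document/Transaction shapes are my invention, not a real VS API. Rollback is plain bookkeeping over a snapshot, and commit hands back the scope of change for listeners.

```python
class Document:
    """A document whose edits go through an explicit transaction API.
    Rollback is plain bookkeeping -- no magic transacted memory."""
    def __init__(self, text=""):
        self.text = text
        self.version = 0

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, doc):
        self.doc = doc
        self.snapshot = doc.text        # enough state to roll back
        self.base_version = doc.version
        self.edits = []

    def replace(self, start, end, new_text):
        self.edits.append((start, end, new_text))
        self.doc.text = self.doc.text[:start] + new_text + self.doc.text[end:]

    def commit(self):
        self.doc.version += 1
        return self.edits               # the "scope of change" after the edit

    def rollback(self):
        # Atomicity: all edits in the unit of work are undone together.
        self.doc.text = self.snapshot
        self.doc.version = self.base_version

doc = Document("hello world")
tx = doc.begin()
tx.replace(0, 5, "goodbye")
tx.rollback()                           # document is back to "hello world"
```

A real version would snapshot more cheaply than copying the whole text, but the API shape is the point: reads and writes can fail, and the caller always has a well-defined place to retry.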
Having an isolation model means you can understand what happens when someone changes the data structure out from under you. What changes will you see and what won’t you? How much data can you read and expect to be totally self-consistent?
Investment -> Key data-structures must have an isolation model
A UX that is not strongly STA based
The IDE of the future must focus presentation in a single concentrated place but, in a game-like fashion, produce “display lists” of primitives, delivered to the rendering system in an “inert” fashion (i.e. as data not as API calls) so that presentation is entirely divorced from computing the presentation and the rendering system is in fact always ready to render.
This is entirely harmonious with the fact that we want all these display lists to be virtualized. You need not “close” windows – merely moving them out of frame is enough to reclaim the resources associated with the display. The 1000s of windows you might have cost you nothing when you are not looking at them.
Investment -> A standard method for pipelining ready-to-render displays
Investment -> an STA free programming model with non-STA OS services (e.g. file open dialog)
Twin Asynchronous Models Pervasively
It is easiest to illustrate these two needs by example:
- Search for “foo”
  - This can be done in a totally asynchronous fashion with different “threads” proceeding in parallel against different editor buffers or other searchable contexts. All the “foos” light up as they are discovered.
  - Foos that are not visible need not be processed, so this sort of parallelism is “lazy”
- Replace “foo” with “bar”
  - We can still do this asynchronously, but now there is an expectation that we will eventually do it all, and of course we must do all the foos, not just the visible ones
  - Generally this requires progress, and a transaction
  - It can fail, or be cancelled; these are normal things
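As a toy illustration of the two models, here is a Python asyncio sketch (the buffer shape and function names are mine, chosen just for this example): the lazy search skips invisible buffers entirely and can be cancelled at any time, while the replace must visit everything and report progress.

```python
import asyncio

async def highlight_visible(buffers, needle):
    """Lazy model: only visible buffers are searched; the work for the
    rest simply never runs."""
    hits = []
    for buf in buffers:
        if not buf["visible"]:
            continue                      # lazy: invisible foos are skipped
        await asyncio.sleep(0)            # yield, so results light up as found
        hits += [(buf["name"], i) for i, line in enumerate(buf["lines"])
                 if needle in line]
    return hits

async def replace_all(buffers, needle, replacement, progress):
    """Complete model: every buffer must be processed and progress is
    reported; the whole thing is one cancellable unit of work."""
    for n, buf in enumerate(buffers):
        buf["lines"] = [l.replace(needle, replacement) for l in buf["lines"]]
        progress(n + 1, len(buffers))
        await asyncio.sleep(0)

buffers = [{"name": "a.txt", "visible": True,  "lines": ["foo bar", "baz"]},
           {"name": "b.txt", "visible": False, "lines": ["more foo"]}]
hits = asyncio.run(highlight_visible(buffers, "foo"))      # only a.txt searched
asyncio.run(replace_all(buffers, "foo", "qux", lambda done, total: None))
```

The contrast is the point: the first coroutine is free to do only the visible fraction of the work, the second carries the contract that every buffer is eventually done.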
These represent two cases where we can use the “coarse” parallelism model. Of course we would also like to do fine-grained parallelism in places, and in fact searching could be amenable to this as well. Fine-grained parallelism has the opportunity to keep locality good, because you stream over the data once rather than access disparate data streams all at once, but it is of course more complicated. Enabling both of these requires the previous investments:
- Limited memory usage in the face of many threads running
- Ability to access the data in slices that are durable and limited in span
- Clear boundaries for write operations that may happen while the work is going on
- This also allows for notifications
- This allows for a clear place to re-try in the event of deadlocks/failures
- Isolation so that what you see when you change or update is consistent to the degree that was promised (it’s part of the contract)
- No STA UI so that you can present the results in parallel without any cosmic entanglements outside of the normal locking the data-structures require – no “stealth” reentrancy
Investment -> “Lazy” asynchronous display operations
Investment -> Asynchronous change operations
Investment -> Failure tolerant retry like in good client server applications
With the above investments in place it becomes possible to use multi-threading APIs to create high-speed experiences that allow for real-time zooming, panning, searching, background building, rich IntelliSense, and other algorithms that can exploit the coarse parallelism abundant in large solutions. This can be done with a variety of threading models; I like data-pipelined models myself, but they aren’t necessary. To consumers of the data, access to key data-structures increasingly looks like access to a database.
Even with little to no change in actual processor dispatch mechanisms we get these benefits:
- The underlying data storage is enabled to take advantage of CPU resources available
- Other factors which would make parallelism moot are mitigated
Under these circumstances a “game quality” UX is possible – that’s the vision for VS12. To get there, we need to start thinking about these considerations now so that our designs begin to reflect our future needs.
I have been using this approach to do systematic analysis of performance regressions for several years now. I came up with it while looking at some tricky problems in Internet Explorer about three years ago and it’s served me well since then. The idea is a pretty simple one but it gives surprisingly good results in many cases.
I’ll be giving examples that talk about CPU as the metric but of course the same procedure works for any metric for which you can compute inclusive costs.
Nomenclature: Inclusive cost (e.g. time) is the cost in a method and everything it calls. Exclusive cost (e.g. time) is the cost from only the method itself, not counting anything it calls. Both are interesting but this technique really relies on inclusive cost.
Now the usual situation: You have some test that used to take say 50ms and now it takes 55ms. That’s a 10% growth. You want to know where to start looking and you’re fortunate enough to have a summary of costs from before and after. But there could be thousands of symbols and the costs could be spread all over the place. Also some symbols might have been renamed or other such inconvenient things. You could try staring at the traces in call-tree outlining but that gets very tedious especially if the call stacks are 50 levels deep or so. It’s when things get big and messy that having an analysis you can automate is helpful. So here’s how I do it.
First I consider only symbols that appear in both traces. That’s not everything, but it’s a lot, and it’s typically enough to give you a solid hint. For each symbol I know the inclusive cost in the base case and the test case, and from this I can compute the delta easily enough to tell me how much it grew. Now the magic. Since I know how much the overall scenario regressed (10% in this example), I can easily compute how much any particular symbol should have gotten slower if we take as our null hypothesis that “bah, it’s all just evenly slower because it sucks to be me.” So a symbol that had a previous cost of 10 in my example here should have a growth of 10%, or a delta of 1. We compute the ratio of the actual delta to the expected delta, call that the “overweight percentage,” and then we sort on that. And then stuff starts popping out like magic.
I’ll have more examples shortly, but let’s do a very easy one so you can see what’s going on. Suppose main calls f and g and does nothing else. Each takes 50ms for a total of 100ms. Now suppose f gets slower, to 60ms. The total is now 110ms, or 10% worse. How is this algorithm going to help?

Well, let’s look at the overweights. Of course main went from 100 to 110, or 10%; it’s all of it, so the expected growth is 10 and the actual is 10. Overweight 100%. Nothing to see there. Now let’s look at g: it was 50 and stayed at 50. But it was “supposed” to go to 55. Overweight 0/5, or 0%. And finally, our big winner, f: it went from 50 to 60, a gain of 10. At 10% growth it should have gained 5. Overweight 10/5, or 200%. It’s very clear where the problem is!

But actually it gets even better. Suppose that f actually had two children, x and y. Each used to take 25ms, but now x slowed down to 35ms. With no gain attributable to y, the overweight for y will be 0%, just like g. But if we look at x we will find that it went from 25 to 35, a gain of 10, when it was supposed to grow by merely 2.5. So its overweight is 10/2.5, or 400%. At this point the pattern should be clear:
The overweight number keeps going up as you get closer to the root of the subtree which is the source of the problem. Everything below that will tend to have the same overweight. For instance if the problem is that x is being called one more time by f you’d find that x and all its children have the same overweight number.
This brings us to the second part of the technique. You want to pick a symbol that has a big overweight but is also responsible for a largish fraction of the regression. So we compute its growth and divide by the total regression cost to get the responsibility percentage. This is important because sometimes you get leaf functions that had 2 samples and grew to 3 just because of sampling error. Those can look like enormous overweights, so you have to concentrate on methods that have a reasonable responsibility percentage and also a big overweight.
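In code, the whole computation is just a few lines. Here is a condensed sketch in Python (my own boiled-down version, not the full sample program; it assumes the root symbol has the largest inclusive cost):

```python
def overweight_report(base, test):
    """base/test map symbol -> inclusive cost. Returns rows of
    (symbol, overweight %, responsibility %) for symbols in both traces,
    sorted by overweight."""
    total_base = max(base.values())      # inclusive cost of the root
    total_test = max(test.values())
    growth = (total_test - total_base) / total_base
    regression = total_test - total_base
    rows = []
    for sym in base.keys() & test.keys():
        delta = test[sym] - base[sym]
        expected = base[sym] * growth    # null hypothesis: uniform slowdown
        overweight = 100.0 * delta / expected if expected else 0.0
        responsibility = 100.0 * delta / regression
        rows.append((sym, overweight, responsibility))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows

# The main/f/g example from the text: f slows from 50 to 60.
base = {"main": 100, "f": 50, "g": 50}
test = {"main": 110, "f": 60, "g": 50}
for sym, ow, resp in overweight_report(base, test):
    print(f"{sym}: overweight {ow:.0f}%, responsible {resp:.0f}%")
```

Running it on the worked example puts f at the top with 200% overweight and 100% responsibility, exactly as computed by hand above.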
Below are some examples as well as the sample program I used to create them and some inline analysis.
Example 1, baseline
The sample program uses a simulated set of call-stacks and costs for its input. Each line represents a call chain and the time in that call chain. So for instance the first line means 5 units in main. The second line means 5 units in f when called by main. Together those would make 10 units of inclusive cost for main and 5 for f. The next line is 5 units in j when called by f when called by main. Main's total goes up to 15 inclusive, f goes to 10, and j begins at 5. This particular example is designed to spread some load all over the tree so that I can illustrate variations from it.
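Computing the inclusive and exclusive costs from input like that is straightforward. Here is a sketch in Python; the “chain cost” line format (slash-separated frames plus a number) is my guess at the sample program’s syntax, used only for illustration:

```python
from collections import defaultdict

def inclusive_costs(lines):
    """Each input line is a call chain and a cost, e.g. 'main/f/j 5',
    meaning 5 units in j when called by f when called by main. Every
    frame on the chain accrues that cost to its inclusive total; only
    the leaf frame accrues it to its exclusive total."""
    inclusive = defaultdict(int)
    exclusive = defaultdict(int)
    for line in lines:
        chain, cost = line.rsplit(None, 1)
        cost = int(cost)
        frames = chain.split("/")
        for frame in set(frames):        # count a frame once even if recursive
            inclusive[frame] += cost
        exclusive[frames[-1]] += cost
    return inclusive, exclusive

inc, exc = inclusive_costs(["main 5", "main/f 5", "main/f/j 5"])
# matches the walkthrough: main 15 inclusive, f 10, j 5
```

With inclusive costs for a before trace and an after trace in hand, the overweight computation above is all that remains.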
Example 2, in which k costs more when called by f
This one line is changed. Other appearances of k are not affected, just the one place.
Example 3, in which x always costs a little more
All the lines that end in x became 6 instead of 5. Like this:
Example 4, in which f calls j more so that subtree gains cost
All the lines under f/j got one bigger like so:
And finally example 5, in which x gets faster but k gets a lot slower
All the x lines get a little better:
But the k line got worse in two places
Let's see how we do with automated analysis of those things:
Summary of Inclusive times for example 1, baseline

This gives us the baseline of 90 units for main, and you can see how all the "5" costs spread throughout the tree.

Summary of Inclusive times for example 2, in which k costs more when called by f

You can see that k has gone up a bit here, but not much. A straight diff would show you that. However there's more to see. Let's look at the first overweight report.

Overweight Report
Before: example 1, baseline
After: example 2, in which k costs more when called by f
Before Time: 90
After Time: 95
Overall Delta: 5
OK, the report clearly shows that k is overweight and so is f. That gives us a real clue that it's k when called by f that is the problem. And it's k's exclusive cost that is the problem, because all its normal children have 0% overweight. Note that there is a clear difference between methods with otherwise equal deltas.

Summary of Inclusive times for example 3, in which x always costs a little more

Our second example: again you could see this somewhat, because x is bigger, but it doesn't really pop here. And many methods seem to have been affected. A straight diff wouldn't tell you nearly as much.

Overweight Report
Before: example 1, baseline
After: example 3, in which x always costs a little more
Before Time: 90
After Time: 95
Overall Delta: 5
Well now things are leaping right off the page. We can see that x was the best source of the regression and also that l and k are being implicated. And f and k are bearing equal cost. We can also see that some branches are underweight. The j path is affected more than the k path because of the distribution of calls.

Summary of Inclusive times for example 4, in which f calls j more so that subtree gains cost

Symbol  Inclusive Cost  Exclusive Cost
main    94              5
f       49              5
j       44              11
g       40              0
k       30              10
x       26              26
y       16              16
z       16              16
l       10              5

Again a straight analysis with so few symbols does evidence the problem; however, it's much clearer below...

Overweight Report
Before: example 1, baseline
After: example 4, in which f calls j more so that subtree gains cost
Before Time: 90
After Time: 94
Overall Delta: 4
The j method is the worst offender; y and z are getting the same impact due to extra calls from j, and j apparently comes from f.

Summary of Inclusive times for example 5, in which x gets faster but k gets a lot slower

Now we have some soup. It is worse, but things are a bit confused. What's going on?

Overweight Report
Before: example 1, baseline
After: example 5, in which x gets faster but k gets a lot slower
Before Time: 90
After Time: 105
Overall Delta: 15
Now again things are a lot clearer. Those negative overweights are showing gains where there should be losses. x is helping. And k jumps to the top with a big 360%. And it's 120% responsible for this mess, meaning not only did it cause the regression it also wiped out gains elsewhere.
In practice negatives are fairly common, because sometimes costs move from one place to another. Sometimes this is normal: in IE, a layout could be caused by a timer for paint rather than by an explicit request from script, but we still get one layout, so the cost just moved a bit. The overweights would show nothing new in the layout space but a big motion in timer events vs. script cost.
In practice this approach has been very good at finding problems in deep call stacks. It even works pretty well if some of the symbols have been renamed, because usually you'll find some symbol that was just above or below the renamed symbol as your starting point for investigation.
Finally you can actually use this technique recursively. Once you find an interesting symbol ("the pivot") that has a big overweight, you regenerate the inclusive costs but ignore any stacks in which the pivot appears. Search for new interesting symbols in what's left the same way and repeat.
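The filtering step of that recursion is trivial. For stack lines of the illustrative "main/f/k 10" form (slash-separated frames plus a cost, my stand-in for the sample program's input), it's one list comprehension:

```python
def exclude_pivot(lines, pivot):
    """Drop every stack line in which the pivot symbol appears, so the
    inclusive costs and overweights can be recomputed on what's left."""
    return [line for line in lines
            if pivot not in line.rsplit(None, 1)[0].split("/")]

stacks = ["main 5", "main/f 5", "main/f/k 10", "main/g/k 5"]
remaining = exclude_pivot(stacks, "k")   # only the k-free stacks survive
```

Re-running the overweight analysis on the surviving stacks surfaces the next most interesting symbol, and you repeat until nothing pops.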
The code that generated these reports is here.
As an afterthought I ran an experiment where I did the "recursion" on the last test case. Here are the results:
Note k is gone.

Summary of Inclusive times for example 6, in which x gets faster and k is removed

Symbol  Inclusive Cost  Exclusive Cost
main    57              5
j       38              10
f       33              5
g       19              0
x       12              12
y       10              10
z       10              10
l       9               5

Overweight Report
Before: example 6, baseline with k removed
After: example 6, in which x gets faster and k is removed
Before Time: 60
After Time: 57
Overall Delta: -3
Overweight analysis leaves no doubt that x is responsible for the gains.
Recap

I’m only up to 1980; that’s pretty amazing considering what’s happened in the story so far. The Altair 8800 made its big splash in the January edition of Popular Electronics (which naturally came out in December). Now it’s 1980, only a half-decade later, and we’ve gone from that barely-there device to things even my 2014 eyes would recognize as actual personal computers. VisiCalc is available on the Apple II and is changing the way people think about their data; it would soon find its way to the CBM family of machines. A not-especially-wealthy person could reasonably afford to buy a computer, a good letter-quality printer, and all the storage they could stand (in the form of lotsa floppy disks), and even have several choices at different price points that would meet those criteria. Database programs were starting to be a thing, but I’d have to say that until dBASE came along later things were pretty much hit-or-miss and there was no clear tool-of-choice.
That amount of progress is just astonishing. One thing that hadn’t changed too much was processor speed. During most of this time machines ran at about 1MHz, and that would remain the case for some time yet. Another thing was the availability of general-purpose hi-res graphics. Now, it’s not that we didn’t call things hi-res, but the Apple II’s notion of hi-res graphics, for instance, sported a whopping 280x192 pixels with a not-fully-general color display system (which you can read about elsewhere if you really like). Not-fully-general would be pretty typical for some time, probably until the CGA graphics of the PC, which was still a good 18 months away.
Notes On Graphics
I think there’s a pretty simple reason why this was still hard. If you consider a typical good-quality screen at the time, you get about 40 characters by 25 lines. Not too much later 80-column displays become available (wow is that roomy! No more program wrapping!) but I think 40 columns is more fair at this time. OK, 40 columns, typically 8 pixels by 8 pixels in a cell. So 320x200. That’s nearly 64k bits, or 8k bytes, to store all those pixels. Well, for starters 8k is a lot of RAM in 1980, but almost as important, we have to read that RAM 60 times a second, and that gives us about a 480kB/s bandwidth problem – challenging on a memory bus of the day, which ran at 1MHz, and that memory has to be dual-ported. And that was assuming 1 bit per pixel. To get even 4 colors total (CGA quality) you need 2 bits and 16k – that was pretty much not happening.
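The arithmetic is worth spelling out; here it is as a few lines of Python, just restating the numbers in the paragraph above:

```python
# Back-of-the-envelope math for a 1980 bitmap display.
cols, rows, cell = 40, 25, 8            # 40x25 characters, 8x8 pixel cells
width, height = cols * cell, rows * cell  # 320 x 200 pixels
bits = width * height                    # 64,000 bits at 1 bit per pixel
frame_bytes = bits // 8                  # 8,000 bytes: the "8k" of pixel RAM
refresh = 60                             # full-screen reads per second
bandwidth = frame_bytes * refresh        # 480,000 B/s, i.e. ~480kB/s
print(width, height, frame_bytes, bandwidth)
```

Double the bits per pixel for 4 colors and both the RAM and the bandwidth double with them, which is why 16k of dual-ported video memory was a non-starter.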
On the other hand, downsizing was about to happen in a big way. Err, a small way. If the PET was austere, the VIC20 was downright Spartan. At 5k of memory, with only 3.5k available for BASIC, you couldn’t fit much code in the thing. But with crazy double-height character mapping you could tile every pixel: at 22x23 you had 506 characters, requiring exactly that many bytes for the storage, plus there was a side table of colors giving the meaning of the 1 and 0 bits in the main table. Of course there were not quite enough characters to cover the screen like that, but 160 by 160 was possible with 200 double-height characters.
The VIC20 was wildly popular at $300, not just because of its built-in capabilities but because its expansion port accepted a variety of cartridges which could bring additional capabilities. For starters, even a simple game cartridge would have the game in question on ROM, so that meant you didn’t need to use any of that precious 3.5k to store the program. But you could get a whopping 16k cartridge for the thing and have VAST amounts of memory for BASIC programming. The keyboard was totally reasonable (the whole thing looked like a thick-ish keyboard), and there was a custom low-end serial port that was kind of like IEEE488 over serial. And plenty slow. But it worked and it was cheap. Notably, the VIC20 was the first computer I know of with a modem for less than $100. Bucketfuls of those sold as well, and Compuserve, The Source, and others suddenly had a huge influx of users. Over a million modems sold.
One of the great things about having a device this inexpensive was that it could be used in a variety of custom applications straight out of the box instead of deploying custom hardware. Around this time at my work we were experimenting with a small breadboard we called the Universal Interface, which was basically a 6502 and some standard IO parts for it (a 6522-ish thing), plus room for ROM, a breadboard for whatever else we needed, and an IEEE488 port that could be repurposed if needed. We’d load it up and it would serve for converting whatever inputs to whatever outputs, usually IEEE488 so a PET could read it. But when space wasn’t a factor, or when video was desired, you could actually deploy a VIC pretty economically, and people did.
Speaking of small computers, though: the VIC20 may seem itty bitty to your eyes, but it’s downright luxurious compared to the champion of minimalist design, the Sinclair ZX80. This baby featured a swell Z80 processor and sold at $99. (Today I can get a very nice tablet for the same price.) It had 1k of memory, and it had very limited video hardware, leaving the heavy lifting to the CPU. At 32x24 characters, if you displayed everything you’d only have 384 bytes to work with. Yowsa. The later ZX81 would let your code run, very slowly, while displaying video, as it could only dedicate cycles to your program during the vertical blank. The ZX80 didn’t even do that, so real-time graphics weren’t really an option. But wow, what a slick trick!
So at this point in history we have the VIC, PET, Apple, TRS80 and CoCo (which I barely mentioned), ZX80, and maybe some other lesser-known ones. We’re 18 months away from the PC and 4 years away from the Mac. By volume the VIC20 will dominate, defining home computing for 2 years for many, even though the other offerings are pretty much universally superior to an unexpanded VIC. For comparison, something north of 2000 Altair 8800s were sold. A half-decade later the VIC20 was selling 9000 units a day, for about the same price as the 8800 kit.
Even William Shatner got in on the deal…
Use Symbol Filtering to get symbols you care about from your server instead of getting the kitchen sink
One of the most annoying things about working with performance traces is that they include information about everything in the system. That's also one of the great things.
However, most of the time, for most problems, there are very few things you are specifically looking for. In my case I'm looking at IE problems and I already know all the IE dlls. If the dude happens to have 82 other programs running and idle I do not need to download their symbols. There was no easy way to specify the symbols I care about, so I wrote this local symbol proxy. Beware, it's slightly cheesy, but it's super helpful because you neither have to download nor cache symbols you don't care about. I haven't tested it extensively but it works for me. If you find bugs, maybe you can stick it on a github somewhere and start maintaining it for reals. I feel like I've done my part :)
The source code is here.
As usual it comes with no warranty and no rights are implied, etc. It's just a quick sample I threw together.
To use it, create your own symfilter.txt file with one pattern on each line. The program expects to find it in the current directory when you start it. Then run the thing. Any requests that do not contain one of your strings will get a 404 return. Everything else will be forwarded to the indicated server for resolution.
If your server path used to be
change it to
The bit before the | is used to find the actual server; the request will be changed to be http://someserver/somepath/...whatever
You could imagine having several configuration files listening on several ports but I leave that as an exercise to the reader. This thing saves me a ton of time by not fetching symbols I don't need.
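For flavor, here's roughly what such a filtering proxy looks like, sketched in Python rather than the original program; the pattern matching is plain substring search, and the port and upstream URL are placeholders:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen
from urllib.error import HTTPError

def load_patterns(path="symfilter.txt"):
    """One substring pattern per line; blank lines ignored."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def wanted(request_path, patterns):
    """True if any pattern appears in the requested symbol path."""
    return any(p in request_path for p in patterns)

def make_handler(patterns, upstream):
    class SymbolFilter(BaseHTTPRequestHandler):
        def do_GET(self):
            if not wanted(self.path, patterns):
                self.send_error(404)       # skip symbols we don't care about
                return
            try:
                # Forward everything else to the real symbol server.
                with urlopen(upstream + self.path) as resp:
                    body = resp.read()
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except HTTPError as e:
                self.send_error(e.code)
    return SymbolFilter

# To run (placeholder upstream), something like:
# handler = make_handler(load_patterns(), "http://someserver/somepath")
# HTTPServer(("localhost", 8080), handler).serve_forever()
```

The 404s are cheap; the debugger just moves on to the next path in its symbol search, so filtered-out modules cost essentially nothing.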