Skip to content

Jimmy Bogard
Syndicate content
Strong opinions, weakly held
Updated: 5 hours 18 min ago

Eventual consistency in REST APIs

Wed, 05/15/2013 - 21:57

Not picking on an API in particular, but…wait, yes I am. Octopus (an awesome product) has a proposed API on GitHub, and one of the things it describes is how to deal with the fact that the backend is built on top of Raven DB, where eventual consistency its default mode for index results.

In it, the proposal includes an “IsStale” flag on the collection, as well as on the query itself, so you could do something like:

GET /api/environments?nonstale=true

Or similar. This presents a rather weird choice to the end user of the API – consistency as a choice, not on mutating operations (PUT/POST/DELETE) but on idempotent GET operations. I presume that this will use the “WaitForNonStaleResults” behavior of RavenDB – but this isn’t really something I’d expect to directly expose to clients.

Without directly exposing our persistence mechanism to our clients (i.e., what if we switched to SQL Server? Redis as a write-through cache to MondoDB? and so on), we have a number of options of dealing with eventual consistency in our REST APIs.

Option 1: Do nothing

The easiest approach for our API is to simply not care. Non-stale results should only really appear when we’re dealing with queries and collections – if you’re dealing with stale resources from GET actions directly against a resource, you’ve really got a weird interaction model.

Looking at some other APIs, like Netflix or GitHub or Trello, you don’t really see any option for influencing the consistency choice on a read.

So we could just ignore it. This is what a lot of APIs do, accept the POST/PUT/DELETE, and make sure that my resource itself is affected correctly. If I do a DELETE /users/jbogard and then a GET /users/jbogard, I expect a 404. But if I query users or search them, then I might not expect those results to be reflected in that model. I can be explicit in this by having search be a completely separate set of resources/entry point (i.e. /search?entity=users&name=jbogard), so it’s completely explicit that search/query is different than interacting with resources.

This is similar to how a library with a card catalog might work. Do I synchronize the activity of checking out a book with checking the catalog? Or might someone have grabbed the book from when I checked the card catalog? Or do I have someone hold the card while checking out a book?

What I don’t do is allow a person looking for a book to either be disappointed OR have the choice of yelling out to everyone in the library, “DOES ANYONE HAVE THIS BOOK OR IS OTHERWISE WANTING TO CHECK THIS BOOK OUT”. I fix my interaction model and just be up front about it.

Option 2: Provide feedback on consistency results

Instead of just punting on whether or not the collection/query is stale, we might provide that as feedback. We could return dates of last update, things like “results as of mm/dd/yyyy”. We often do this on printable reports so that when people print a report (and it is now preeeeeetty much as stale as it gets), we can include a timestamp of when it was printed so that anyone looking at it is informed of how fresh the data is.

Just a tidbit of information to the end user to let them know if their results are up to date or not. If the client made a mutating operation, they can compare the date of this mutation with the date on the query results to make a simple decision to query again, or just move on as planned. Again, the power is in the client not to affect how we query, but what decisions they can make.

Option 3: Return a 202 ACCEPTED status

If we really want to model an asynchronous operation in our REST API, we can look to the HTTP status codes to communicate this explicitly to our clients. We do this in a couple of places where processing is too expensive in the context of a request, so we return a 202 to let clients know that “we got your request, but it’s not getting processed at this moment.” The description from the W3C reads:

The request has been accepted for processing, but the processing has not been completed. The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place. There is no facility for re-sending a status code from an asynchronous operation such as this.

The 202 response is intentionally non-committal. Its purpose is to allow a server to accept a request for some other process (perhaps a batch-oriented process that is only run once per day) without requiring that the user agent’s connection to the server persist until the process is completed. The entity returned with this response SHOULD include an indication of the request’s current status and either a pointer to a status monitor or some estimate of when the user can expect the request to be fulfilled.

We can return a pointer to an indication of the current status, or some estimate of completion. But we’re being 100% up front that your request is processed asynchronously.

This is similar to any long-running transaction. You place an order at a fast-food restaurant, and are given a correlation identifier that represents your order. You can come back and ask the status of your order at any time. But the cashier certainly doesn’t cook our order while we’re standing in line!
Option 4: Write-through cache

If these operations are expected to succeed (or we verify that the transactional write has succeeded), we can do a simple trick that a lot of high-volume websites do. If we don’t have an AP-system like Dynamo (choosing availability and partitionability over consistency), we might choose consistency instead. To do so, we can cache index results, and write our updated value to that store along with our regular persistent store.

It’s not the most exciting of options, but if we know that writes are much less frequent than reads (and we’re not partitioning, i.e. we pick CA of CAP), then it’s not too far-fetched to write to both our cache and our document store.

Of course, if this is all to force us to bypass our consistency model of the database we’ve chosen, it’s a lot of work. But still, it’s an option nonetheless.

Option 5: Choose a database that matches our consistency needs

If we don’t like the consistency model that our database provides, or we feel like we want to allow clients to choose a consistency model, we might view this as a case where we’ve simply chosen the wrong model. If clients NEED consistency, why not pick something that gives that to them? Or don’t allow them to choose.

This would be like, on writes, allowing clients that interact with a database that uses a relational model to also indicate the isolation level on writes. That’s something that should really be encapsulated by the operation, and made with SLAs and contracts about the behavior.

Conclusion

We have a lot of options here, and all are valid in some scenarios. In each example, we’ve chosen a consistency model that matches our needs, rather than have a compromised consistency model exposed from our database of choice. Before Amazon put out their Dynamo paper, it’s not like us as users of the website knew that this storage system existed in the back end. It was encapsulated. The most we could do was “Ctrl+F5” to force a request back to their servers, but that still had no real guarantees.

Instead of letting our database leak its consistency model to our API, it’s better to choose a model that makes this consistency model explicit, or offer no guarantees at all. But as a rule, I as a consumer should not care that you’ve chosen Raven DB instead of SQL Server for your back end. It’s just none of my business.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs

Saga patterns: wrap up

Tue, 05/14/2013 - 16:08

Posts in this series:

NServiceBus sagas are a simple yet flexible tool to achieve a variety of end goals. Whether it’s orchestration, choreography, business activity monitoring, or just other long-running business transaction variants, the uses of an NServiceBus saga are pretty much endless.

When choosing to go with a centralized workflow, we also saw that there is a cost to centralization with the introduction of a bottleneck. With the routing slip pattern, we can include instructions with our message in a header so that each recipient only needs to reference the attached instructions. In a routing slip, flow is linear, but there’s nothing stopping us from including more advanced instructions for state machines, compensations and so on.

I tend to think of the NServiceBus saga as more of a facility, than a specific pattern, because it doesn’t force us to go in any one direction. Rather than assume a specific role or function for a saga, I like to keep things a bit more flexible, and choose one of the many conversation/messaging patterns available for each given scenario.

In the end, sagas are a useful tool (and one that can be over-used, not every workflow deserves central management), but a nice one to have. Every time I introduce NServiceBus sagas to folks that spent time with other messaging tools, whether it’s big orchestration with ESBs or bare-metal messaging tools, the simplicity and code-centric nature of NServiceBus sagas either excites or depresses, depending on the possibility of switching or introducing new tools.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs

Ditching two-phased commits

Thu, 05/09/2013 - 03:47

I’ve had a love-hate relationship with two-phased commits during my years with messaging. Even if MSDTC was free to set up, it doesn’t come free in terms of throughput. Most people run into 2PC in messaging because because queueing systems and databases are two different resources, and therefore don’t participate in the same transaction. Ideally, I’d have all three participants either succeed or fail together:

image

Since the queues in this picture are different resources than the database, I need to involve a third party, or transaction manager, to coordinate transactions between these three resources.

DTC, when it works, works really well. It’s much, much easier to not care about the consequences of a lack of coordination. In fact, I’d recommend not caring until you actually do care, because ditching two-phased commits does require work. Luckily for us, there are a ton of resources on how to do exactly that!

Most of the time, literature around avoiding 2PC is concerned about an entirely different situation, where I have two separate databases:

image

We’re doing messaging, which means that it’s typically the consumer of the message that does something against other data stores. So even though we’re avoiding communicating with two databases, it’s still two resources, and thus a need to coordinate!

But again, that coordination comes with a cost. A fairly large cost, in some recent testing we found that overall throughput dropped 80%, or to put it another way, ditching DTC saw a 5X increase in throughput. Five fold!

For some systems, that throughput doesn’t matter much, but for those that have a reasonably high volume of messages, or sensitive SLAs, it’s worth investigating alternative approaches.

General rules of thumb

Like most messaging approaches, the ways of avoiding coordination are right in front of our faces. In Gregor Hohpe’s excellent paper on Starbucks, he points out any real-world system that values throughput over absolute consistency avoids distributed transactions. The basic ideas are:

  • Idempotency is king. Get this and you’re halfway home
  • Strategies for dealing with downstream effects is a business decision

Idempotency is absolutely required, but it’s not that hard to apply. For some operations, we can rely on natural idempotency. If I’m asked to turn on the light, receiving the request twice means the same outcome – the light is on! For state machine-like systems, idempotency is a bit easier.

For operations that aren’t naturally idempotent (launch the nuclear missile), we’ll need to get a little creative. If we can identify some correlating information from a request (The president called at 9:15 to launch the missile) or just assign some correlation information (The president has issued request #132 to launch the missile), we can simply keep a journal on the receiving side. If it’s expensive to keep a journal around, we can recycle/trim our journals if they get too big.

Downstream effects become more interesting. If throughput is a high concern, we can rely on compensating actions (customer didn’t have enough money, cancel the order) or more journaling. Instead of sending a message immediately, shouting out messages to downstream systems, we can instead just write down in the same persistent store as our other data another journal for outgoing messages.

Once our local DB transaction is complete, it’s just a matter of sending the messages we’ve written down to send out down the line, and crossing them off our list of “sent” messages. And since downstream systems can deal with at-least-once messages through our idempotency guarantees.

How I learned to stop worrying and ditch 2PC

In some current systems, we’re deciding on a service-by-service basis whether or not we want to enlist or not enlist in distributed transactions. It’s still annoying to try and build a system-wide solution (though the event sourcing guys have this more or less in the bag), so until then, I can just use business decisions to guide me one way or the other.

But it is time to let go and stop worrying so much. Honestly, unless your services have downstream side effects, you can safely turn off DTC if your work is idempotent. If you have downstream side effects, there’s a number of paths to choose from. While I’m not saying goodbye forever (still the best solution if it were absolutely free to use), it is time to shop around.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs

Messages, data and types

Wed, 05/01/2013 - 16:08

One concern I receive quite a bit from folks new to messaging, especially those coming from SOAP and WCF land, is how to preserve the convenience of proxy classes and data contracts that can be shared amongst multiple clients. The problem comes in when looking at coupling, especially around changes in the contract and how to upgrade clients. Clemens Vasters details many of these issues in his screencast on data/contract coupling in messaging.

One thing that grounds all of this is what we consider the message as developers, and what our transport and messaging system considers to be the message. For example, the Azure REST services expose contracts as XML. XML, by itself, and JSON make for great transport formats because the underlying technology provides a fairly universally acceptable common type system. Although your language might need to bridge from your format to theirs, people standardize on ISO formats for standard primitives to maximize interoperability (and minimize serialization mistakes).

Dealing with such large XML documents from a client API can be a pain, however. XML is notoriously finicky with respect to case sensitivity, and I can’t count how many times I’ve been bitten by this when dealing with raw XML and REST APIs. We would need to assume that documentation exists, and until forms become common in REST APIs, it’s difficult to say that HATEOAS will simply solve all these problems of self-describing APIs.

Instead, we often see REST and other messaging clients, out of convenience, build DTOs as a means of representing the message. But first – what is a message? A message is just data. It’s defined by a header and body, where the header is used by the transport/messaging system and the body is ignored (picture courtesy http://www.eaipatterns.com):

However, messages aren’t types. But what about sharing something like this?

[DataContract]
public class PurchaseOrder
{
	private int poId_value;
	
	// Apply the DataMemberAttribute to the property.
	[DataMember]
	public int PurchaseOrderId
	{
	
	    get { return poId_value; }
	    set { poId_value = value; }
	}
}

That’s still not terrible, because underneath the covers our message is still just XML or JSON. We use this type as a description or blueprint of our message, because it’s simpler to describe our message in C# terms instead of a looser type like XML or JSON, which are more difficult to describe and use in C#. I’m ignoring dynamic types in .NET – those to me are a bit of a hack in this case. In WCF, proxy classes get generated on the client side, so we’re still not taking an assembly dependency. This isn’t available in REST or other messaging technologies, leaving clients reliant on documentation to “Get it right” – assuming that they don’t make mistakes translating raw XML or JSON into code building raw XML or JSON on the client side.

So what typically happens in a homogenous environment is that data contract assemblies are shared:

image

Both client and server share a contracts assembly, and use the contracts assembly as a description for how to construct and consume the raw messages.

We introduce coupling on the client side with a raw type shared across the server boundary, but it’s up to those building the system to determine if this sort of coupling introduces any potential risks. When we look at coupling, we must always balance risk. If coupling introduces low risk, it might be acceptable (assuming we’re more or less prescient in our future system’s design).

From my experience, as long as the messaging infrastructure doesn’t assume that the message is built from a type and therefore leak those concerns, this sort of model can be a nice compromise in ecosystems where types as blueprints for messages ensures safety in our message construction and consumption. It’s similar to building MVC applications around View Models – they’re a blueprint for building forms, and a means of accepting raw form POSTs. Side note – I found it hilarious that the Rails folks ran into that mass assignment security problem – it’s a problem I’ve never, ever had in MVC.

But that’s the real kicker – our messaging infrastructure can’t assume types, as the message is not the type. We might use a type as a convenience to build and consume, as we do in MVC, but ultimately, our messaging infrastructure can’t assume a type. MVC handles this quite well, with model metadata and its ModelState objects. The original request is always preserved in its raw form (dictionary of strings), but the model provided to the controller action is an approximation of that request.

It’s only when we assume that we’ve literally shared types that we’re going to slip into real type coupling, and everything that SOAP failed to deliver comes back again.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs

Saga alternatives – routing slips

Fri, 04/26/2013 - 21:49

In the last few posts on sagas, we looked at a variety of patterns of modeling long-running business transactions. However, the general problem of message routing doesn’t always require a central point of control, such as is provided with the saga facilities in NServiceBus. Process managers offer a great deal of flexibility in modeling complex business processes and splitting out concerns. They come at a cost though, with the shared state and single, centralized processor.

Back in our sandwich shop example, we had a picture of an interaction starting a process and moving down the line until completion:

image

Not quite clear in this picture is that if we were to model this process as a saga, we’d have a central point in which all messages must flow to for decisions to be made on the next step. But is there really any decision to be made in the picture above? In a true saga, we have a sequence of steps and a set of compensating actions (in a very simplistic case). Many times, however, there’s no need to worry about compensations in case of failures. Nor does the order in which we do things change much.

Humans have already found that assembly lines are great ways of breaking down a long process into individual steps, and performing those steps one at a time. Henry Ford’s Model T rolled off the assembly line every 3 minutes. If only one centralized worker coordinated all steps, it’s difficult to imagine how this level of throughput could be achieved.

The key differentiator is that there’s nothing really to coordinate – the process of steps is well-defined and known up front, and individual steps shouldn’t need to make decisions about what’s next. Nor is there a need for a central controller to figure out the next step – we already know the steps up front!

In our sandwich model, we need to tweak our picture to represent the reality of what’s going on. Once I place my order, the sequence of steps to fulfill my order are known up front, based on simply examining my order. The only decision to be made is to inspect the order and write the steps down. My order then flows through the system based on the pre-defined steps:

image

 

Each step doesn’t change the order, nor do they decide what the next step is (or even care who the next step is). Each step’s job is to simply perform its operation, and once completed, pass the order to the next step.

Not all orders have the same set of steps, but that’s OK. As long as the steps don’t deviate from the plan once started, we don’t need to have any more “smarts” in our steps.

It turns out this pattern is a well-known pattern in the messaging world (which, in turn, borrowed its ideas from the real world): the routing slip pattern.

Routing slips in NServiceBus

Routing slips don’t exist in NServiceBus, but it turns out it’s not too difficult to implement. A routing slip represents the list of steps (and the progress), as well as a place for us to stash any extra information needed further down the line:

    public interface IRoutingSlip
    {
        Guid Id { get; }
        IEnumerable<IProcessingStep> ProcessingSteps { get; }
        IDictionary<string, string> Attachments { get; }
    }

We can attach our routing slip to the original message, so that each step can inspect the slip for the next step. We’ll kick off the process when we first send out the message:

Bus.Route(sandwichOrder, new[]
{
	"Preparation",
	"Oven",
	"Packing",
});

Each handler handles the message, but doesn’t really need to do anything to pass it down the line, we can do this at the NServiceBus infrastructure level.

public class PackingHandler 
	: IHandleMessages<SandwichOrder>
{
	public void Handle(SandwichOrder message)
	{
	    // pack the sandwich
	}
}

The nice aspect of this model is that it eliminates any centralized control. The message flows straight through the set of queues – leaving out any potential bottleneck our saga implementation would introduce. Additionally, we don’t need to resort to things like pub-sub – since this would still force our steps to be aware of the overall order, hard-coding who is next in the chain. If a customer doesn’t toast their sandwich – it doesn’t go through the oven, but we know this up front! No need to have each step to know both what to do and what the next step is.

I put the message routing implementation up on NuGet and GitHub, you just need to enable it on each endpoint via configuration:

public class Startup : IWantCustomInitialization
{
	public void Init()
	{
	    Configure.Instance.RoutingSlips();
	}
}

If you need to process a message in a series of steps (known up front), and want to keep individual steps from knowing what’s next (or introduce a central controller), the routing slip pattern could be a good fit.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs

Scaling NServiceBus Sagas

Tue, 03/26/2013 - 15:57

When looking at NServiceBus sagas (process managers), especially at high volume of messages, we often run into two main problems:

  • Deadlocks
  • Starvation

This is because of the fundamental (default) design of sagas is that:

  • A single saga shares a single saga entity (causing deadlocks)
  • All messages handled by a saga are delivered to the same endpoint (causing a bottleneck)

Two very different problems, with very different solutions. But, as per usual, we’ll look at how we might handle these situations in the real life to see how our virtual world should behave. Both of these problems (amongst others) are called out in the Enterprise Integration Patterns book, so it shouldn’t be a surprise if you run into these issues. Not as clear, however, is how to deal with them.

Deadlocks

Deadlocks in sagas are pretty easy to run into if we start pumping up the concurrency on our sagas. Back in our McDonald’s example, we published a message out and waited for all of the stations to report back:

image

In our restaurant, there’s only room for one person to examine a single tray at a time. When someone finishes a food item, they have to go look at all the trays. But what if someone is putting food on my tray? Do I push them out of the way?

Probably not, I need to wait for them to finish. It turns out that this is what happens in our sagas – we can only really allow one person at a time to affect a saga instance. The way NServiceBus can handle this by default is by adjusting the isolation level for Serializable transactions. This ensures that only one person looks at a tray at any given time.

What we don’t want to have happen is the fries person look at the tray, see a missing sandwich, and decide that the order isn’t done. The sandwich person does the same thing in reverse – look at the tray, see missing fries, and decide the order isn’t done.

It turns out that our Serializable isolation level has unintended consequences – we might lock more trays (or ALL of the trays) unintentionally. If there are more than one message that can start our saga, then effectively every employee has to look at all the trays to see if our tray is already on the counter, or we’re the first one and need to put a new one out. But if someone is also doing the same thing – multiple trays for the same order might go out!

We don’t want to block the entire counter for our order, so we can adjust our scheme slightly. If we implement a simple versioning scheme, we can simply have the employees look at a version number on the receipt, make their changes, and bump the version number with a little tally mark. If the tally mark has gone up since we last looked at the tray, we’ll just re-do our work – something has changed, so let’s reevaluate our decisions.

To get around the problem of too many trays for the same order, we can rely on a third party to ensure we don’t have too many trays. If when anyone places a tray down, the supervisor enforces this rule that only one tray per order can exist, they’ll ask us to put our tray back up and redo our work.

Two things we want to do then:

  • Introduce a unique constraint on the correlation identifier to prevent duplicates
  • Use optimistic concurrency to allow us to isolate just our own tray (and not affect anyone else)

This is a rather straightforward fix – and in fact, NServiceBus defaults to using optimistic concurrency in 4.0 (now in beta). This was a rather easy fix, but what about our other problem of the bottleneck?

Starvation and bottlenecks

Back in my early days of school, I worked at a fast food restaurant during breaks (surprise, surprise). I worked the graveyard shift of a 24-hour burger joint, and it was just me and one other person manning the entire place. From 10PM to 6AM, just two of us.

One thing that I found pretty early was that we had quite a few “regulars”. These were folks that every morning, came in and ordered the same thing at around the same time. Because it was quite early, they could rely on little to no contention for resources, and the queue was more or less empty.

Most of the time.

But every once in a while, we would get someone ordering food for a large group of people. We didn’t do catering, but we did sell breakfast tacos. Handling one or two at a time wasn’t a problem, but someone coming in and ordering 50 tacos – that’s a problem. Our process looked something like this:

image

The problem was, because there were only two of us, most messages flowed through exactly one channel. In NServiceBus Sagas, this is what happens as well – all messages for a saga are delivered to a single endpoint. We might have the situation like this:

Saga Queue Prep Taco – #1 Prep Taco – #1 Prep Taco – #1 Prep Taco – #1 Prep Taco – #1 Prep Taco – #1 Prep Taco – #1 Prep Pancakes – #2

My cook has dozens of tacos to go through before getting to that second order, and because that person packs all the food and preps all the food, all items from the first order must be both prepped and packed before order #2 is complete:

Saga Queue Pack Taco – #1 Pack Taco – #1 Pack Taco – #1 Pack Taco – #1 Pack Taco – #1 Pack Taco – #1 Pack Taco – #1 Pack Pancakes – #2

That’s not a huge problem, except we have both packing and prepping done by a single person. Another order comes in, the prepping for that one interferes with the others. Additionally, items are sitting in the saga queue waiting to get packed while prepping finishes.

In many NServiceBus sagas, the sagas themselves don’t perform the actual work. They’re instead the controllers/coordinators, making decisions about next steps. But all the messages are delivered to the exact same queue, effectively creating a bottleneck. I’m waiting for all prepping to be done before moving on to any packing. The time it takes to this entire set of orders is basically the time it takes to prep all items plus the time it takes to pack all items.

The problem is our saga is modeled rather strangely. In the real world, long-running business processes, split into multiple activities/steps, are performed many times by different people. Effectively, different channels/endpoints/queues. But our saga shoved everyone back into one queue!

What we did in our taco tragedy above was that I came off the line from taking orders and performed the prep stage. For N items, we took a total time of N+1 or so. We had two queues going for the one saga:

Prep Queue Prep Taco – #1 Prep Pancakes – #2 Pack Queue Pack Taco – #1 Pack Taco – #1

The total number of items in the Pack queue was always small – packing was easy and quick. By splitting the saga handling into separate queues, we ensured that an overload in one type of message didn’t choke out all the others. This isn’t the default way of managing long-running processes in NServiceBus though, we have to set this up ourselves.

When dealing with the Process Manager pattern, it’s important to keep in mind what this pattern brings to the table. We’re able to support complex message flow, but at the price of potential bottlenecks and deadlocks.

Next time – a better pattern for linear processes that doesn’t bring in deadlocks and bottlenecks, the routing slip pattern.

Post Footer automatically generated by Add Post Footer Plugin for wordpress.

Categories: Blogs