Skip to content

Jimmy Bogard
Syndicate content
The Barley Architect
Updated: 6 min 59 sec ago

Refactoring Towards Resilience: Process Manager Solution

Mon, 02/27/2017 - 23:30

Other posts in this series:

In the last post, we examined all of our coordination options as well as our process coupling to design a solution for our order processor. In my experience, it's this part of design async workflows that's by far the hardest. The code behind building these workflows is quite simple, but getting to that code takes asking tough questions and getting real answers from the business about how they want to handle the coupling and coordination options. There's no right or wrong in the answers, just tradeoffs.

To implement our process manager that will handle coordination/choreography with the external services, I'm going with NServiceBus. My biggest reason to do so is that NServiceBus, instead of being a heavyweight broker, acts as an implementor of most of the patterns listed in the Enterprise Integration Patterns catalog, and for nearly all my business cases I don't want to implement those patterns myself. As a refresher, our final design picture looks like:

We've already gone over the API/Task generation side, the final part is to build the process manager, the Stripe payment gateway, and event handlers (including SendGrid).

In terms of project structure, I still include Stripe as part of my overall order "service" boundary, so I have no qualms including it in the same solution as my process manager. With that in mind, let's look first at our order process manager, implemented as an NServiceBus saga.

Initial Order Submit

From the last post, we saw that the button click on the UI would create an order, but defer to backend processing for actual payments. Our process manager responds to the front-end command to start the order processing:

public async Task Handle(ProcessOrderCommand message,  
    IMessageHandlerContext context) {
    var order = await _db.Orders.FindAsync(message.OrderId);

    await context.Send(new ProcessPaymentCommand
    {
        OrderId = order.Id,
        Amount = order.Total
    });
}

When we receive the command to process the order, we send a command to our Stripe processor from our Saga, defined as:

public class OrderAcceptanceSaga : Saga<OrderAcceptanceData>,  
    IAmStartedByMessages<ProcessOrderCommand>,
    IHandleMessages<ProcessPaymentResult>
{
    private readonly OrdersContext _db;

    public OrderAcceptanceSaga(OrdersContext db)
    {
        _db = db;
    }
    protected override void ConfigureHowToFindSaga(
        SagaPropertyMapper<OrderAcceptanceData> mapper)
    {
        mapper.ConfigureMapping<ProcessOrderCommand>(m => m.OrderId);
    }

It doesn't seem like much in our process, we just turn around and send a command to Stripe, but that means that our front end has successfully recorded the order. With our initial command sent, let's check out our Stripe side.

Stripe processing

On the Stripe side, we said that payments are an Order service concern, which means I'm happy letting payments be a command. Between services, I prefer events, and internal to a service, commands are fine (events are fine too, I just prefer to coordinate/orchestrate inside a service).

We can implement a fairly straightforward Stripe handler, using a Stripe API NuGet package to help with the communication side:

public async Task Handle(ProcessPaymentCommand message,  
    IMessageHandlerContext context)
{
    var order = await _db.Orders.FindAsync(message.OrderId);

    var myCharge = new StripeChargeCreateOptions
    {
        Amount = Convert.ToInt32(order.Total * 100),
        Currency = "usd",
        Description = message.OrderId.ToString(),
        SourceCard = new SourceCard
        {
            /* get securely from order */
            Number = "4242424242424242",
            ExpirationYear = "2022",
            ExpirationMonth = "10",
        },
    };

    var requestOptions = new StripeRequestOptions
    {
        IdempotencyKey = message.OrderId.ToString()
    };

    var chargeService = new StripeChargeService();

    try
    {
        await chargeService.CreateAsync(myCharge, requestOptions);

        await context.Reply(new ProcessPaymentResult {Success = true});
    }
    catch (StripeException)
    {
        await context.Reply(new ProcessPaymentResult {Success = false});
    }
}

Most of this is fairly standard Stripe pieces, but the most important part is that when we call the Stripe API, we track success/failure and return a result appropriately. Additionally, we pass in the idempotency key based on the order ID so that if something goes completely wonky here and our message retries, we don't charge the customer twice.

We could get quite a bit more complicated here, looking at retries and the like but this is good enough for now and at least fulfills our goal of not accidentally charging the customer twice, or charging them and losing that information.

Handling the Stripe response

Back in our Saga, we need to handle the response from Stripe and perform any downstream actions. Now since we have this issue of the order successfully getting received but payment failing, we need to track that. I've handled this just by including a simple flag on the order and publishing a separate message:

public async Task Handle(ProcessPaymentResult message,  
    IMessageHandlerContext context)
{
    var order = await _db.Orders.FindAsync(Data.OrderId);

    if (message.Success)
    {
        order.PaymentSucceeded = true;

        await context.Publish(new OrderAcceptedEvent
        {
            OrderId = Data.OrderId,
            CustomerName = order.CustomerName,
            CustomerEmail = order.CustomerEmail
        });
    }
    else
    {
        order.PaymentSucceeded = false;

        await context.Publish(new OrderPaymentFailedEvent
        {
            OrderId = Data.OrderId
        });
    }

    await _db.SaveChangesAsync();

    MarkAsComplete();
}

Depending on the success or failure of the payment, I mark the order as payment succeeded and publish out a requisite event. Not that complicated, but this decoupling of the process from Stripe itself means that when I notify downstream systems, I'm only doing so after successfully processing the Stripe call (but not that the Stripe call itself was successful).

SendGrid event subscriber

Finally, our publishing of the OrderAcceptedEvent means we can build a subscriber to then send out the email to the customer that their order was successfully processed. Again, I'll use a NuGet package for the SendGrid API to do so:

public class OrderAcceptedHandler  
    : IHandleMessages<OrderAcceptedEvent>
{
    public Task Handle(OrderAcceptedEvent message, 
        IMessageHandlerContext context)
    {
        var apiKey = Environment.GetEnvironmentVariable("MY_RAD_SENDGRID_KEY");
        var client = new SendGridClient(apiKey);
        var msg = new SendGridMessage();

        msg.SetFrom(new EmailAddress("no-reply@my-awesome-store.com", "No Reply"));
        msg.AddTo(new EmailAddress(message.CustomerEmail, message.CustomerName));
        msg.SetTemplateId("0123abcd-fedc-abcd-9876-0123456789ab");
        msg.AddSubstitution("-name-", message.CustomerName);
        msg.AddSubstitution("-order-id-", message.OrderId.ToString());

        return client.SendEmailAsync(msg);
    }
}

Again, not too much excitement here, I'm just sending an email. The interesting part is the email sending is now temporally decoupled from my ordering process. In fact, email notifications are just another subscriber so we can easily imagine this sort of communication living not in the ordering service but perhaps a CRM service instead.

Wrapping it up

Our process we designed so far is pretty simple, just decoupling a few external processes from a button click. With an NServiceBus Saga in place to act as a process manager, our possibilities for more complex logic around the order acceptance process grow. We can retry payments, do more complicated order acceptance checks like fraud detection or address verification.

Regardless, we've addressed our initial problems in the distributed disaster we created earlier. It took quite a few more lines of code and more moving pieces, but that's always been my experience. Resilience is a feature, and one that has to be carefully considered and designed.

Categories: Blogs

Refactoring Towards Resilience: Async Workflow Options

Fri, 02/17/2017 - 23:44

Other posts in this series:

In the last post, we looked at coupling options in our 3rd-party resources we use as part of "button-click" place order, and whether or not we truly needed that coupling or not. As a reminder, coupling is neither good nor bad, it's the side-effects of coupling to the business we need to evaluate in terms of desirable and undesirable, based on tradeoffs. We concluded that in terms of what we needed to couple:

  • Stripe: minimize checkout fallout rate; process offline
  • Sendgrid: Send offline
  • RabbitMQ: Send offline

Basically, none of our actions we determined to need to happen right at button click. This doesn't hold true for every checkout page, but we can make that assumption for this example.

Sidenote - in the real-life version of this, we opted for Stripe - undo, SendGrid - ignore, RabbitMQ - ignore, with offline manual re-sending based on alerts.

With this in mind, we can design a process that manages the side effects of an order placement separate from placing the order itself. This is going to make our process more complicated, but distributed system tend to be more complicated if we decide we don't want to blissfully ignore failures.

Starting the worfklow

Now that we've decided we can process our three resources out-of-band, the next question becomes "how do I signal to that back-end processing to do its work?" This largely depends on what your backend includes, if you're on-prem or cloud etc etc. Azure for example offers a number of options for us for "async processing" including:

  • Azure WebJobs
  • Azure Service Bus
  • Azure Scheduler

In my situation, we weren't deploying to Azure so that wasn't an option for us. For on-prem, we can look at:

From these three, I'm inclined towards Hangfire as it's easy to integrate into my Web API/MVC/ASP.NET Core app. A single background job executing say, once a minute, can check for any pending messages and send them along:

RecurringJob.AddOrUpdate(() => {  
    using (var db = new CartContext()) {
        var unsent = db.OutboxMessages.ToList();
        foreach (var msg in unsent) {
            Bus.Send(msg);
            db.OutboxMessages.Delete(msg);
        }
        db.SaveChanges();
    }
}, "*/1 * * * *");

Not too complicated, and this will catch any unsent messages from our API that we tried to send after the DB transaction. Once a minute should be quick enough to catch unsent messages and still not have it seem like to the end user that they're missing emails.

Now that we've got a way to kick off our workflow, let's look at our workflow options themselves.

Workflow options

There's still some ordering I need to enforce on my external resource operations, as I don't want emails to be sent without payment success. Additionally, because of the resilience options we saw earlier, I don't really want to couple each operation together. Because of this, I really want to break my workflow in multiple steps:

In our case, we can look at three major workflows: Routing Slip, Saga, and Process Manager. The Process Manager pattern can further break down into more detailed patterns, from the Microservices book "Choreography" and "Orchestration", or as I detailed them a few years back, "Controller" and "Observer".

With these options in mind, let's look at each in turn to see if they would be appropriate to use for our workflow.

Routing Slip

Routing slip is an interesting pattern that allows each individual step in the process to be decoupled from the overall process flow. With a routing slip, our process would look like:

We start create a message that includes a routing slip, and include a mechanism to forward along:

Bus.Route(msg, new [] {"Stripe", "SendGrid", "RabbitMQ"}  

Where "RabbitMQ" is really just and endpoint at this point to publish a "OrderComplete" message.

From the business perspective, does this flow make sense? Going back to our coordination options for Stripe, under what conditions should we flow to the next step? Always? Only on successful payment? How do we handle failures?

The downside of the Routing Slip pattern is it's quite difficult to introduce logic to our workflow, handle failures, retries etc. We've used it in the past successfully, and I even built an extension to NServiceBus for it, but it tends to fall down in our scenario especially around Stripe where I might need to do an refund. Additionally, it's not entirely clear if we publish our "Order Complete" message when SendGrid is down. Right now, it doesn't look good.

Saga

In the saga pattern, we have a series of actions and compensations, the canonical example being a travel booking. I'm booking together a flight, car, and hotel, and only if I can book all 3 do I call my travel booking "complete":

//vasters.com/archive/Sagas.html

Does a Saga make sense in my case? From our coordination examination, we found that only the Stripe resource had a capability of "Undo". I can't "Undo" a SendGrid call, nor can I "Undo" a message published.

For this reason, a Saga doesn't make much sense. Additionally, I don't really need to couple my process together like this. Sagas are great when I have an overall business transaction that I need to decompose into smaller, compensate-friendly transactions. That's clearly not what I have here.

Process Manager - Orchestration/Controller

Our third option is a process manager that acts as a controller, orchestrating a process/workflow from a central point:

Now, this still doesn't make much sense because we're coupling together several of our operations, making an assumption that our actions need to be coordinated in the first place.

So perhaps let's take a step back at our process, and examine what actually needs to be coordinated with what!

Process Examination

So far we've looked at process patterns and tried to bolt them onto our steps. Let's flip that, and go back to our original flow. We said our process had 4 main parts:

  1. Save order
  2. Process payment
  3. Email customer
  4. Notify downstream

From a coupling perspective, we said that "Process Payment" must happen only after I save the order. We also said that "Email customer" must happen only if "Process Payment" was successful. Additionally, our "notify downstream" step must only happen if our order successfully processed payment.

Taking a step back, isn't the "email customer" a form of notifying downstream systems that the order was created? Can we just make an additional consumer of the OrderCreatedEvent be one that sends onwards? I think so!

But we still have the issue of payment failure, so our process manager can handle those cases as well. And since we've already made payments asynchronous, we need some way to signal to the help team that an order is in a failed payment state.

With that in mind, our process will be a little of both, orchestration AND choreography:

We treat the email as just another subscriber of our event, and our process manager now is only really concerned about completing the order.

In our final post, we'll look at implementing our process manager using NServiceBus.

Categories: Blogs

Refactoring Towards Resilience: Evaluating Coupling

Tue, 02/14/2017 - 23:25

Other posts in this series:

So far, we've been looking at our options on how to coordinate various services, using Hohpe as our guide:

  • Ignore
  • Retry
  • Undo
  • Coordinate

These options, valid as they are, make an assumption that we need to coordinate our actions at a single point in time. One thing we haven't looked at is breaking the coupling of our actions, which greatly widens our ability to deal with failures. The types of coupling I encounter in distributed systems (but not limited to) include:

  • Behavioral
  • Temporal
  • Platform
  • Location
  • Process

In our code:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

Of the coupling types we see here, the biggest offender is Temporal coupling. As part of placing the order for the customer's cart, we also tie together several other actions at the same time. But do we really need to? Let's look at the three external services we interact with and see if we really need to have these actions happen immediately.

Stripe Temporal Coupling

First up is our call to Stripe. This is a bit of a difficult decision - when the customer places their order, are we expected to process their payment immediately?

This is a tough question, and one that really needs to be answered by the business. When I worked on the cart/checkout team of a Fortune 50 company, we never charged the customer immediately. In fact, we did very little validation beyond basic required fields. Why? Because if anything failed validation, it increased the chance that the customer would abandon the checkout process (we called this the fallout rate). For our team, it made far more sense to process payments offline, and if anything went wrong, we'd just call the customer.

We don't necessarily have to have a black-and-white choice here, either. We could try the payment, and if it fails, mark the order as needing manual processing:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    try {
        var payment = await stripeService.PostPaymentAsync(order);
    } catch (Exception e) {
        Logger.Exception(e, $"Payment failed for order {order.Id}");
        order.MarkAsPaymentFailed();
    }
    if (!order.PaymentFailed) {
        await sendGridService.SendPaymentSuccessEmailAsync(order);
    }
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

There may also be business reasons why we can't process payment immediately. With orders that ship physical goods, we don't charge the customer until we've procured the product and it's ready to ship. Otherwise we might have to deal with refunds if we can't procure the product.

There are also valid business reasons why we'd want to process payments immediately, especially if what you're purchasing is digital (like a software license) or if what you're purchasing is a finite resource, like movie tickets. It's still not a hard and fast rule, we can always build business rules around the boundaries (treat them as reservations, and confirm when payment is complete).

Regardless of which direction we go, it's imperative we involve the business in our discussions. We don't have to make things technical, but each option involves a tradeoff that directly affects the business. For our purposes, let's assume we want to process payments offline, and just record the information (naturally doing whatever we need to secure data at rest).

SendGrid Temporal Coupling

Our question now is, when we place an order, do we need to send the confirmation email immediately? Or sometime later?

From the user's perspective, email is already an asynchronous messaging system, so there's already an expectation that the email won't arrive synchronously. We do expect the email to arrive "soon", but typically, there's some sort of delay. How much delay can we handle? That again depends on the transaction, but within a minute or two is my own personal expectation. I've had situations where we intentionally delay the email, as to not inundate the customer with emails.

We also need to consider what the email needs to be in response to. Does the email get sent as a result of successfully placing an order? Or posting the payment? If it's for posting the payment, we might be able to use Stripe Webhooks to send emails on successful payments. In our case, however, we really want to send the email on successful order placement not order payment.

Again, this is a business decision about exactly when our email goes out (and how many, for what trigger). The wording of the message depends on the condition, as we might have a message for "thank you for your order" and "there was a problem with your payment".

But regardless, we can decouple our email from our button click.

RabbitMQ Coupling

RabbitMQ is a bit of a more difficult question to answer. Typically, I generally assume that my broker is up. Just the fact that I'm using messaging here means that I'm temporally decoupled from recipients of the message. And since I'm using an event, I'm behaviorally decoupled from consumers.

However, not all is well and good in our world, because if my database transaction fails, I can't un-send my message. In an on-premise world with high availability, I might opt for 2PC and coordinate, but we've already seen that RabbitMQ doesn't support 2PC. And if I ever go to the cloud, there are all sorts of reasons why I wouldn't want to coordinate in the cloud.

If we can't coordinate, what then? It turns out there's already a well-established pattern for this - the outbox pattern.

In this pattern, instead of sending our messages immediately, we simply record our messages in the same database as our business data, in an "outbox" table":

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    dbContext.SaveMessage(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

Internally, we'll serialize our message into a simple outbox table:

public class Message {  
    public Guid Id { get; set; }
    public string Destination { get; set; }
    public byte[] Body { get; set; }
}

We'll serialize our message and store in our outbox, along with the destination. From there, we'll create some offline process that polls our table, sends our message, and deletes the original.

while (true) {  
    var unsentMessages = await dbContext.Messages.ToListAsync();
    var tasks = new List<Task>();
    foreach (var msg in unsentMessages) {
        tasks.Add(bus.SendAsync(msg)
           .ContinueWith(t => dbContext.Messages.Remove(msg)));
    }
    await Task.WhenAll(tasks.ToArray());
}

With an outbox in place, we'd still want to de-duplicate our messages, or at the very least, ensure our handlers are idempotent. And if we're using NServiceBus, we can quite simply turn on Outbox as a feature.

The outbox pattern lets us nearly mimic the 2PC coordination of messages and our database, and since this message is a critical one to send, warrants serious consideration of this approach.

With all these options considered, we're now able to design a solution that properly decouples our different distributed resources, still satisfying the business goals at hand. Our next post - workflow options!

Categories: Blogs

Refactoring Towards Resilience: Evaluating RabbitMQ Options

Fri, 02/10/2017 - 19:50

Other posts in this series:

In the last post, we looked at dealing with an API in SendGrid that basically only allows at-most-once calls. We can't undo anything, and we can't retry anything. We're going to find some similar issues with RabbitMQ (although it's not much different than other messaging systems).

RabbitMQ, like all queuing systems I can think of, offer a wide variety of reliability modes. In general, I try to make my message handlers idempotent, as it enables so many more options up stream. I also don't really trust anyone sending me messages so anything I can do to ensure MY system stays consistent despite what I might get sent is in my best interest.

Looking back at our original code:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

We can see that if anything fails after the "bus.Publish" line, we don't really know what happened to our message. Did it get sent? Did it not? It's hard to tell, but going to our picture of our transaction model:

Transaction flow

And our options we have to consider as a reminder:

Coordination Options

Let's take a look at our options dealing with failures.

Ignore

Similar to our SendGrid solution, we could just ignore any failures with connecting to our broker:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    try {
        await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    } catch (Exception e) {
        Logger.Exception(e, $"Failed to send order created event for order {order.Id}");
    }
    return RedirectToAction("Success");
}

This approach would shield us from connectivity failures with RabbitMQ, but we'd still need some sort of process to detect these failures and retry those sends later on. One way to do this would be simply to flag our orders:

} catch (Exception e) {
    order.NeedsOrderCreatedEventRaised = true;
    Logger.Exception(e, $"Failed to send order created event for order {order.Id}");
}    

It's not a very elegant solution, as I'd have to create flags for every single kind of message I send. Additionally, it ignores the issue of a database transaction rolling back, but my message is still sent. In that case, my message will still get sent, and consumers could get events for things that didn't actually happen! There are other ways to fix this - but for now, let's cover our other options.

Retry

Retries are interesting in RabbitMQ because although it's fairly easy to retry my message on my side, there's no guarantee that consumers can support a message if it came in twice. However, in my applications, I try as much as possible to make my message consumers idempotent. It makes life so much easier, and allows so many more options, if I can retry my message.

Since my original message includes the unique order ID, a natural correlation identifier, consumers can have an easy way of ensuring their operations are idempotent as well.

The mechanics of a retry could be similar to our above example - mark the order as needing a retry of the event to be raised at some later point in time, or retry in the same block, or include a resiliency layer on top of sending.

Undo

RabbitMQ doesn't support any sort of "undo" natively, so if we wanted to do this ourselves, we'd have to have some sort of compensating event published. Perhaps an event, "OrderNotActuallyCreatedJustKiddingAboutBefore"?

Perhaps not.

Coordinate

RabbitMQ does not natively support any sort of two-phase commit, so coordination is out.

Next steps

Now that we've examined all of our options around the various services our application integrates with, I want to evaluate each service in terms of the coupling we have today, and determine if we truly need that level of coupling.

Categories: Blogs

Refactoring Towards Resilience: Evaluating SendGrid Options

Thu, 02/02/2017 - 17:26

Other posts in this series:

In the last post, we found we had the full gamut of options for coordinating our distributed activity:

Coordination Options

We could ignore the failure (and have money floating in the ether), retry using an idempotency key, undo via a refund, or coordinate using the auth/capture flow Stripe provides. That's quite helpful that we have so many options available, it lets us be flexible in our approach. Part of this is the design of the Stripe API, and part is just the nature of payments.

SendGrid, a service to send email, naturally won't have as many options. The purpose of SendGrid is to send emails when you call their API, and to send that email as quickly as possible. You shouldn't call their API unless you want an email sent. With this in mind, let's look at our coordination options.

Ignore

With Stripe, the Ignore option wasn't really viable. We need to charge customers, but we don't want to charge them twice. With an email notification, it's not necessarily as critical.

In our original code, any SendGrid failure caused my entire process to fail:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

But what if it doesn't have to be this way? Does a SendGrid failure really need to take my entire payment process down? Should I not take people's money because I can't send them an email? Maybe not! The simplest approach would be just to ignore (and log) failures:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    try {
        await sendGridService.SendPaymentSuccessEmailAsync(order);
    } catch (Exception e) {
        Logger.Exception(e, $"Failed to send payment email for order {order.Id}");
    }
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

We'd still want to send emails, but we can be notified of these exceptions and have some manual (or automated) process to retry emails.

Retry

Unlike Stripe, SendGrid does not support any sort of idempotency in requests. This is intentional on their part, they want to send emails as soon as possible and introducing any sort of idempotency check would introduce a delay in the send.

All this means that Retrying a SendGrid call would result in two emails sent. Is this acceptable? Probably not - if I received two "payment successful" emails, I would rightly freak out and assume my card was charged twice.

Fundamentally, my emails are SMTP, another messaging protocol, which just doesn't support something like idempotent Retry. If I wanted true retries, I would need to build a resiliency layer on top of sending emails, something that SendGrid (and most email service providers I can think of) do not do.

No retries.

Undo

Similar to Undo, there's really no such thing as "undoing" an email sent. Perhaps a follow-up email "ignore that last email", but that's about it.

SendGrid does however support scheduling emails in a batch with an ID, and cancelling a batch. However, there are limitations to batches sent, and batches are as the name implies a means to send a bunch of emails at a time in a batch.

I don't really want to be tied to scheduling a batch of emails together, as the customer expectation is to get an email relatively soon after send, and I can't create a batch for every single email.

No undos.

Coordinate

Similar to Retry, there's really no facility to coordinate an email send with some sort of two-phase commit process. I'd have to build something on top of calling the SendGrid API to coordinate this action.

No coordination.

So where does that leave us? In our current state, we can either Retry or Ignore. Neither are great solutions, but we'll come back soon on better ways to tackle this problem with messaging.

Next up, RabbitMQ failures!

Categories: Blogs

Refactoring Towards Resilience: Evaluating Stripe Options

Wed, 02/01/2017 - 22:44

Other posts in this series:

In the last post, I looked at a common problem in web applications - accepting payments via 3rd party APIs (in this case, Stripe). We saw a fundamental issue in our design - because Stripe does not participate in our transactions, any failures after a call to Stripe will result in money being lost in the ether. For reference, our original code:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

How can we address this "missing money" problem? With each interaction, we have four basic options:

Coordination Options

We need to look at each of our Stripe interactions and based on which of these options are available, decide what coordination action we need to take. Above all though, we need to make sure the end user's expectations are still met. It's all for naught if we implement a back-end process that isn't explicitly communicated to the user.

Up until this point, our code picked option #1, "Ignore" for any failures. If a later failure occurred, we ignored the result and our action remained "successful". We can do better of course, and Stripe have a number of options available to use. Let's look at these options one at a time.

Retry

First, what about "Retry"? We could retry our Stripe payment, but that would result in two payments issues to the customer! Not exactly what we want. It turns out, however, that Stripe can handle idempotent requests by passing in some sort of client-generated idempotency key. From the docs, idempotency keys passed to Stripe expire within 24 hours. This means we can safely retry as many times as we want within 24 hours.

For our idempotency key, we can look at something unique in the cart, perhaps the Cart's ID? In this case, our code would look slightly different in our payments:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order, model.CartId);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");
}

With this in place, we can retry our action and our payment will be posted only once. But wait - who is retrying? What's the mechanism for our payment to be retried? We have a couple of options here. First, we can bake in some sort of "retry" in our Stripe service. Try X number of times, and finally fail and throw an exception if the payment didn't work.

That won't help in the case of a subsequent step failure, with SendGrid or RabbitMQ, however. If we wanted to retry the entire action, we'd likely need something on the client side to detect failures and retry the POST as necessary. In reality, we'd want to have more control over this retry process than driving it from the browser.

Undo

Undo is interesting in that it allows to perform a compensating action in case of subsequent failures. The "undo" action is highly dependent on what our action is in the first place. In the case of a Stripe payment, what would an "undo" action be? Like all payment gateways I'm aware of, Stripe allows the ability to refund any transaction. We only need to refund the customer if anything goes wrong:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    StripePayment payment = null;
    try {
        var customer = await dbContext.Customers.FindAsync(model.CustomerId);
        var order = await CreateOrder(customer, model);
        payment = await stripeService.PostPaymentAsync(order);
        await sendGridService.SendPaymentSuccessEmailAsync(order);
        await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    } catch {
        if (payment != null) {
            await stripeService.RefundPaymentAsync(payment);
        }
        throw;
    }
    return RedirectToAction("Success");
}

This doesn't quite address the problem, though, as something could still go wrong with our transaction. We really need to handle the database transaction explicitly ourselves. We'd want to extend our solution just a bit, devising a way to tell our MVC transaction filter to not open a transaction if we're explicitly opening one in our action:

[ExplicitTransaction]
public async Task<ActionResult> ProcessPayment(CartModel model) {  
    StripePayment payment = null;
    try {
        dbContext.BeginTransaction();
        var customer = await dbContext.Customers.FindAsync(model.CustomerId);
        var order = await CreateOrder(customer, model);
        payment = await stripeService.PostPaymentAsync(order);
        await sendGridService.SendPaymentSuccessEmailAsync(order);
        await bus.Publish(new OrderCreatedEvent { Id = order.Id });
        await dbContext.CommitTransactionAsync();
    } catch {
        if (payment != null) {
            await stripeService.RefundPaymentAsync(payment);
        }
        await dbContext.RollbackTransactionAsync();
        throw;
    }
    return RedirectToAction("Success");
}

Now we've got explicit control over our database transaction, and in the case of failure, we undo any Stripe transaction with a refund, and rollback our database transaction. All is tidied up! Well, unless the refund fails, but that's a different problem.

Coordinate

Stripe cannot participate in a two-phase commit - or can it? A two-phase commit with a payment gateway would look something like:

  1. Authorize the card for a charge
  2. Charge the card

It turns out that credit cards to support this sort of approach, it just can look weird in your account. You might see "pending transactions", and in fact, you're probably already used to seeing this. Many hotels and rental companies authorize a charge on your card, then either charge the card for incidentals or just let the authorization expire.

Stripe has this option as well, with the auth-capture flow. Un-captured authorizations will be reversed after 7 days, and on top of that, we can go into the Stripe admin site to see any un-captured authorizations to capture them as necessary. We can't really do two-phase commit inside our web flow (since we don't have an actual coordinator), but we can come close. In this new flow, we'd authorize after everything else succeeded:

[ExplicitTransaction]
public async Task<ActionResult> ProcessPayment(CartModel model) {  
    StripePayment payment = null;
    try {
        dbContext.BeginTransaction();
        var customer = await dbContext.Customers.FindAsync(model.CustomerId);
        var order = await CreateOrder(customer, model);
        payment = await stripeService.PostAuthorizationAsync(order);
        await sendGridService.SendPaymentSuccessEmailAsync(order);
        await bus.Publish(new OrderCreatedEvent { Id = order.Id });
        await dbContext.CommitTransactionAsync();
    } catch {
        await dbContext.RollbackTransactionAsync();
        throw;
    }
    try {
        await stripeService.PostPaymentCaptureAsync(payment);
    } catch {
        Logger.Warning("Stripe authorization failed; authorize manually via Stripe UI");
    }
    return RedirectToAction("Success");
}

Once everything's succeeded on our side, we can finalize the authorization against Stripe. If anything goes wrong with our "PostAuthorizationAsync" call, we'll just swallow the error since we can always go into the Stripe UI and authorize ourselves.

In actuality, none of these options are that great, but if I want to preserve the behavior from the client of "card will be charged at button click". In most e-commerce apps I've been involved with, we tend not to charge right at button click since failures are difficult to recover from.

In the next post, we'll look at our options with SendGrid.

Categories: Blogs