Skip to content


Use "Golden Image" to test Big Ball Of Mud software systems

Chris McMahon's Blog - Tue, 02/21/2017 - 01:42

So I had a brief conversation on Twitter with Noah Sussman about testing a software system designed as a "Big Ball Of Mud" (BBOM).

We could talk about the technical definition of BBOM, but in practical terms a BBOM is a system where we understand and expect that changing one part of the system is likely to cause unknown and unexpected results in other, unrelated parts of the system. Such systems are notoriously difficult to test, but I have tested them long ago in my career, and I was surprised that Noah hadn't encountered this approach of using a "Golden Image" to accomplish that.

Let's assume that we're creating an automated system here. Every part of the procedure I describe can be automated.

First you need some tests. And you'll need a test environment. BBOM systems come in many different flavors, so I won't specify a test environment too closely. It might be a clone of the production system, or a version of prod with fewer data.  It might be something different than that.

Then you need to be able to make a more-or-less exact copy of your test environment. This may mean putting your system on a VM or a Docker image, or it may be a matter of simply copying files. However you accomplish it, you need to be able to make faithful "Golden Image" copies of your test environment at a particular point in time.

Now you are ready to do some serious testing of a BBOM system using Golden Images:

Step One: Your test environment right now is your Golden Image. Make a copy of your Golden Image.

Step Two: Install the software to be tested on the copy of your Golden Image. Run your tests. If your tests pass, deploy the changes to production. Check to make sure that you don't have to roll back any of the production changes. If your tests fail or if your changes to production get rolled back, go back to Step One.

Step Three: the copy of your first Golden Image with the successful changes is your new Golden Image. You may or may not want to discard the now obsolete original Golden Image, see Step Five below.

Step Four: Add more tests for the system. Repeat the procedure at Step One.

Step Five (optional) You may want to be able to compare aspects of a current Golden Image test environment with previous versions of the Golden Image. Differences in things like test output behavior, file sizes, etc. may be useful information in your testing practice.

Categories: Blogs

Help Linnea

“There is a saying that it takes a whole village to raise a child. Now we need a whole village to save our Linnea”

Linnea, Kristoffer Nordströms daughter, is five and a half years and comes from Karlskrona in Sweden. Her world revolved up until recently around My Little Ponies, riding her bicycle and popcorn… lots of popcorn. She who has one best friend: her beloved big brother Kristian.
That was her world – until a few months ago when she suddenly and shockingly became afflicted, and got emergency surgery for a brain tumor.
After the operation, we hoped that the bad news would end. But now the family lives in the hospital and has been told that the tumor is an aggressive variety called DIPG (Diffuse Intrinsic Pontine Glioma). The short story is that there is a heart-breakingly minimal chance of survival using established treatments.

There is a possible treatment that we are now aiming for: one that means the tumor is treated through catheters implanted directly into the tumor. Studies and reports show that such a direct treatment gives Linnea the best chance of one day becoming healthy. The cost of treatment and the journeys are very high. Higher than the average person can pay for: £ 65.000 for the first operation and then £ 6.500 for treatments thereafter. In the current situation, it is unclear how many of these Linnea will need.

Please help Kristoffer and his family!

Categories: Blogs

Before Testing

Hiccupps - James Thomas - Mon, 02/20/2017 - 07:25

I happened across Why testers? by Joel Spolsky at the weekend. Written back in 2010, and - if we're being sceptical - perhaps a kind of honeytrap for Fog Creek's tester recruitment process, it has some memorable lines, including:
what testers are supposed to do ... is evaluate new code, find the good things, find the bad things, and give positive and negative reinforcement to the developers.Otherwise it’s depressing to be a programmer. Here I am, typing away, writing all this awesome code, and nobody really need very smart people as testers, even if they don’t have relevant experience. Many of the best testers I’ve worked with didn’t even realize they wanted to be testers until someone offered them the job.The job advert that the post points at is still there and reinforces the focus on testing as a service to developers and the sentiments about feedback, although it looks like, these days, they do require test experience.

It's common to hear testers say that they "fell into testing" and I've offered jobs to, and actually managed to recruit from, non-tester roles. On the back of reading Spolsky's blog I tweeted this:
#Testers, one tweet please. What did you do before testing? What's the most significant difference (in any respect) between that and now?— James Thomas (@qahiccupps) February 18, 2017 And, while it's a biased and also self-selected sample (to those who happen to be close enough to me in the Twitter network, and those who happened to see it in their timeline, and those who cared to respond) which has no statistical validity, I enjoyed reading the responses and wondering about patterns.

Please feel free to add your own story about the years BT (Before Testing) to either the thread or the comments here.
Categories: Blogs

Refactoring Towards Resilience: Async Workflow Options

Jimmy Bogard - Fri, 02/17/2017 - 23:44

Other posts in this series:

In the last post, we looked at coupling options in our 3rd-party resources we use as part of "button-click" place order, and whether or not we truly needed that coupling or not. As a reminder, coupling is neither good nor bad, it's the side-effects of coupling to the business we need to evaluate in terms of desirable and undesirable, based on tradeoffs. We concluded that in terms of what we needed to couple:

  • Stripe: minimize checkout fallout rate; process offline
  • Sendgrid: Send offline
  • RabbitMQ: Send offline

Basically, none of our actions we determined to need to happen right at button click. This doesn't hold true for every checkout page, but we can make that assumption for this example.

Sidenote - in the real-life version of this, we opted for Stripe - undo, SendGrid - ignore, RabbitMQ - ignore, with offline manual re-sending based on alerts.

With this in mind, we can design a process that manages the side effects of an order placement separate from placing the order itself. This is going to make our process more complicated, but distributed system tend to be more complicated if we decide we don't want to blissfully ignore failures.

Starting the worfklow

Now that we've decided we can process our three resources out-of-band, the next question becomes "how do I signal to that back-end processing to do its work?" This largely depends on what your backend includes, if you're on-prem or cloud etc etc. Azure for example offers a number of options for us for "async processing" including:

  • Azure WebJobs
  • Azure Service Bus
  • Azure Scheduler

In my situation, we weren't deploying to Azure so that wasn't an option for us. For on-prem, we can look at:

From these three, I'm inclined towards Hangfire as it's easy to integrate into my Web API/MVC/ASP.NET Core app. A single background job executing say, once a minute, can check for any pending messages and send them along:

RecurringJob.AddOrUpdate(() => {  
    using (var db = new CartContext()) {
        var unsent = db.OutboxMessages.ToList();
        foreach (var msg in unsent) {
}, "*/1 * * * *");

Not too complicated, and this will catch any unsent messages from our API that we tried to send after the DB transaction. Once a minute should be quick enough to catch unsent messages and still not have it seem like to the end user that they're missing emails.

Now that we've got a way to kick off our workflow, let's look at our workflow options themselves.

Workflow options

There's still some ordering I need to enforce on my external resource operations, as I don't want emails to be sent without payment success. Additionally, because of the resilience options we saw earlier, I don't really want to couple each operation together. Because of this, I really want to break my workflow in multiple steps:

In our case, we can look at three major workflows: Routing Slip, Saga, and Process Manager. The Process Manager pattern can further break down into more detailed patterns, from the Microservices book "Choreography" and "Orchestration", or as I detailed them a few years back, "Controller" and "Observer".

With these options in mind, let's look at each in turn to see if they would be appropriate to use for our workflow.

Routing Slip

Routing slip is an interesting pattern that allows each individual step in the process to be decoupled from the overall process flow. With a routing slip, our process would look like:

We start create a message that includes a routing slip, and include a mechanism to forward along:

Bus.Route(msg, new [] {"Stripe", "SendGrid", "RabbitMQ"}  

Where "RabbitMQ" is really just and endpoint at this point to publish a "OrderComplete" message.

From the business perspective, does this flow make sense? Going back to our coordination options for Stripe, under what conditions should we flow to the next step? Always? Only on successful payment? How do we handle failures?

The downside of the Routing Slip pattern is it's quite difficult to introduce logic to our workflow, handle failures, retries etc. We've used it in the past successfully, and I even built an extension to NServiceBus for it, but it tends to fall down in our scenario especially around Stripe where I might need to do an refund. Additionally, it's not entirely clear if we publish our "Order Complete" message when SendGrid is down. Right now, it doesn't look good.


In the saga pattern, we have a series of actions and compensations, the canonical example being a travel booking. I'm booking together a flight, car, and hotel, and only if I can book all 3 do I call my travel booking "complete":


Does a Saga make sense in my case? From our coordination examination, we found that only the Stripe resource had a capability of "Undo". I can't "Undo" a SendGrid call, nor can I "Undo" a message published.

For this reason, a Saga doesn't make much sense. Additionally, I don't really need to couple my process together like this. Sagas are great when I have an overall business transaction that I need to decompose into smaller, compensate-friendly transactions. That's clearly not what I have here.

Process Manager - Orchestration/Controller

Our third option is a process manager that acts as a controller, orchestrating a process/workflow from a central point:

Now, this still doesn't make much sense because we're coupling together several of our operations, making an assumption that our actions need to be coordinated in the first place.

So perhaps let's take a step back at our process, and examine what actually needs to be coordinated with what!

Process Examination

So far we've looked at process patterns and tried to bolt them onto our steps. Let's flip that, and go back to our original flow. We said our process had 4 main parts:

  1. Save order
  2. Process payment
  3. Email customer
  4. Notify downstream

From a coupling perspective, we said that "Process Payment" must happen only after I save the order. We also said that "Email customer" must happen only if "Process Payment" was successful. Additionally, our "notify downstream" step must only happen if our order successfully processed payment.

Taking a step back, isn't the "email customer" a form of notifying downstream systems that the order was created? Can we just make an additional consumer of the OrderCreatedEvent be one that sends onwards? I think so!

But we still have the issue of payment failure, so our process manager can handle those cases as well. And since we've already made payments asynchronous, we need some way to signal to the help team that an order is in a failed payment state.

With that in mind, our process will be a little of both, orchestration AND choreography:

We treat the email as just another subscriber of our event, and our process manager now is only really concerned about completing the order.

In our final post, we'll look at implementing our process manager using NServiceBus.

Categories: Blogs

Transferring testing skills

Agile Testing with Lisa Crispin - Fri, 02/17/2017 - 22:46
Transferring Agile Testing Skills webinarTransferring Agile Testing Skills webinar

Recently I did a short webinar with Agile Testing Days “AgileTD Mondays” series on “Transferring Testing Skills to the Whole Team. It’s about 20 minutes long, I hope it will inspire you to spread some testing love around your agile team!

How do you help non-testers on the team learn testing skills? I’d love to hear more ideas.

There are several other excellent webinars in the series – please check them all out!

The post Transferring testing skills appeared first on Agile Testing with Lisa Crispin.

Categories: Blogs

The Test Case Is Not The Test

DevelopSense Blog - Thu, 02/16/2017 - 20:17
A test case is not a test. A recipe is not cooking. An itinerary is not a trip. A score is not a musical performance, and a file of PowerPoint slides is not a conference talk. All of the former things are artifacts; explicit representations. The latter things are human performances. When the former things […]
Categories: Blogs

Security Testing Toolbox

Testing TV - Wed, 02/15/2017 - 14:54
Kali, Veil, Metasploit, BeEF. All tools in an arsenal that exist to break through security barriers of software. This talk introduces the tools available and shows how they are used to get through your defense. It is more a massive demo than a talk and is an exploration of the tools and what they do. […]
Categories: Blogs

Refactoring Towards Resilience: Evaluating Coupling

Jimmy Bogard - Tue, 02/14/2017 - 23:25

Other posts in this series:

So far, we've been looking at our options on how to coordinate various services, using Hohpe as our guide:

  • Ignore
  • Retry
  • Undo
  • Coordinate

These options, valid as they are, make an assumption that we need to coordinate our actions at a single point in time. One thing we haven't looked at is breaking the coupling of our actions, which greatly widens our ability to deal with failures. The types of coupling I encounter in distributed systems (but not limited to) include:

  • Behavioral
  • Temporal
  • Platform
  • Location
  • Process

In our code:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

Of the coupling types we see here, the biggest offender is Temporal coupling. As part of placing the order for the customer's cart, we also tie together several other actions at the same time. But do we really need to? Let's look at the three external services we interact with and see if we really need to have these actions happen immediately.

Stripe Temporal Coupling

First up is our call to Stripe. This is a bit of a difficult decision - when the customer places their order, are we expected to process their payment immediately?

This is a tough question, and one that really needs to be answered by the business. When I worked on the cart/checkout team of a Fortune 50 company, we never charged the customer immediately. In fact, we did very little validation beyond basic required fields. Why? Because if anything failed validation, it increased the chance that the customer would abandon the checkout process (we called this the fallout rate). For our team, it made far more sense to process payments offline, and if anything went wrong, we'd just call the customer.

We don't necessarily have to have a black-and-white choice here, either. We could try the payment, and if it fails, mark the order as needing manual processing:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    try {
        var payment = await stripeService.PostPaymentAsync(order);
    } catch (Exception e) {
        Logger.Exception(e, $"Payment failed for order {order.Id}");
    if (!order.PaymentFailed) {
        await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

There may also be business reasons why we can't process payment immediately. With orders that ship physical goods, we don't charge the customer until we've procured the product and it's ready to ship. Otherwise we might have to deal with refunds if we can't procure the product.

There are also valid business reasons why we'd want to process payments immediately, especially if what you're purchasing is digital (like a software license) or if what you're purchasing is a finite resource, like movie tickets. It's still not a hard and fast rule, we can always build business rules around the boundaries (treat them as reservations, and confirm when payment is complete).

Regardless of which direction we go, it's imperative we involve the business in our discussions. We don't have to make things technical, but each option involves a tradeoff that directly affects the business. For our purposes, let's assume we want to process payments offline, and just record the information (naturally doing whatever we need to secure data at rest).

SendGrid Temporal Coupling

Our question now is, when we place an order, do we need to send the confirmation email immediately? Or sometime later?

From the user's perspective, email is already an asynchronous messaging system, so there's already an expectation that the email won't arrive synchronously. We do expect the email to arrive "soon", but typically, there's some sort of delay. How much delay can we handle? That again depends on the transaction, but within a minute or two is my own personal expectation. I've had situations where we intentionally delay the email, as to not inundate the customer with emails.

We also need to consider what the email needs to be in response to. Does the email get sent as a result of successfully placing an order? Or posting the payment? If it's for posting the payment, we might be able to use Stripe Webhooks to send emails on successful payments. In our case, however, we really want to send the email on successful order placement not order payment.

Again, this is a business decision about exactly when our email goes out (and how many, for what trigger). The wording of the message depends on the condition, as we might have a message for "thank you for your order" and "there was a problem with your payment".

But regardless, we can decouple our email from our button click.

RabbitMQ Coupling

RabbitMQ is a bit of a more difficult question to answer. Typically, I generally assume that my broker is up. Just the fact that I'm using messaging here means that I'm temporally decoupled from recipients of the message. And since I'm using an event, I'm behaviorally decoupled from consumers.

However, not all is well and good in our world, because if my database transaction fails, I can't un-send my message. In an on-premise world with high availability, I might opt for 2PC and coordinate, but we've already seen that RabbitMQ doesn't support 2PC. And if I ever go to the cloud, there are all sorts of reasons why I wouldn't want to coordinate in the cloud.

If we can't coordinate, what then? It turns out there's already a well-established pattern for this - the outbox pattern.

In this pattern, instead of sending our messages immediately, we simply record our messages in the same database as our business data, in an "outbox" table":

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    dbContext.SaveMessage(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

Internally, we'll serialize our message into a simple outbox table:

public class Message {  
    public Guid Id { get; set; }
    public string Destination { get; set; }
    public byte[] Body { get; set; }

We'll serialize our message and store in our outbox, along with the destination. From there, we'll create some offline process that polls our table, sends our message, and deletes the original.

while (true) {  
    var unsentMessages = await dbContext.Messages.ToListAsync();
    var tasks = new List<Task>();
    foreach (var msg in unsentMessages) {
           .ContinueWith(t => dbContext.Messages.Remove(msg)));
    await Task.WhenAll(tasks.ToArray());

With an outbox in place, we'd still want to de-duplicate our messages, or at the very least, ensure our handlers are idempotent. And if we're using NServiceBus, we can quite simply turn on Outbox as a feature.

The outbox pattern lets us nearly mimic the 2PC coordination of messages and our database, and since this message is a critical one to send, warrants serious consideration of this approach.

With all these options considered, we're now able to design a solution that properly decouples our different distributed resources, still satisfying the business goals at hand. Our next post - workflow options!

Categories: Blogs

People are Strange

Hiccupps - James Thomas - Tue, 02/14/2017 - 19:01

Managers. They're the light in the fridge: when the door is open their value can be seen. But when the door is closed ... well, who knows?

Johanna Rothman and Esther Derby reckon they have a good idea. And they aim to show, in the form of an extended story following one manager as he takes over an existing team with problems, the kinds of things that managers can do and do do and - if they're after a decent default starting point - should consider doing.

What their book, Behind Closed Doors, isn't - and doesn't claim to be - is the answer to every management problem. The cast of characters in the story represent some of the kinds of personalities you'll find yourself dealing with as a manager, but the depth of the scenarios covered is limited, the set of outcomes covered is generally positive, and the timescales covered are reasonably short.

Michael Lopp, in Managing Humans, implores managers to remember that their staff are chaotic beautiful snowflakes. Unique. Individual. Special. Jim Morrison just says, simply, brusquely, that people are strange. (And don't forget that managers are people, despite evidence to the contrary.)

Either way, it's on the manager to care to look and listen carefully and find ways to help those they manage to be the best that they can be in ways that suit them. Management books necessarily use archetypes as a practical way to give suggestions and share experiences, but those new to management especially should be wary of misinterpreting the stories as a how-to guide to be naively applied without consideration of the context.

What Behind Closed Doors also isn't, unlike so much writing on management, is dry, or full of heroistic aphorisms, or preachy. In fact, I found it an extremely easy read for several reasons: it's well-written; it's short; the story format helps the reader along; following a consistent story gives context to situations as the book progresses; sidebars and an appendix keep detail aside for later consumption; I'm familiar with work by both of these authors already; I'm a fan of Jerry Weinberg's writing on management and interpersonal relationships and this book owes much to his insights (he wrote the foreword here); I agree with much of the advice.

What I found myself wanting - and I'd buy Rothman and Derby's version of this like a shot - is more detailed versions of some of the dialogues in this book with commentary in the form of the internal monologues of the participants. I'd like to hear Sam, the manager, thinking though the options he has when trying to help Kevin to learn to delegate and understand how he chose the approach that he took. I'd like to hear Keven trying to work out what he thinks Sam's motives are and perhaps rejecting some of Sam's premises.  I'd also like to see a deeper focus on a specific relationship over an extended period of time, with failures, and techniques for rebuilding trust in the face of them.

But while I wait for that, here's a few quotes that I enjoyed, loosely grouped.

On the contexts in which management takes place:
Generally speaking, you can observe only the public behaviors of managers and how your managers interact with you. Sometimes people who have never been in a management role believe that managers can simply tell other people what to do and that’s that. The higher you are in the organization, the more other people magnify your reactions. Because managers amplify the work of others, the human costs of bad management can be even higher than the economic costs. Chaos hides problems—both with people and projects. When chaos recedes, problems emerge. The moral of this fable is: Focus on the funded work.On making a technical contribution as a manager:
Some first-level managers still do some technical work, but they cannot assign themselves to the critical path.

It’s easier to know when technical work is complete than to know when management work is complete.

The more people you have in your group, the harder it is to make a technical contribution.

The payoff for delegation isn’t always immediate.

It takes courage to delegate.On coaching:
You always have the option not to coach. You can choose to give your team member feedback (information about the past), without providing advice on options for future behavior.

Coaching doesn’t mean you rush in to solve the problem. Coaching helps the other person see more options and choose from them.

Coaching helps another person develop new capability with support.

And it goes without saying, but if you offer help, you need to follow through and provide the help requested, or people will be disinclined to ask again.

Helping someone think through the implications is the meat of coaching.On team-building:
Jelled teams don’t happen by accident; teams jell when someone pays attention to building trust and commitment

Over time they build trust by exchanging and honoring commitments to each other.

Evaluations are different from feedback.

A one-on-one meeting is a great place to give appreciations.

[people] care whether the sincere appreciation is public or private ... It’s always appropriate to give appreciation for their contribution in a private meeting.

Each person on your team is unique. Some will need feedback on personal behaviors. Some will need help defining career development goals. Some will need coaching on how to influence across the organization.

Make sure the career development plans are integrated into the person’s day-to-day work. Otherwise, career development won’t happen.

"Career development" that happens only once a year is a sham.On problem solving:
Our rule of thumb is to generate at least three reasonable options for solving any problem.

Even if you do choose the first option, you’ll understand the issue better after considering several options.

If you’re in a position to know a problem exists, consider this guideline for problem solving: the people who perform the work need to be part of the solution.

We often assume that deadlines are immutable, that a process is unchangeable, or that we have to solve something alone. Use thought experiments to remove artificial constraints,

It’s tempting to stop with the first reasonable option that pops into your head. But with any messy problem, generating multiple options leads to a richer understanding of the problem and potential solutions

Before you jump to solutions, collect some data. Data collection doesn’t have to be formal. Look for quantitative and qualitative data.

If you hear yourself saying, “We’ll just do blah, blah, blah,” Stop! “Just” is a keyword that lets you know it just won’t work.

When the root cause points to the original issue, it’s likely a system problem.On managing:
Some people think management is all about the people, and some people think management is all about the tasks. But great management is about leading and developing people and managing tasks.

When managers are self-aware, they can respond to events rather than react in emotional outbursts.

And consider how your language affects your perspective and your ability to do your job.

Spending time with people is management work.

Part of being good at [Managing By Walking Around and Listening] is cultivating a curious mind, always observing, and questioning the meaning of what you see.

Great managers actively learn the craft of management.Image:
Categories: Blogs

Open Letter about Agile Testing Days cancelling US conference

Chris McMahon's Blog - Tue, 02/14/2017 - 03:21
I sent the following by email contact pages to Senator John McCain, Senator Jeff Flake, and Representative Martha McSally of Arizona in regard to Agile Testing Days cancelling their US conference on 13 February.

Agile Testing Days is a top-tier tech conference about software testing and Quality Assurance in Europe. They had planned their first conference in the USA to be held in Boston MA, with a speaker lineup from around the world. They cancelled the entire conference on 13 February because of the "current political situation" in the USA. Here is their statement:

Although I was not scheduled to attend or to speak at this particular conference, it is conferences such as Agile Testing Days where the best ideas in my field are presented, and it is from conferences such as Agile Testing Days that many of my peers get those ideas, and I rely on conversations from those who do speak and attend in order to stay current in my field.

As a resident of Arizona, cancelling such conferences affects me directly. I have enough expertise and skill to live anywhere I choose. I choose to live in Arizona, but my work absolutely depends on the free flow of people and information across national and state borders.

It is shameful that such a prestigious and respected multi-national software organization finds it necessary to cancel their first ever conference in the USA because of the outrageous policies of the current administration. I urge you to take measures to make organizations such as Agile Testing Days and their attendees and speakers feel safe and welcome, as they should be.

Chris McMahon
Senior Member of Technical Staff, Quality Assurance
Tucson, AZ
Categories: Blogs

Discomfort as a Tool for Change

Google Testing Blog - Mon, 02/13/2017 - 18:53
by Dave Gladfelter (SETI, Google Drive)
IntroductionThe SETI (Software Engineer, Tools and Infrastructure) role at Google is a strange one in that there's no obvious reason why it should exist. The SWEs (Software Engineers) on a project understand its problems best, and understanding a problem is most of the way to fixing it. How can SETIs bring unique value to a project when SWEs have more on-the-ground experience with their impediments?

The answer is scope. A SWE is rewarded for being an expert in their particular area and domain and is highly motivated to make optimizations to their carved-out space. SETIs (and Test Engineers and EngProdin general) identify and solve product-wide problems.

Product-wide problems frequently arise because local optimizations don't necessarily add up to product-wide optimizations. The reason may be the limits of attention, blind spots, or mis-aligned incentives, but a group of SWEs each optimizing for their own sub-projects will not achieve product-wide maxima.

Often SETIs and Test Engineers (TEs) know what behavior they'd like to see, such as more integration tests. We may even have management's ear and convince them to mandate such tests. However, in the absence of incentives, it's unlikely that the decisions SWEs make in response to such mandates will add up to the behavior we desire. Mandates around methods/practices are often ineffective. For example, a mandate of documentation for each public method on an interface often results in "method foo does foo."

The best way to create product-wide efficiencies is to change the way the team or process works in ways that will (initially) be uncomfortable for the engineering team, but that pays dividends that can't be achieved any other way. SETIs and TEs must work to identify the blind spots and negative interactions between engineering teams and change the environment in ways that align engineering teams' incentives. When properly incentivized, SWEs will make optimal decisions enhanced by product-wide vision rather than micro-management.
Common Product-Wide ProblemsHard-to-use APIsOne common example of local optimizations resulting in cross-team de-optimization is documentation and ease-of-use of internal APIs. The team that implements an internal API is not rewarded for making it easy to use except in the most oblique ways. Clients are compelled to use the internal APIs provided to them, so the API owner has a monopoly and will set the price of using it at "you must read all the code and debug it yourself" in the absence of incentives or (rare) heroes.
Big, slow releasesAnother example is large and slow releases. Without EngProd help or external pressure, teams will gravitate to the slowest, biggest release possible.

This makes sense from the position of any individual SWE: releases are painful, you have to ensure that there are no UI and API regressions, watch traffic and error rates for some time, and re-learn and use tools and processes that are complex and specific to releases.

Multiple teams will naturally gravitate to having one big release so that all of these costs can be bundled into one operation for "efficiency." The result is that engineers don't get feedback on features for weeks and versioning of APIs and data stores is ignored (since all the parts of the system are bundled together into one big release). This greatly slows down developer and feature velocity and greatly increases risks of cascading failures when the release fails.
How EngProd fixes product-wide problemsSETIs can nibble around the edges of these kinds of problems by writing tools and automation. TEs can create easy-to-use test environments that facilitate isolating and debugging faults in integration and ambiguities in APIs. We can use fancy technologies to sample live traffic and ensure that new versions of systems behave the same as previous versions. We can review design docs to ensure that they have an appropriate test plan. Often these actions do have real value. However, these are not the best way to align incentives to create a product-wide solution. Facilitating engineering teams' fruitful collaboration (and dis-incentivizing negative interactions) gives EngProd a multiplier that is hard to achieve with only tooling and automation.

Heroes are few and far between so we must turn to incentives, which is where discomfort comes in. Continuity is comfortable and change is painful. EngProd looks at how to change the problem so that teams are incentivized to work together fruitfully and disincentivized (discomforted) to pursue local optimizations exclusively.

So how does EngProd align incentives? Certainly there is a place for optimizing for optimal behaviors, such as easy-to-use integration environments. However, incentivizing optimal behaviors via negative feedback should not be overlooked. Each problem is different, so let's look at how to address the two examples above:
Incentivizing easy-to-use APIsEngineers will make the things they're incentivized to make. For APIs, make teams incentivized to provide integration help in the form of fakes. EngProd works with team leads to ensure there are explicit objectives to provide Fakes for their APIs as part of the rollout.

Fakesare as-simple-as-possible implementations of a service that still can be used to do pre-submit testing of client interactions with the system. They don't replace integration tests, but they reduce the likelihood of finding errors in subsequent integration test runs by an order of magnitude.
Furthermore, have some subset of the same client-owned and server-owned tests run against the fakes (for quick presubmit testing) as well as the real implementation (for continuous integration testing) and work with management to make it the responsibility of the Fake owner to debug any discrepancies for either the client- or the server-owned tests.

This reverses the pain! API owners, who are in a position to make APIs better, are now the ones experiencing negative incentives when APIs are not easy to use. Previously, when clients felt the pain, they had no recourse other than to file easily-ignored bugs ("Closed: working as intended") or contribute changes to the API owners' codebase, hurting their own performance with distractions.

This will incentivize API owners to design APIs to be as simple as possible with as few side-effects as possible, and to provide high-quality fakes that make it easy for clients to integrate with the API. Some teams will certainly not like this change at first, but I have seen API teams come to the realization that this is the best choice for the larger effort and implement these practices despite their cost to the team in the short run.

Helping management set engineering team objectives may not seem like a typical SETI responsibility, but although management is responsible for setting performance incentives and objectives, they are not well-positioned to understand how the low-level decisions of different teams create harmful interactions and lower cross-team performance, so they need SETI and TE guidance to create an environment that encourages optimal behaviors.
Fast, small releasesBeing forced to release more frequently than is required by feature deployment requirements has many beneficial side-effects that make release velocity a goal unto itself. SETIs and TEs faced with big, slow releases work with management to mandate a move to a set of smaller, more frequent releases. As release velocity is ratcheted up, negative behaviours such as too much manual testing or too much internal coupling become more painful, and many optimal behaviors are incentivized.
Less coupling between systemsWhen software is released together, it is easy to treat the seams between different components as implementation details. Resulting systems becoming so intertwined (coupled) that responsibilities between them are completely and randomly mixed and their interactions are too complex for any one person to understand. When two components are released separately and at different times, different versions of them must be compatible with one another. Engineers who were previously complacent about this fragility will become fearful of failed releases due to implicit contract changes. They will change their behavior in beneficial ways such as defining the contract between components explicitly and creating regression testing for it. The result is a system composed of robust, self-contained, more easily understood components.
Better/More automated testingManual testing becomes more painful as release velocity is ramped up. This will incentivize automated regression, UI and performance tests. This makes the team more agile and able to catch defects sooner and more cheaply.
Faster feedbackWhen incremental feature changes can be released to dogfood or other beta channels more frequently, user interaction designers and product managers get much faster feedback about what paths lead to better user engagement and experience than in big, slow releases where an entire feature is deployed simultaneously. This results in a better product.
ConclusionThe SETIs and TEs optimize interactions between teams and create fixes for product-wide, cross-team problems in order to improve engineering productivity and velocity. There are many worthwhile projects that EngProd can do using broad knowledge of the system and expertise in refactoring, automation and testing, such as creating test fixtures that enable continuous integration testing or identifying and combining duplicative tests or tools.

That said, the biggest problem that EngProd is positioned to solve is to break the chain of local optimizations resulting in cross-team de-optimizations. To that end, discomfort is a tool that can incentivize engineers to find solutions that are optimal for the entire product. We should look for and advocate for these transformative changes.
Categories: Blogs

The Bug in Lessons Learned

Hiccupps - James Thomas - Fri, 02/10/2017 - 21:52

The Test team book club read Lessons Learned in Software Testing the other week. I couldn't find my copy at the time but Karo came across it today, on Rog's desk, and was delighted to tell me that she'd discovered a bug in it...
Categories: Blogs

Refactoring Towards Resilience: Evaluating RabbitMQ Options

Jimmy Bogard - Fri, 02/10/2017 - 19:50

Other posts in this series:

In the last post, we looked at dealing with an API in SendGrid that basically only allows at-most-once calls. We can't undo anything, and we can't retry anything. We're going to find some similar issues with RabbitMQ (although it's not much different than other messaging systems).

RabbitMQ, like all queuing systems I can think of, offer a wide variety of reliability modes. In general, I try to make my message handlers idempotent, as it enables so many more options up stream. I also don't really trust anyone sending me messages so anything I can do to ensure MY system stays consistent despite what I might get sent is in my best interest.

Looking back at our original code:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

We can see that if anything fails after the "bus.Publish" line, we don't really know what happened to our message. Did it get sent? Did it not? It's hard to tell, but going to our picture of our transaction model:

Transaction flow

And our options we have to consider as a reminder:

Coordination Options

Let's take a look at our options dealing with failures.


Similar to our SendGrid solution, we could just ignore any failures with connecting to our broker:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    try {
        await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    } catch (Exception e) {
        Logger.Exception(e, $"Failed to send order created event for order {order.Id}");
    return RedirectToAction("Success");

This approach would shield us from connectivity failures with RabbitMQ, but we'd still need some sort of process to detect these failures and retry those sends later on. One way to do this would be simply to flag our orders:

} catch (Exception e) {
    order.NeedsOrderCreatedEventRaised = true;
    Logger.Exception(e, $"Failed to send order created event for order {order.Id}");

It's not a very elegant solution, as I'd have to create flags for every single kind of message I send. Additionally, it ignores the issue of a database transaction rolling back, but my message is still sent. In that case, my message will still get sent, and consumers could get events for things that didn't actually happen! There are other ways to fix this - but for now, let's cover our other options.


Retries are interesting in RabbitMQ because although it's fairly easy to retry my message on my side, there's no guarantee that consumers can support a message if it came in twice. However, in my applications, I try as much as possible to make my message consumers idempotent. It makes life so much easier, and allows so many more options, if I can retry my message.

Since my original message includes the unique order ID, a natural correlation identifier, consumers can have an easy way of ensuring their operations are idempotent as well.

The mechanics of a retry could be similar to our above example - mark the order as needing a retry of the event to be raised at some later point in time, or retry in the same block, or include a resiliency layer on top of sending.


RabbitMQ doesn't support any sort of "undo" natively, so if we wanted to do this ourselves, we'd have to have some sort of compensating event published. Perhaps an event, "OrderNotActuallyCreatedJustKiddingAboutBefore"?

Perhaps not.


RabbitMQ does not natively support any sort of two-phase commit, so coordination is out.

Next steps

Now that we've examined all of our options around the various services our application integrates with, I want to evaluate each service in terms of the coupling we have today, and determine if we truly need that level of coupling.

Categories: Blogs

Webinar Slides and Recording - Security Testing: The Missing Link in Information Security

Thanks to everyone who participated in today's webinar. I really enjoyed the time together, even if I did experience a complete system failure and restart in the latter part of the webinar. Just to let you know how the rest of today went, I was checking out this evening at Wal-mart (not self-checkout) and after I scanned my debit card, the pin pad displayed a message, "System shutdown in progress". I don't know what it is about me, but I swear, systems fail in my presence. It has been that way for over 20 years now! Oh, the joys of being a tester!

OK, here we go...

Here is the recording link. I have edited the video so that all slides are shown and discussed.

Here is a PDF with the slides in 2-up format.

Here is a PDF with the slides in full color format.

I hope you find the information helpful. Feel free to share it. I hope it can help you build the awareness of the need for security testing in your organization.



Categories: Blogs

Testing maturity in an agile/CDT environment

One day during a team meeting at Joep‘s previous job at a bank the Team Manager of Testing, listed a number of topics his testers could work on in the coming months. One of those topics was “testing maturity”. This topic was on the list not because this manager was such a fan of maturity models, but because the other team managers (Business Analysis and Development) had produced one for their own teams and higher management would like to have one for testing as well. And although Joep saw little value in a classic five-tiered maturity model either, he was intrigued by the question: so what can you do with respect to maturity models that is of value?

Joep asked Huib to help him think of a way to create a valuable, context-driven way to work on maturity. Since Huib had been working for the same bank, they met and discussed the possibilities. Soon they found out that the criteria should be variable since maturity depends on context. They started experimenting with stack ranking and quite soon they had the first version of their “maturity model”.

After a first try-out at the bank Joep worked, we let it rest for a while. After a couple of months we wrote this article. It is the first version and it needs to be refined and polished. The heuristics lists are probably to long and need to be reduced. We think of this model as a card game that can be played with teams.

Currently we are also working on an agile version of this model, a card game for agile teams to assess their “maturity” to help them to find possible areas for improvements. More about that later.

We are curious about your thoughts. What do you think? Maybe you want to try the game? Feel free to try it out. We hope you will share your experiences with us.

Article (pdf) – Card game (pdf)


Categories: Blogs


Hiccupps - James Thomas - Sun, 02/05/2017 - 06:36

What Really Happened in Y2K? That's the question Professor Martyn Thomas is asking in a forthcoming lecture and in a recent Chips With Everything podcast, from which I picked a few quotes that I particularly enjoyed.

On why choosing to use two digits for years was arguably a reasonable choice, in its time and context:
The problem arose originally because when most of the systems were being programmed before the 1990s computer power was extremely expensive and storage was extremely expensive. It's quite hard to recall that back in 1960 and 1970 a computer would occupy a room the size of a football pitch and be run 24 hours a day and still only support a single organisation.It was because those things were so expensive, because processing was expensive and in particular because storage was so expensive that full dates weren't stored. Only the year digits were stored in the data. On the lack of appreciation that, despite the eventual understated outcome, Y2K exposed major issues:
I regard it as a signal event. One of these near-misses that it's very important that you learn from, and I don't think we've learned from it yet. I don't think we've taken the right lessons out of the year 2000 problem. And all the people who say it was all a myth prevent those lessons being learned.On what bothers him today:
I'm [worried about] cyber security. I think that is a threat that's not yet being addressed strategically. We have to fix it at the root, which is by making the software far less vulnerable to cyber attack ... Driverless cars scare the hell out of me, viewed through the lens of cyber security.We seem to feel that the right solution to the cyber security problem is to train as many people as we can to really understand how to look for cyber security vulnerabilities and then just send them out into companies ... without recognising that all we're doing is training a bunch of people find all the loopholes in the systems and then encourage companies to let them in and discover all their secrets.Similarly, training lots of school students to write bad software, which is essentially what we're doing by encouraging app development in schools, is just increasing the mountain of bad software in the world, which is a problem. It's not the solution.On building software:
People don't approach building software with the same degree of rigour that engineers approach building other artefacts that are equally important. The consequence of that is that most software contains a lot of errors. And most software is not managed very well.One of the big problems in the run-up to Y2K was that most major companies could not find the source code for their big systems, for their key business systems. And could not therefore recreate the software even in the form that it was currently running on their computers.  The lack of professionalism around managing software development and software was revealed by Y2K ... but we still build software on the assumption that you can test it to show that it's fit for purpose.On the frequency of errors in software:
A typical programmer makes a mistake in, if they're good, every 30 lines of program. If they're very, very good they make a mistake in every 100 lines. If they're typical it's in about 10 lines of code. And you don't find all of those by testing.  On his prescription:
The people who make the money out of selling us computer systems don't carry the cost of those systems failing. We could fix that. We could say that in a decade's time - to give the industry a chance to shape up - we were going to introduce strict liability in the way that we have strict liability in the safety of children's toys for example.Image: 
Categories: Blogs

Discovered comments

Rico Mariani's Performance Tidbits - Fri, 02/03/2017 - 11:47

I just discovered that there are all kinds of comments pending approval, some from several months ago. Not something I was used to on the old platform.

So apologies for not responding to those comments sooner, I got too used to the old system. Traditionally I’ve not censored any comments and I’ll try to do the same with these. Although one was an outright flame so there’s no point in that… but look for them in the next few days. If you see something of yours pop up, please forgive me, I didn’t mean to censor anything. It was just pilot error.

And, big surprise, most of the comments are people who think I’ve lost my mind about 64 bit processor benefits.

Maybe they’re right?

Categories: Blogs

symbol filter redux

Rico Mariani's Performance Tidbits - Fri, 02/03/2017 - 04:05

A while ago I provided this local symbol server proxy you could use to get just the symbols you want.  I was watching it work a couple of weeks ago and I noticed that most of the time what it ends up doing is proxying a 302 redirect.  Which was kind of cool because that meant that it didn’t actually have to do much heavy lifting at all and the symbols were coming directly from the original source.  That’s when it hit me that I had been doing it wrong the whole time.  So I deleted all the of the proxy code.  What it does now is that it always serves either a 404 if the request isn’t on the white-list or it serves a 302 if it is on the white list.  It just redirects according to the path provided.

The original article is here To use it, instead of doing this:

set _NT_SYMBOL_PATH=srv*http://yourserver/yourpath

Do this:

set _NT_SYMBOL_PATH=srv*http://localhost:8080/yourserver/yourpath

If the pattern matches a line in symfilter.txt it will serve a 302 redirect to http://yourserver/yourpath/[the usual pdb request path]

Note the syntax is slightly different than the original version.

Now that it’s only serving redirects or failures it probably could use the http listener directly because the reasons for doing sockets myself are all gone.  But I didn’t make that change.

The code is here:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Net;
using System.Net.Sockets;

namespace SymbolFilter
    class FilterProxy
        // we'll accept connections using this listener
        static TcpListener listener;

        // this holds the white list of DLLs we do not ignore
        static List dllFilterList = new List();

        // default port is 8080, config would be nice...
        const int port = 8080;

        static void Main(string[] args)
            // load up the dlls

            // open the socket

            // all real work happens in the background, if you ever press enter we just exit
            Console.WriteLine("Listening on port {0}.  Press enter to exit", port);

        static void InitializeDllFilters()
                // we're just going to throw if it fails...
                StreamReader sr = new StreamReader("symfilter.txt");

                // read lines from the file
                string line;
                while ((line = sr.ReadLine()) != null)


            catch (Exception e)
                // anything goes wrong and we're done here, there's no recovery from this

        // Here we will just listen for connections on the loopback adapter, port 8080
        // this could really use some configuration options as well.  In fact running more than one of these
        // listening to different ports with different white lists would be very useful.
        public static void StartHttpListener()
                listener = new TcpListener(IPAddress.Loopback, port);
                listener.BeginAcceptTcpClient(new AsyncCallback(DoAcceptTcpClientCallback), listener);
            catch (Exception e)
                // anything goes wrong and we're done here, there's no recovery from this


        // when we accept a new listener this callback will do the job
        public static void DoAcceptTcpClientCallback(IAsyncResult ar)
            // Get the listener that handles the client request.
            TcpListener listener = (TcpListener)ar.AsyncState;

            // I should probably assert that the listener is the listener that I think it is here

            // End the operation and display the received data on  
            // the console.
                // end the async activity and open the client
                // I don't support keepalive so "using" is fine
                using (TcpClient client = listener.EndAcceptTcpClient(ar))
                    // Process the http request
                    // Note that by doing it here we are effectively serializing the listens
                    // that seems to be ok because mostly we only redirect
            catch (Exception e)
                // if anything goes wrong we'll just move on to the next connection

            // queue up another listen
            listener.BeginAcceptTcpClient(new AsyncCallback(DoAcceptTcpClientCallback), listener);

        // we have an incoming request, let's handle it
        static void ProcessHttpRequest(TcpClient client)
            // we're going to process the request as text
            NetworkStream stream = client.GetStream();
            StreamReader sr = new StreamReader(stream);

            // read until the first blank line or the end, whichever comes first
            var lines = new List();
            for (;;)
                var line = sr.ReadLine();

                if (line == null || line == "")


            // e.g. "GET /foo.pdb/DE1EBC3EE7E542EA96B066229D3A40081/foo.pdb HTTP/1.1"
            var req = lines[0];

            // avoid case sensitivity issues for matching the pattern
            var reqLower = Uri.UnescapeDataString(req).ToLowerInvariant();

            // loop over available patterns, if any matches early out
            int i;
            for (i = 0; i < dllFilterList.Count; i++)
                if (reqLower.Contains(dllFilterList[i]))

            // if we didn't match, or it isn't a GET or it isn't HTTP/1.1 then serve up a 404
            if (i == dllFilterList.Count || !req.StartsWith("GET /") || !req.EndsWith(" HTTP/1.1"))
                // you don't match, fast exit, this is basically the whole point of this thing
                // this is the real work
                Console.WriteLine("Matched pattern: {0}", dllFilterList[i]);
                RedirectRequest(client, req);

        // cons up a minimal 404 error and return it
        static void Return404(TcpClient client)
            // it doesn't get any simpler than this
            var sw = new StreamWriter(client.GetStream());
            sw.WriteLine("HTTP/1.1 404 Not Found");

        // cons up a minimal 404 error and return it
        static void Return302(TcpClient client, string server, string url)
            string line = String.Format("Location: http://{0}{1}", server, url);
            Console.WriteLine("302 Redirect {0}", line);

            // emit the redirect
            var sw = new StreamWriter(client.GetStream());
            sw.WriteLine("HTTP/1.1 302 Redirect");

        static void RedirectRequest(TcpClient client, string req)
            // we know this is safe to do because we already verified that the request starts with "GET /"
            string request = req.Substring(5); // strip off the "GET /"
            request = request.Substring(0, request.Length - 9);  // strip off " HTTP/1.1"

            // we're looking for the server that we are supposed to use for this request in our path
            // the point of this is to make it so that you can set your symbol path to include
            // http://localhost:8080/your-sym-server/whatever
            // and your-sym-server will be used by the proxy
            int delim = request.IndexOf('/', 1);

            // if there is no server specified then we're done
            if (delim < 0)
                // serve up an error

            // the target server is everything up to the / but not including it
            string targetServer = request.Substring(0, delim);

            // the new request URL starts with the first character after the |
            request = request.Substring(delim + 1);

            // if there isn't already a leading slash, then add one
            if (!request.StartsWith("/"))
                request = "/" + request;

            Return302(client, targetServer, request);
Categories: Blogs

You Rang!

Hiccupps - James Thomas - Fri, 02/03/2017 - 00:01

So, last year I blogged about an approach I take to managing uncertainty: Put a Ring on It.

The post was inspired by a conversation I'd had with several colleagues in a short space of time, where I'd described my mental model of a band I put around all the bits of the problem I can't deal with now, leaving behind the bits that are tractable.

After doing that, I can proceed, right now, on whatever is left. I've encircled the uncertainty with a dependency on some outside factor, and I don't need to think about the parts inside it until the dependency is resolved. (Or the context changes.)

And this week I was treated to a beautifully simple implementation of it, from one of those colleagues. In a situation in which many things might need doing - but the number and nature is unknown - she controlled the uncertainty with a to-do list and a micro-algorithm:
  • do the thing now, completely, only if it's easy and important
  • do a pragmatic piece now, if it's needed but not easy, and revisit it later (via the list) 
  • otherwise, put it on the list

Uncertainty encountered. And ringed with a list. And mental energy conserved. And progress consistently made.
Categories: Blogs

Refactoring Towards Resilience: Evaluating SendGrid Options

Jimmy Bogard - Thu, 02/02/2017 - 17:26

Other posts in this series:

In the last post, we found we had the full gamut of options for coordinating our distributed activity:

Coordination Options

We could ignore the failure (and have money floating in the ether), retry using an idempotency key, undo via a refund, or coordinate using the auth/capture flow Stripe provides. That's quite helpful that we have so many options available, it lets us be flexible in our approach. Part of this is the design of the Stripe API, and part is just the nature of payments.

SendGrid, a service to send email, naturally won't have as many options. The purpose of SendGrid is to send emails when you call their API, and to send that email as quickly as possible. You shouldn't call their API unless you want an email sent. With this in mind, let's look at our coordination options.


With Stripe, the Ignore option wasn't really viable. We need to charge customers, but we don't want to charge them twice. With an email notification, it's not necessarily as critical.

In our original code, any SendGrid failure caused my entire process to fail:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    await sendGridService.SendPaymentSuccessEmailAsync(order);
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

But what if it doesn't have to be this way? Does a SendGrid failure really need to take my entire payment process down? Should I not take people's money because I can't send them an email? Maybe not! The simplest approach would be just to ignore (and log) failures:

public async Task<ActionResult> ProcessPayment(CartModel model) {  
    var customer = await dbContext.Customers.FindAsync(model.CustomerId);
    var order = await CreateOrder(customer, model);
    var payment = await stripeService.PostPaymentAsync(order);
    try {
        await sendGridService.SendPaymentSuccessEmailAsync(order);
    } catch (Exception e) {
        Logger.Exception(e, $"Failed to send payment email for order {order.Id}");
    await bus.Publish(new OrderCreatedEvent { Id = order.Id });
    return RedirectToAction("Success");

We'd still want to send emails, but we can be notified of these exceptions and have some manual (or automated) process to retry emails.


Unlike Stripe, SendGrid does not support any sort of idempotency in requests. This is intentional on their part, they want to send emails as soon as possible and introducing any sort of idempotency check would introduce a delay in the send.

All this means that Retrying a SendGrid call would result in two emails sent. Is this acceptable? Probably not - if I received two "payment successful" emails, I would rightly freak out and assume my card was charged twice.

Fundamentally, my emails are SMTP, another messaging protocol, which just doesn't support something like idempotent Retry. If I wanted true retries, I would need to build a resiliency layer on top of sending emails, something that SendGrid (and most email service providers I can think of) do not do.

No retries.


Similar to Undo, there's really no such thing as "undoing" an email sent. Perhaps a follow-up email "ignore that last email", but that's about it.

SendGrid does however support scheduling emails in a batch with an ID, and cancelling a batch. However, there are limitations to batches sent, and batches are as the name implies a means to send a bunch of emails at a time in a batch.

I don't really want to be tied to scheduling a batch of emails together, as the customer expectation is to get an email relatively soon after send, and I can't create a batch for every single email.

No undos.


Similar to Retry, there's really no facility to coordinate an email send with some sort of two-phase commit process. I'd have to build something on top of calling the SendGrid API to coordinate this action.

No coordination.

So where does that leave us? In our current state, we can either Retry or Ignore. Neither are great solutions, but we'll come back soon on better ways to tackle this problem with messaging.

Next up, RabbitMQ failures!

Categories: Blogs