"I couldn’t possibly list [...] all the crucial changes we made to make [#MicrosoftEdge] possible. Dozens of big initiatives and literally thousands of smaller improvements (and removals!) were needed to get us here. I certainly can’t say that our journey was 100% free of stumbles, because no worthwhile journey is, but I can say that in my 27 years at MS I’ve not seen anything like this combination of challenges: engineering, organizational, and operational. Any reasonable person could look at those challenges and conclude that no group could reasonably be expected to succeed with so many things in flight. But we did it anyway!"
"As the perf guy, it’s my job to tell you every day how bad everything is, that’s pretty much the gig. But today, just this one time, I’m gonna tell you the rest of what I’m thinking – it’s pretty smokin’ guys. You should be proud of what you’ve accomplished – I know I am."
This link has the details.
Approach #1 -- Non-deterministic Finite Automaton
- This approach first converts the regular expression into a non-deterministic state machine, which is just like a state machine except you have to keep track of all the states you might be in. Sort of like when you have a choice of going left or right you go both ways and remember it could be either. The state machine you get from this approach is proportional to the size of the expression in a fairly direct way and does not depend on the text you will match at all, just the regular expression. The maximum number of states you might have to track at any given moment is likewise determined by the expression in a fairly direct way and again does not depend on what you are matching, just the pattern.
- This is the "medium" approach to the problem, and it is popular in command line tools like grep/sed etc. Many text editors use this approach internally.
- Typically a library that does this offers a compile step to compile the RE first into a machine and then a match method to do the matching.
- The time to match an expression is linear in the size of the input.
- In this approach you basically post-process the NDFA from step number 1 into a regular state machine. You can always do this, you can imagine that if in the NDFA you were in the state [maybe #1, maybe #7, maybe #22] that in the DFA you're in a single state called [maybe_1_maybe_7_maybe_22], so there is one distinct state for every combination you might have been in. The advantage of this approach is that once you have set up the DFA the amount of housekeeping you have to do at each step is much smaller.
- Like in #1 your final run time is not affected by the complexity of the expression, but only by the size of the text you are matching, linearly. Also like #1 there is a compile step which you have to do once per expression, so the total cost includes both compilation and matching.
- Command line tools like "egrep" and "lex" use this approach. Here's the thing, if the tool is something like "lex" then you only pay the cost of the compilation once when you generate your machine. Your already cooked state machine is then compiled directly into your program. So that's a cool trade off.
- The downside is that the number of states you could need can get very large indeed. In the worst case if the NDFA had N states then the DFA would have 2^N states. But usually that doesn't happen.
- This is the "Heavy" approach in terms of initial cost
- This doesn't create a machine, and in fact you don't need any extra state other than a stack. In this approach when faced with a pattern like A|B you first try A and if that doesn't work you rewind and try B. This is all fine and well except if you have even a simple pattern like .*.*X, you find yourself trying to swallow all you can with the .* by which time you're way past the X so you rewind but of course the next .* then grabs way too much, so you rewind that... Now you can imagine if you had a very long line that didn't match X in it at all you could spend all kinds of time trying to eat different combinations of leading stuff... This pattern will be N^2 in the size of the input. Now this kind of stuff happens all the time... and you can make it much, much worse (.*.*)*X just compounds the problem. If that pattern looks too unrealistic to you how do you feel about ".*=.*+.*;" to match anything that looks like X=Y+Z; that pattern is a cosmic disaster...
- Thing is for very simple patterns #3 will often do better because there's no compilation. It could get bad but often it doesn't. Some libraries do this.
- This is the "Light" approach in terms of initial cost
For more details consider these references: