Sat solver on top of regex matcher (yurichev.com)
91 points by justinucd on June 22, 2020 | hide | past | favorite | 52 comments


> Another practical usage I've heard: match "string" or 'string', but not "string'.

You don't need backreferences for that:

    '[^']*'|"[^"]*"
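A quick check of that claim in Python (using the standard `re` module, which the article also uses):

```python
import re

# The alternation from above: a single-quoted string or a
# double-quoted one; no backreferences needed.
pattern = re.compile("'[^']*'|\"[^\"]*\"")

text = "match 'string' or \"string\", but not \"string'"
print(pattern.findall(text))  # ["'string'", '"string"']
```

The mismatched `"string'` at the end is left unmatched, because neither alternative can pair a `"` with a `'`.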


The common case of only two pairs of quotes is indeed regular, but if you want to support either all Unicode quotes (about 60 pairs of them) or C++11 raw string literals `R"delim(...)delim"` (intrinsically not regular) you are out of luck.


For any finite set of quotation character pairs you can get away with a strategy like `(left_char1)[^right_char1]*(right_char1)|(left_char2)[^right_char2]*(right_char2)|...`. Escape characters aren't much harder to accommodate.


Of course, but you are out of luck in terms of complexity. (Colloquial) regular expressions lack any kind of abstractions.


Absolutely. I don't for a moment think they're the right tool for the job.

They are fairly powerful in terms of what they can parse, however (not enough for an arbitrary HTML document, but enough for the hairier situations in this thread that people thought they couldn't handle), which means a regular expression generator can handle all of those situations too, and potentially be much more readable.

If I found myself writing code like this I'd still want to reach for a better parsing technology, but you can use other languages to add abstractions to regex. Here's a Python3.6+ example assuming any desired backslashes have already been applied:

  '|'.join(rf'{a}[^{b}]*{b}' for a,b in pairs)
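For instance, a toy run with a hypothetical `pairs` list that includes a Unicode pair (real code would also `re.escape` the pair characters):

```python
import re

# Hypothetical pre-escaped (open, close) quote pairs, including guillemets.
pairs = [("'", "'"), ('"', '"'), ('«', '»')]
pattern = '|'.join(rf'{a}[^{b}]*{b}' for a, b in pairs)

print(re.findall(pattern, "say «bonjour» and 'hi'"))
# ['«bonjour»', "'hi'"]
```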


Won't work if you're already in a string, or if there are escaped quotes in the string. Also won't work if you have two or more double quoted strings that both contain an apostrophe.


Escaping isn't an intrinsic property of all quoted strings (e.g. single quoted strings in bash), but even so one can work around them without backreferences, by searching for anything that's not a quote or a backslash, _or_ any escaped character:

    /"([^"\\]|\\.)*"/
Now double that up with a single quote version if you wish.
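In Python, the same pattern (a direct transcription of the one above) looks like:

```python
import re

# A double-quoted string: any run of "not a quote or backslash,
# OR a backslash followed by anything", between two quotes.
pattern = re.compile(r'"(?:[^"\\]|\\.)*"')

text = r'say "she said \"hi\"" end'
print(pattern.search(text).group())  # "she said \"hi\""
```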

What you can't match without backreferences, however, is strings with customisable terminators, e.g. the behaviour in sed that whatever character you use after `s` is the regex terminator (it doesn't have to be `/`), or raw strings in C++.
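For illustration, a backreference version of both cases, capturing the delimiter and reusing it (my own toy sketches, ignoring escaping):

```python
import re

# sed-style: whatever single character follows `s` is the delimiter.
sed = re.compile(r's(.)(.*?)\1(.*?)\1')
print(sed.match('s#foo#bar#').groups())  # ('#', 'foo', 'bar')

# C++11 raw string literal: R"delim( ... )delim" -- the lazy body
# only stops at a `)` that is followed by the captured delimiter.
raw = re.compile(r'R"([^(]*)\((.*?)\)\1"')
print(raw.search('auto s = R"xx(a )x" inside)xx";').groups())
# ('xx', 'a )x" inside')
```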


Backreferences don't really help with those problems.

> Won't work if you're already in a string

This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.

> or if there are escaped quotes in the string.

Solvable:

    '(\\.|[^'\\])*'|"(\\.|[^"\\])*"
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.

The regex in my previous comment already solves that. See: https://repl.it/repls/SolidCapitalProgram
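To spell that out in Python (same pattern as in the earlier comment):

```python
import re

# Two double-quoted strings, each containing an apostrophe: each
# alternative is anchored to its own quote type, so apostrophes
# inside double quotes are just ordinary characters.
pattern = re.compile("'[^']*'|\"[^\"]*\"")
print(pattern.findall('"don\'t" + "won\'t"'))
# ['"don\'t"', '"won\'t"']
```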


Raku allows strings inside of strings. Of course it does this by way of embedded closures.

    "abc{ "def" }"
Which allows it to be arbitrarily deep.

    "a{ "b{ "c{ "d{ "e{ "f" }g" }h" }i" }j" }k"
    → "abcdefghijk"
This can be handy to generate the correct string.

    my $count = 3;
    "I went to $count place{ "s" if $count ≠ 1 } today"


Interesting, thanks for pointing out a use case. But I don't think backreferences will help with that, it needs to be parsed by something more powerful than a regex.

And that example reminds me that Bash can do something similar:

    echo "$(echo "$(echo "$(echo "hi")")")"


The Rakudo implementation actually uses Raku regexes to parse Raku. To be fair though it is a lot easier to do that with the redesigned regexes that Raku has.

Basically you can use backreferences for that if you also allow the regex to be recursive.

    my $regex = /
      :ratchet
      $<q> = (<["']>) # the beginning quote

      {}:my $q = ~$<q>; # put it into a more normal lexical var

        # capture between " and {
        $<l> = ( [ <!before $q> <-[{}]> ]* )

        [
          [
            :sigspace
            「{」
                <self=&?BLOCK>? # recurse
            「}」
          ]

          {$q = ~$<q>}

          # capture between } and "
          $<r> = ( [ <!before $q> <-[{}]> ]* )
        ]?

      "$q" # match the end quote

      # pass the combined string parts upwards
      { make ($<l> // '') ~ ($<self>.ast // '') ~ ($<r> // '') }
    /;

    「'a{ "b{ "c{ "d{ 'e{ "f" }g' }h" }i" }j" }k'」 ~~ /^ <r=$regex> $ { make $<r>.ast }/;

    say $/.ast;
    # abcdefghijk
Note that `Regex` is a subtype of `Block`. That is why `&?BLOCK` can be used as a reference to the regex itself.

`<foo=bar>` is a way to call `bar`, but also save it under the name of `foo`. `$<foo> = …` is a way to capture `…` and save it under the name of `foo`.

---

It is a lot nicer and modular when you use regexes as part of a grammar:

    # use Grammar::Tracer;
    grammar String::Grammar {
      token TOP { <strings> }

      rule strings {
        # at least one string
        # if there are more than one they are separated by ~
        <string> + % 「~」
      }

      token string {
        $<q> = <["']>

        # set a dynamic variable to the quote character
        {}:my $*quote = ~$<q>;

        <string-part>*

        "$<q>"
      }

      # multiple tokens that act like one
      # which is nicer than using |
      proto token string-part {*}
      multi token string-part:<non> {
        [ <-[{}]> <!after $*quote> ]+
      }
      multi token string-part:<block> {
        <block>
      }

      rule block {
        「{」 ~ 「}」 <strings>?
      }
    }

    class String::Actions {
      method TOP     ($/) { make     $<strings>.ast }
      method strings ($/) { make [~] @<string>».ast }
      method string  ($/) { make [~] @<string-part>».ast }
      method block   ($/) { make     $<strings>.ast }

      method string-part:<non>   ($/) { make ~$/ }
      method string-part:<block> ($/) { make $<block>.ast }
    }

    say String::Grammar.parse(
        「"a{ "b{ "c{ "d{ "e{ "f" }g" ~ "zz" }h" }i" }j" }k"」,
        :actions( String::Actions ),
    ).ast;
    # abcdefgzzhijk
A `token` is just a `regex` with `:ratchet` mode turned on (prevents backtracking). A `rule` is just a `token` with `:sigspace` also turned on (makes it easier to deal with optional whitespace).

Every instance of `<foo>` is basically a method call.

`make` is about generating an `.ast` to pass up and out of the parse. In this case the only thing the actions class does is return what would be the resulting string if it were compiled in Raku.


Re: already in a string, one of the primary uses of regex is to search from point in a text editor. So, cursor is in a string and you want to find the next string. Regex won't work on its own, you generally need more semantic information to differentiate opening & closing quotes (unless you can use local context from that particular language to infer it).

But more broadly, any situation where you search from a non-zero index has this problem.

I'm surprised your example works in Python. Is that a property of Python's parser, or all regex matchers?


> Regex won't work on its own, you generally need more semantic information

Yeah, I agree. My point was that regex won't work, regardless of if you have backreferences or not. So backreferences won't help.

> But more broadly, any situation where you search from a non-zero index has this problem.

I'm not sure I understand that. A lot of regex libraries let you specify a start index. It won't take into account data from before the start index though (regex doesn't really do that, regardless of backreferences). If your regex library doesn't support passing in a start index, you can just take a substring starting at that index, then search the substring.

I don't think Python is really special. Python's findall() is just a convenience function that does a loop finding a match, then finding another match that starts after the first match, etc. Most languages provide a way to find the end point of the most recent match, and then you can just write the loop yourself to start the next search at that point.
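A sketch of that loop in Python, using `Match.end()` as the next start index (assuming the pattern can't match the empty string):

```python
import re

pattern = re.compile("'[^']*'|\"[^\"]*\"")
text = "a 'one' b \"two\" c"

# Restart each search at the end of the previous match --
# essentially what findall() does internally.
matches, pos = [], 0
while (m := pattern.search(text, pos)):
    matches.append(m.group())
    pos = m.end()
print(matches)  # ["'one'", '"two"']
```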


> This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.

    query = "select * from table where name like \"%foo\""


Interesting. Although in that situation I think it would be easier to find the outer string, then unescape it, then find the inner string.
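That outer-then-unescape idea can be sketched like this (assuming backslash escaping, using the SQL example from the parent comment as raw source text):

```python
import re

def unescape(s):
    # Turn each backslash-escaped character back into itself.
    return re.sub(r'\\(.)', r'\1', s)

# The parent comment's source line, as raw text:
src = r'query = "select * from table where name like \"%foo\""'

outer = re.compile(r'"(?:[^"\\]|\\.)*"')
outer_str = outer.search(src).group()   # the whole outer literal
inner_src = unescape(outer_str[1:-1])   # strip quotes, unescape
print(outer.search(inner_src).group())  # "%foo"
```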

Although if you want to do it with pure regex, it can be done without backreferences too, although it would be exponentially large as you get more and more levels of nesting, whereas with backreferences I think it would only get quadratically large.


As some other replies pointed out, there are straightforward modifications to handle all those scenarios if those are your requirements instead. A place where regex _does_ fail is in arbitrarily nested string interpolations (the key being _arbitrary_ nesting because with enough time anyone can come up with a convoluted enough regex to handle a bounded degree of recursion).


That wasn’t specified in the requirements


Good point.


At some point you'd want to pull out a parser. Maybe 1 more step after this point.


That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the canonical NP-complete problem.
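For concreteness, here is a toy version of that kind of reduction in Python (my own minimal encoding, not necessarily the article's exact one): each `(x?)` nondeterministically captures `x` (true) or the empty string (false), and each clause segment `x;` can only be consumed when at least one of its literals is satisfied.

```python
import re

# Formula: (a OR b) AND (NOT a OR b); a is group 1, b is group 2.
# Variable section: each (x?)x?; consumes "x;" either way, but the
# capture records the chosen truth value. Clause section: a positive
# literal contributes \i; (needs the group to be "x"); a negated one
# contributes \ix; (needs the group to be empty).
pattern = r'^(x?)x?;(x?)x?;(?:\1;|\2;)(?:\1x;|\2;)$'
m = re.match(pattern, 'x;x;x;x;')
print({'a': m.group(1) == 'x', 'b': m.group(2) == 'x'})

# An unsatisfiable formula, (a) AND (NOT a), simply fails to match:
print(re.match(r'^(x?)x?;(?:\1;)(?:\1x;)$', 'x;x;x;'))  # None
```

The regex engine's backtracking does the exhaustive search over assignments for you, which is exactly why matching with backreferences is hard in general.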


This line is super smart, and yet, even though I should know a lot about regexes and NP-completeness, my head feels dizzy as I try to make full sense of it. A sign I'm getting old or dumb, perhaps :(

Jokes apart: I'd love for you to elaborate a bit more on this. I'm pretty sure I would benefit a lot from a more expanded, "dumber" explanation.


>> That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the canonical NP-complete problem.

> I'd love for you to elaborate a bit more on this.

I'm not the original poster, but I'll have a go.

SAT is NP-Hard. In other words, literally any NP problem can be efficiently[0] converted to SAT, and any solution can then be efficiently[0] converted back to a solution of the original.

Example: Think of the problem of factoring integers. Someone gives you an integer to factor; with a little work you can create a SAT instance, solve that, and then read off the factorisation of the original integer. SAT is, in some real sense, at least as hard as integer factoring.

So there is a proof that SAT is at least as hard as every NP problem. That's what we call "NP-Hard".

Now someone has shown that they can solve SAT problems by using regex+backtrack. That means that every NP problem can be converted to SAT, then converted to regex+backtrack, solved, and the solution to the original read out from the result.

Thus regex+backtrack is at least as hard as every NP problem.

Now in the case of SAT, it itself is NP. So the combination of being NP and being NP-Hard is called "NP-Complete", or NPC. So SAT is an example of a problem that's NPC.

What has not been shown (I think) is that regex+backtrack is in NP. Showing that a solution to regex+backtrack implies a solution to SAT shows that regex+backtrack is NP-Hard.

If the linked article also shows that regex+backtrack is NP, then it is therefore NPC. But we can see that regex+backtrack is in NP, because verifying an alleged match is a polynomial time operation.

So regex-backtrack is NPC.

    +------------------------+
    |        NP-Hard         |
    |  +------------------+  |
    |  |   NP-Complete    |  |
    +--+------------------+--+
       |        NP        |
       |  +------------+  |
       |  | Polynomial |  |
       |  +------------+  |
       +------------------+
[0] For a technical definition of "efficient"


Am I missing something? I read it the other way: all CNF instances can be rewritten as regexp + backreferences, meaning re + backreferences are at least as general as SAT, not at most as. Meaning, they could be higher in the polynomial hierarchy.


As always, with all these things, there's a non-zero chance that I've mis-spoken myself somewhere. I'm going to "think out loud" on this so people can follow the thought processes.

> I read it the other way:

OK ...

> all CNF instances can be rewritten as regexp + backreferences,

By CNF you are referring to instances of the SAT problem. So yes, if you have an instance of the SAT problem, it can be re-written as an instance of regex+backtrack.

> meaning re + backreferences are at least as general as SAT,

Yes, the regex+backtrack problem is at least as hard as the SAT problem.

> ... not at most as.

Where did I say that? Here's a stripped-down summary of my comment:

* SAT is NP-Hard.

* Now someone has shown that they can solve SAT problems by using regex+backtrack. (That's the linked article)

* Thus regex+backtrack is at least as hard as every NP problem.

* SAT is NP, so it's NPC

* regex+backtrack can be seen to be in NP.

* So regex-backtrack is NPC.

So rewording that:

* The linked article shows regex+backtrack >= SAT.

* Independently we observe that checking an alleged regex+backtrack solution is a polynomial task, therefore regex+backtrack is in NP.

* SAT is in NPC, therefore regex+backtrack <= SAT (because regex+backtrack is in NP).

* Thus regex+backtrack = SAT (for some definition of "=")

So, I think you must have misread something ... I think everything I've written is correct as stands.


> we observe that checking an alleged regex+backtrack solution is a polynomial task

That's the point I missed at first. That's good news, because I was pretty sure Perl regexes were accidentally Turing-complete; I don't know why.


Cool.

This sort of thing can be really tough to follow because it's all deeply intertwungle. Glad I got it right.

Cheers!


I assume this result is already known in the literature?


This doesn't show a date but archive.org has snapshots dating back to 2001

https://perl.plover.com/NPC/


Proving that a problem is NP-complete requires proving that it is NP-hard and in NP. Reducing SAT to regex matching with backreferences does the former. The latter requires proving that any solution can be checked in polynomial time.

However, the author incorrectly states that only 3SAT problems are solvable. Proof that 3SAT is NP-hard does not exclude broader SAT.


While not a formal proof, it is fairly obvious that verifying a match is in P. Just fill in the captured groups from the answer in the regex and see if it corresponds to the input text.

Every SAT problem can be converted into a 3-SAT problem, so that's also not really an issue; they are both NPC.
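The standard conversion is mechanical enough to sketch (clauses as lists of signed integers, DIMACS-style; `next_var` is a fresh-variable counter I'm introducing for illustration):

```python
def to_3sat(clauses, next_var):
    # Split each wide clause (l1 v l2 v l3 v ... v lk) into
    # (l1 v l2 v x) and (NOT x v l3 v ... v lk), repeatedly,
    # using a fresh variable x each time. The result is
    # equisatisfiable with the original formula.
    out = []
    for clause in clauses:
        clause = list(clause)
        while len(clause) > 3:
            x = next_var
            next_var += 1
            out.append(clause[:2] + [x])
            clause = [-x] + clause[2:]
        out.append(clause)
    return out

print(to_3sat([[1, 2, 3, 4, 5]], next_var=6))
# [[1, 2, 6], [-6, 3, 7], [-7, 4, 5]]
```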


This reduction is really cool. I love reductions like this.

Is there a general consensus to use "regular expression" to refer to the actual regular ones and "regex" to refer to the non-regular variants?


I think Raku (née Perl 6) has been spearheading that distinction

https://docs.raku.org/language/regexes (see the intro paragraph)


This distinction was popularized by Jeffrey Friedl’s book _Mastering Regular Expressions_ (first published 1997), and it seems to be fairly commonly used now.


Too late to edit, but out of curiosity I re-implemented this to use the PCRE JIT (in PHP) to see what kind of speedup it would provide: https://gist.github.com/allanlw/69df509519335b88db886d48503a...

Timings for fred.cnf on my machine:

python: 0m53.744s

PHP (no PCRE JIT): (hits backtrack limit in 1m15.994s)

PHP (PCRE JIT): 0m20.109s


I wouldn't say so, but I use the term "regular language" if I mean the mathematical concept.


I don’t think it’s pedantic to say that a regular language is not the same thing as a regular expression. The difference between syntax and semantics is real and important.


(late reply) Right that's what I'm saying. Who said it was pedantic? :)


But a "regular language" is not the same as "regular expression" as mathematical concepts.


(late reply) Hm, you've neglected to do the obvious thing and explain the difference.

I don't think there's a difference, and if there is one, it's probably not relevant to programming. Whereas the one I'm highlighting is relevant to programming.


I don't think so. Usually you can tell from the context:

# math and/or computer science texts? It's the regular ones.

# pretty much elsewhere? It's the extended ones.

# threads about parsing HTML with regular expressions? People using both and insisting their version is the only correct one.


Yes, for at least ten years and likely much longer.

https://cstheory.stackexchange.com/q/448/362


Interesting, I wasn't aware of the history.

Personally I make the distinction, but I've noticed many many people do not, hence the question.


People without theoretical CS background probably think of regular expressions (or regexes for short) as just text-matching thingies that sometimes give you two problems and would look at you blankly if you mention the Chomsky hierarchy. When we don’t know there’s a distinction to be made, we go for economy of expression.


Wasn't it with PCRE that we started calling them "regexes"?


One of the cool features of SAT problems is that they always terminate (if you're patient enough). Aren't regex, especially with backreferences, Turing-complete though? If so, they could be caught in an infinite loop, meaning they are more general than the SAT problem.


Programming languages are more general than the problems they solve. (= feature, not bug)

Still, yes, you can mess up your "add 1 to the input" program and make it run infinitely.


Yeah, I meant it the other way: if those regexps are Turing-complete, not all of them have an equivalent CNF representation, contrary to what the article seems to state in its first paragraph (and title).

That being said, regexps were not initially meant to be "programming languages", so I'm not sure about the "feature, not bug" part. I'd rather have a notation that would let me solve, for instance, the "HTML tag matching" problem and would be guaranteed to always terminate, than one that also lets me implement Conway's game of life.


That "popcnt1" is also known as a 1-hot constraint.

https://en.m.wikipedia.org/wiki/One-hot


time python3 solver.py fred.cnf

Took 9 minutes and 10 seconds on a RPi 3 running Ubuntu 20.04, consuming 100% CPU and 1% of RAM (1024 MB).


This is actually amazing. My Python programs rarely run at 100% CPU, whereas C++ binaries usually are up there. I always thought Python's inefficiency caused the drop in CPU utilization.


I don't know how efficient it is, but I have always been able to peg all cores with the multiprocessing module. Even something useless like "x * x" is more than enough for 800%.
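Something like this sketch pegs every core, since each worker process has its own interpreter and GIL:

```python
import multiprocessing as mp

def burn(n):
    # Trivial CPU-bound work, akin to the "x * x" loop above.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    with mp.Pool() as pool:                    # one worker per core by default
        results = pool.map(burn, [10**6] * 8)  # eight chunks of work
    print(len(results))  # 8
```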


Python's regex implementation is probably not written in Python, so while it's trying to match, no Python code runs; it's all /C(++)?/.



