The common case of only two pairs of quotes is indeed regular, but if you want to support either all Unicode quotes (about 60 pairs of them) or C++11 raw string literals `R"delim(...)delim"` (intrinsically not regular) you are out of luck.
For any finite set of quotation character pairs you can get away with a strategy like `(left_char1)[^right_char1]*(right_char1)|(left_char2)[^right_char2]*(right_char2)|...`. Escape characters aren't much harder to accommodate.
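To make that concrete, the alternation can be generated from a table of pairs rather than written by hand. A Python sketch with a hypothetical four-pair table (the full Unicode set would just be a longer list):

```python
import re

# Hypothetical short table of quote pairs; the full Unicode set would
# just be a longer list of (left, right) tuples.
PAIRS = [('"', '"'), ("'", "'"), ('“', '”'), ('‘', '’')]

# One alternative per pair: left quote, a run of non-right chars, right quote.
pattern = re.compile('|'.join(
    f'{re.escape(l)}[^{re.escape(r)}]*{re.escape(r)}' for l, r in PAIRS
))

print(pattern.findall('say "hi" and “bye”'))  # ['"hi"', '“bye”']
```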
Absolutely. I don't for a moment think they're the right tool for the job.
They are fairly powerful in terms of what they're capable of parsing, however (not enough for an arbitrary HTML document, but enough to handle the hairier situations in this thread that people thought they couldn't handle). That means a regular-expression generator can cover all of those situations as well, and potentially be much more readable.
If I found myself writing code like this I'd still want to reach for a better parsing technology, but you can use other languages to add abstractions to regex. Here's a Python3.6+ example assuming any desired backslashes have already been applied:
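A minimal sketch of that kind of abstraction (not necessarily the code the parent had in mind), using f-strings to assemble an escape-aware quoted-string pattern from smaller pieces:

```python
import re

def quoted(q: str) -> str:
    # A q-quoted string: the opening quote, then any mix of characters
    # that are neither the quote nor a backslash, or backslash-escaped
    # characters, then the closing quote.
    q = re.escape(q)
    return rf'{q}(?:[^{q}\\]|\\.)*{q}'

STRING = re.compile(quoted('"') + '|' + quoted("'"))

text = r'say "a \"b\" c" and ' + "'d'"
print(STRING.findall(text))
```

The point is that the abstraction lives in ordinary Python, while the final pattern stays a plain regex.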
Won't work if you're already in a string, or if there are escaped quotes in the string. Also won't work if you have two or more double quoted strings that both contain an apostrophe.
Escaping isn't an intrinsic property of all quoted strings (e.g. single-quoted strings in bash), but even so one can handle escaped quotes without backreferences, by matching anything that's not a quote or a backslash, _or_ any escaped character:
/"([^"\\]|\\.)*"/
Now double that up with a single quote version if you wish.
What you can't match without backreferences, however, is strings with customisable terminators, e.g. the behaviour in sed that whatever character you use after `s` is the regex terminator (it doesn't have to be `/`), or raw strings in C++.
Backreferences don't really help with those problems.
> Won't work if you're already in a string
This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.
> or if there are escaped quotes in the string.
Solvable:
'([^'\\]|\\.)*'|"([^"\\]|\\.)*"
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.
Interesting, thanks for pointing out a use case. But I don't think backreferences will help with that, it needs to be parsed by something more powerful than a regex.
And that example reminds me that Bash can do something similar:
The Rakudo implementation actually uses Raku regexes to parse Raku. To be fair though it is a lot easier to do that with the redesigned regexes that Raku has.
Basically you can use backreferences for that if you also allow the regex to be recursive.
my $regex = /
    :ratchet
    $<q> = (<["']>)    # the beginning quote
    {}:my $q = ~$<q>;  # put it into a more normal lexical var

    # capture between " and {
    $<l> = ( [ <!before $q> <-[{}]> ]* )
    [
        [
            :sigspace
            「{」
            <self=&?BLOCK>?  # recurse
            「}」
        ]
        {$q = ~$<q>}

        # capture between } and "
        $<r> = ( [ <!before $q> <-[{}]> ]* )
    ]?
    "$q"  # match the end quote

    # pass the combined string parts upwards
    { make ($<l> // '') ~ ($<self>.ast // '') ~ ($<r> // '') }
/;
「'a{ "b{ "c{ "d{ 'e{ "f" }g' }h" }i" }j" }k'」 ~~ /^ <r=$regex> $ { make $<r>.ast }/;
say $/.ast;
# abcdefghijk
Note that `Regex` is a subtype of `Block`. That is why `&?BLOCK` can be used as a reference to the regex itself.
`<foo=bar>` is a way to call `bar`, but also save it under the name of `foo`.
`$<foo> = …` is a way to capture `…` and save it under the name of `foo`.
---
It is a lot nicer and modular when you use regexes as part of a grammar:
# use Grammar::Tracer;
grammar String::Grammar {
    token TOP { <strings> }

    rule strings {
        # at least one string
        # if there are more than one they are separated by ~
        <string> + % 「~」
    }

    token string {
        $<q> = <["']>
        # set a dynamic variable to the quote character
        {}:my $*quote = ~$<q>;
        <string-part>*
        "$<q>"
    }

    # multiple tokens that act like one
    # which is nicer than using |
    proto token string-part {*}
    multi token string-part:<non> {
        [ <-[{}]> <!after $*quote> ]+
    }
    multi token string-part:<block> {
        <block>
    }

    rule block {
        「{」 ~ 「}」 <strings>?
    }
}

class String::Actions {
    method TOP ($/) { make $<strings>.ast }
    method strings ($/) { make [~] @<string>».ast }
    method string ($/) { make [~] @<string-part>».ast }
    method block ($/) { make $<strings>.ast }
    method string-part:<non> ($/) { make ~$/ }
    method string-part:<block> ($/) { make $<block>.ast }
}

say String::Grammar.parse(
    「"a{ "b{ "c{ "d{ "e{ "f" }g" ~ "zz" }h" }i" }j" }k"」,
    :actions( String::Actions ),
).ast;
# abcdefgzzhijk
A `token` is just a `regex` with `:ratchet` mode turned on. (prevents backtracking)
A `rule` is just a `token` with `:sigspace` also turned on. (makes it easier to deal with optional whitespace.)
Every instance of `<foo>` is basically a method call.
`make` is about generating an `.ast` to pass up and out of the parse.
In this case the only thing the actions class does is return what would be the resulting string if it were compiled in Raku.
Re: already in a string, one of the primary uses of regex is to search from a point in a text editor. So the cursor is in a string and you want to find the next string. Regex won't work on its own; you generally need more semantic information to differentiate opening and closing quotes (unless you can use local context from that particular language to infer it).
But more broadly, any situation where you search from a non-zero index has this problem.
I'm surprised your example works in Python. Is that a property of Python's parser, or all regex matchers?
> Regex won't work on its own, you generally need more semantic information
Yeah, I agree. My point was that regex won't work regardless of whether you have backreferences or not. So backreferences won't help.
> But more broadly, any situation where you search from a non-zero index has this problem.
I'm not sure I understand that. A lot of regex libraries let you specify a start index. It won't take into account data from before the start index though (regex doesn't really do that, regardless of backreferences). If your regex library doesn't support passing in a start index, you can just take a substring starting at that index, then search the substring.
I don't think Python is really special. Python's findall() is just a convenience function that does a loop finding a match, then finding another match that starts after the first match, etc. Most languages provide a way to find the end point of the most recent match, and then you can just write the loop yourself to start the next search at that point.
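That loop is easy to sketch. A hand-rolled stand-in for findall (not Python's actual implementation), restarting each search at the end of the previous match:

```python
import re

def find_all(pattern, text):
    # Repeatedly search, restarting at the end of the previous match.
    rx = re.compile(pattern)
    out, pos = [], 0
    while True:
        m = rx.search(text, pos)
        if m is None:
            return out
        out.append(m.group(0))
        # Advance past the match; guard against zero-width matches looping.
        pos = m.end() if m.end() > m.start() else m.end() + 1

print(find_all(r'"[^"]*"', 'a "b" c "d"'))  # ['"b"', '"d"']
```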
> This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.
query = "select * from table where name like \"%foo\""
Interesting. Although in that situation I think it would be easier to find the outer string, then unescape it, then find the inner string.
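A sketch of that three-step approach in Python (the unescaping is deliberately minimal, handling only \" and \\):

```python
import re

# The SQL query from the parent comment, as it appears in source code:
src = r'query = "select * from table where name like \"%foo\""'

# 1. Find the outer double-quoted string with an escape-aware pattern.
outer = re.search(r'"(?:[^"\\]|\\.)*"', src).group(0)

# 2. Unescape it (minimal sketch: only \" and \\ are handled).
unescaped = outer[1:-1].replace(r'\"', '"').replace(r'\\', '\\')

# 3. Find the inner string in the unescaped text.
inner = re.search(r'"[^"]*"', unescaped).group(0)
print(inner)  # "%foo"
```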
Although if you want to do it with pure regex, it can be done without backreferences too, although it would be exponentially large as you get more and more levels of nesting, whereas with backreferences I think it would only get quadratically large.
As some other replies pointed out, there are straightforward modifications to handle all those scenarios if those are your requirements instead. A place where regex _does_ fail is in arbitrarily nested string interpolations (the key being _arbitrary_ nesting because with enough time anyone can come up with a convoluted enough regex to handle a bounded degree of recursion).
This line is super smart, and yet, even though I should know a lot about regexes and NP-completeness, my head feels dizzy as I try to make full sense of it. A sign I'm getting old or dumb, perhaps :(
Jokes aside: I'd love for you to elaborate a bit more on this. I'm pretty sure I would benefit a lot from a more expanded, "dumber" explanation.
>> That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the index NP-complete problem.
> I'd love for you to elaborate a bit more on this.
I'm not the original poster, but I'll have a go.
SAT is NP-Hard. In other words, literally any NP problem can be efficiently[0] converted to SAT, and any solution can then be efficiently[0] converted back to a solution of the original.
Example: Think of the problem of factoring integers. Someone gives you an integer to factor; with a little work you can create a SAT instance, solve that, and then read off the factorisation of the original integer. SAT is, in some real sense, at least as hard as INT.
So there is a proof that SAT is at least as hard as every NP problem. That's what we call "NP-Hard".
Now someone has shown that they can solve SAT problems by using regex+backtrack. That means every NP problem can be converted to SAT, then converted to regex+backtrack, solved, and the solution to the original read out from the result.
Thus regex+backtrack is at least as hard as every NP problem.
Now SAT is itself in NP. The combination of being in NP and being NP-Hard is called "NP-Complete", or NPC. So SAT is an example of a problem that's NPC.
What has not been shown (I think) is that regex+backtrack is in NP. Showing that a solution to regex+backtrack implies a solution to SAT shows that regex+backtrack is NP-Hard.
If the linked article also shows that regex+backtrack is NP, then it is therefore NPC. But we can see that regex+backtrack is in NP, because verifying an alleged match is a polynomial time operation.
Am I missing something? I read it the other way: all CNF instances can be rewritten as regexp + backreferences, meaning re + backreferences are at least as general as SAT, not at most as. Meaning they could be higher in the polynomial hierarchy.
As always, with all these things, there's a non-zero chance that I've mis-spoken myself somewhere. I'm going to "think out loud" on this so people can follow the thought processes.
> I read it the other way:
OK ...
> all CNF instances can be rewritten as regexp + backreferences,
By CNF you are referring to instances of the SAT problem. So yes, if you have an instance of the SAT problem, it can be re-written as an instance of regex+backtrack.
> meaning re + backreferences are at least as general as SAT,
Yes, the regex+backtrack problem is at least as hard as the SAT problem.
> ... not at most as.
Where did I say that? Here's a stripped-down summary of my comment:
* SAT is NP-Hard.
* Now someone has shown that they can solve SAT problems by using regex+backtrack. (That's the linked article)
* Thus regex+backtrack is at least as hard as every NP problem.
* SAT is NP, so it's NPC
* regex+backtrack can be seen to be in NP.
* So regex-backtrack is NPC.
So rewording that:
* The linked article shows regex+backtrack >= SAT.
* Independently we observe that checking an alleged regex+backtrack solution is a polynomial task, therefore regex+backtrack is in NP.
* SAT is in NPC, therefore regex+backtrack <= SAT (because regex+backtrack is in NP).
* Thus regex+backtrack = SAT (for some definition of "=")
So, I think you must have misread something ... I think everything I've written is correct as stands.
Proving that a problem is NP-complete requires proving that it is NP-hard and in NP. Reducing SAT to regex matching with backreferences does the former. The latter requires proving that any solution can be checked in polynomial time.
However, the author incorrectly states that only 3SAT problems are solvable. Proof that 3SAT is NP-hard does not exclude broader SAT.
While not a formal proof, it is fairly obvious that verifying a match is in P. Just fill in the captured groups from the answer in the regex and see if it corresponds to the input text.
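A toy illustration of that verification step (not a formal proof): given claimed values for the capture groups, substitute them for the backreferences and compare with the input, which takes linear time.

```python
# Alleged captures for the backreference pattern (a+)(b+)\2\1.
text = 'aabbbbbbaa'
claimed = ('aa', 'bbb')

# Substitute the claimed captures into the pattern's shape and compare:
# group1, group2, \2, \1 in order.
reconstruction = claimed[0] + claimed[1] + claimed[1] + claimed[0]
print(reconstruction == text)  # True: the witness checks out
```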
Every SAT problem can be converted into a 3-SAT problem, so that's also not really an issue; they are both NPC.
I don’t think it’s pedantic to say that a regular language is not the same thing as a regular expression. The difference between syntax and semantics is real and important.
(late reply) Hm, you've neglected to do the obvious thing and explain the difference.
I don't think there's a difference, and if there is one, it's probably not relevant to programming. Whereas the one I'm highlighting is relevant to programming.
People without theoretical CS background probably think of regular expressions (or regexes for short) as just text-matching thingies that sometimes give you two problems and would look at you blankly if you mention the Chomsky hierarchy. When we don’t know there’s a distinction to be made, we go for economy of expression.
One of the cool features of SAT problems is that they always terminate (if you're patient enough). Aren't regex, especially with backreferences, Turing-complete though? If so, they could be caught in an infinite loop, meaning they are more general than the SAT problem.
Yeah, I meant it the other way: if those regexps are Turing-complete, not all of them have an equivalent CNF representation, contrary to what the article seems to state in its first paragraph (and title).
That being said, regexps were not initially meant to be "programming languages", so I'm not sure about the "feature, not bug" part. I'd rather have a notation that would let me solve, for instance, the "HTML tag matching" problem and would be guaranteed to always terminate, than one that also lets me implement Conway's game of life.
This is actually amazing. My Python programs rarely run at 100% CPU, whereas C++ binaries are usually up there. I always thought Python's inefficiency caused the drop in CPU utilization.
I don't know how efficient it is, but I have always been able to peg all cores with the multiprocessing module. Even something useless like "x * x" is more than enough for 800%.
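A minimal sketch of that kind of thing with the stdlib multiprocessing module (`square` is just a stand-in workload; any CPU-bound function works):

```python
from multiprocessing import Pool

def square(x):  # stand-in workload; anything CPU-bound works
    return x * x

if __name__ == '__main__':
    # Pool defaults to one worker process per core; with enough work
    # queued, every core gets busy despite the GIL.
    with Pool() as pool:
        total = sum(pool.map(square, range(1000)))
    print(total)
```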
You don't need backreferences for that:
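Assuming "that" refers to matching either single- or double-quoted strings (as discussed upthread), a plain alternation with no backreferences suffices. A Python sketch:

```python
import re

# One alternative per quote style; no backreferences anywhere.
QUOTED = re.compile("'[^']*'" + '|' + '"[^"]*"')

print(QUOTED.findall('mix of "double" and \'single\' quotes'))
```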