A friend of mine was tasked with writing a regular expression that could recognize a Uniform Resource Identifier (URI) and break apart its primary components. Not a terribly difficult task, since there are a lot of sample regexes one can just pluck from the Internet. But he asked my opinion, and I pointed him to RFC 3986, which outlines the generic syntax for URIs.
The RFC provides this example expression:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Personally, though, I think that's rather sloppy. Were I to write the expression from scratch myself, I'd probably be more verbose and restrictive; I'd expressly match the patterns specified in the RFC's ABNF. For example, the RFC defines the scheme portion of a URI as:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
The portion ^([^:/?#]+): will match characters that are not lexically permitted in a scheme, such as an underscore or a percent sign. Assuming the i (case-insensitive matching) modifier is used, something like ^([A-Z][A-Z\d\+\-\.]*): would be more correct.
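To make the difference concrete, here is a rough Python sketch that runs both patterns side by side. The names RFC_URI, STRICT_SCHEME and split_uri are my own for illustration, not anything from the RFC, and the re.ASCII flag is an assumption I've added so that \d and the case-insensitive [A-Z] stay within the ASCII ranges the ABNF describes.

```python
import re

# Generic expression from RFC 3986 (Appendix B); every part is optional,
# so it matches any input string.
RFC_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Stricter scheme pattern following the ABNF:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# re.ASCII keeps \d and case folding restricted to ASCII characters.
STRICT_SCHEME = re.compile(r"^([A-Z][A-Z\d+\-.]*):", re.IGNORECASE | re.ASCII)


def split_uri(uri):
    """Split a URI into its five primary components with the RFC expression."""
    m = RFC_URI.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def has_valid_scheme(uri):
    """Return True if the URI starts with a scheme the ABNF permits."""
    return STRICT_SCHEME.match(uri) is not None
```

With these definitions, split_uri("my_scheme:foo") happily reports "my_scheme" as the scheme, while has_valid_scheme("my_scheme:foo") returns False because of the underscore. Components that are absent from the input come back as None.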
This highlights one of the joys of writing regexes: it is one of the few areas of computing in which you don't have to be completely accurate to achieve the desired results. A "close-enough" match will suffice most of the time.
While I'm on the topic of regexes, check out gskinner.com's RegExr: Online Regular Expression Testing Tool at www.gskinner.com/RegExr/ if you haven't already. It’s a great utility!