A friend of mine was tasked with writing a regular expression that could recognize a Uniform Resource Identifier (URI) and break apart its primary components. That's not a terribly difficult task, since there are plenty of sample regexes one can pluck from the Internet. But he asked my opinion, and I deferred to RFC 3986, which outlines the generic syntax for URIs. The RFC provides this example expression:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

Personally, though, I think that's rather sloppy. Were I to write the expression from scratch, I'd probably be more verbose and restrictive; I'd expressly match the patterns specified in the RFC's ABNF. For example, the RFC defines the scheme portion of a URI as:

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

The portion ^([^:/?#]+): will match characters that are not lexically permitted, such as an underscore or percent sign. Assuming the i (case-insensitive matching) modi...
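To make the difference concrete, here is a minimal Python sketch that compares the RFC's example expression with a scheme pattern written directly from the ABNF. The names `RFC_3986_URI`, `STRICT_SCHEME`, and `split_uri` are my own, chosen for illustration. The strict pattern spells out both letter cases instead of relying on a case-insensitive modifier, so treat it as one possible rendering of the ABNF rather than the final expression.

```python
import re

# Example expression from RFC 3986, Appendix B (copied verbatim).
RFC_3986_URI = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical stricter scheme pattern, written from the ABNF:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Both cases are listed explicitly, so no case-insensitive flag is needed.
STRICT_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def split_uri(uri):
    """Split a URI into its five components using the RFC's expression.

    Every part of the expression is optional, so it always matches;
    components that are absent come back as None.
    """
    m = RFC_3986_URI.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    # The RFC's own sample URI splits cleanly.
    print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))

    # The loose expression accepts an underscore in the scheme...
    print(split_uri("my_scheme://host/path")["scheme"])

    # ...while the ABNF-based pattern rejects it.
    print(STRICT_SCHEME.match("my_scheme://host/path"))
```

The group numbers used in `split_uri` (2, 4, 5, 7, and 9) follow the mapping the RFC gives alongside the expression.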
The Blog of Timothy Boronczyk - running my mouth off one blog post at a time