Yes. Modify Links does support regular expressions. Below is an introduction to regular expressions and how to use them.
Introduction to Regular Expressions
In the context of the “normal world” (as opposed to computerese) the term “Regular Expression” has very little relevance to the meaning of the word “regular” as used in normal English conversation. It is just a name that had some meaning within the academic computer society several decades ago, and has been carried over from then.
Regular expressions are a very powerful tool for finding and replacing text strings that have similar characteristics. Using regular expressions, you can literally pick a text string apart into as many pieces as you want, then reassemble them in any order you want, while adding any new pieces you need in the process.
- A regular expression is made up of special characters and literals.
- A “special character” is defined as “a single character that has special meaning when used in a regular expression.” The period (.) is a special character.
- A “literal” is defined as “anything that is not a special character.” The letter ‘A’ would be a literal.
- The verb “to match” is used extensively in this chapter. It means, “to accept as equivalent to.”
Special Characters
Special characters are single characters that have special meaning when used in a regular expression.
Period
A period (“.”) matches any character except for the new line character (\n). Thus “a..” matches “act”, “art”, “ash”, “arc”, “ate” and so on.
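The behavior of the period can be tried out in any regular expression engine. The short sketch below uses Python’s re module as a stand-in (an assumption made for illustration; it is not the engine inside Modify Links) and tests each word as a whole.

```python
import re

# "a.." : a literal "a" followed by any two characters except a newline.
pattern = re.compile(r"a..")

words = ["act", "art", "ash", "arc", "ate", "at", "a\nb"]

# fullmatch() requires the whole word to fit the pattern.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

“at” is rejected because it is too short, and “a\nb” is rejected because the period does not match the new line character.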
Asterisk
An asterisk (“*”) matches any number of repetitions of the immediately preceding character or pattern, including zero repetitions. For example, “tel*” matches “te” followed by any number of l’s or no l’s. Thus, it would find matches within the strings “the tel. number”, “to tell the truth” and “answer the telephone”. Because zero l’s are allowed, it would also find a match within “ten”.
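To see exactly which text the asterisk picks up, here is a minimal sketch using Python’s re module (a stand-in chosen for illustration, not the Modify Links engine). It searches each sample string for “tel*” and records the text that was matched.

```python
import re

# "tel*" : "te" followed by zero or more "l" characters.
pattern = re.compile(r"tel*")

samples = [
    "the tel. number",
    "to tell the truth",
    "answer the telephone",
    "ten",  # no "l" at all, still a match
]

# search() finds the first place in the string where the pattern fits.
found = [pattern.search(s).group(0) for s in samples]
print(found)
```

Note that the match in “to tell the truth” takes both l’s, and the match in “ten” takes none.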
Backslash
A backslash (“\”) reverses the “special” meaning of a character. For example, “\.” matches a period, “\*” matches an asterisk, and “\\” matches a backslash. There are some cases where the backslash changes the meaning of a character from literal to special. These cases are defined later. If used before a character that is not a special character, or that would have no meaning as a special character, the backslash is simply ignored.
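The effect of the backslash can be sketched as follows, using Python’s re module as a stand-in (an assumption for illustration only). One difference worth knowing: Python reports an error for a backslash placed before most ordinary letters instead of ignoring it, so that part of the description above is not shown here.

```python
import re

# A sample text containing one period, one asterisk and one backslash.
text = "3.5 * 2 \\ 1"

periods = re.findall(r"\.", text)      # escaped period: a literal "."
asterisks = re.findall(r"\*", text)    # escaped asterisk: a literal "*"
backslashes = re.findall(r"\\", text)  # escaped backslash: a literal "\"

# Without the backslash, the period matches every character.
unescaped = re.findall(r".", "a.b")

print(periods, asterisks, backslashes, unescaped)
```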
Caret
A caret (“^”) matches the beginning of a string. “^Once upon a time” will find a match within “Once upon a time there were three bears”, but not within “The story began once upon a time.”
With a single exception, if the caret occurs anywhere except the beginning of a regular expression, it is interpreted as a literal. (The single exception to this is defined later.)
To match a line starting with a caret sign, you must specify a “special character” caret followed by a “literal” caret. Thus “^\^.” would match the first two characters of any string whose first character was a caret.
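The two caret examples above can be sketched with Python’s re module (a stand-in used for illustration, not the Modify Links engine). Only the start-of-string behavior is shown, because Python does not treat a caret in the middle of an expression as a literal the way this chapter describes.

```python
import re

starts = re.compile(r"^Once upon a time")

at_start = starts.search("Once upon a time there were three bears") is not None
in_middle = starts.search("The story began once upon a time.") is not None

# "^\^." : start of string, then a literal caret, then any one character.
caret_first = re.compile(r"^\^.")

first_two = caret_first.search("^abc").group(0)
caret_later = caret_first.search("a^bc")

print(at_start, in_middle, first_two, caret_later)
```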
Dollar Sign
A dollar sign (“$”) matches the end of a string. “the end$” will find a match within “He reached the end”, but not within “He reached the end of the road.” Note that it would not find a match within “He reached the end.” because of the period at the end of the string.
If a dollar sign occurs anywhere but at the end of a regular expression, it is interpreted as a literal character. Thus “$100” will only match “$100”, treating the dollar sign as a literal.
To match a string ending with a dollar sign, you must specify a “literal” dollar sign, followed by a “special character” dollar sign. “.\$$” would match the last two characters of any string whose last character was a dollar sign.
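Here is a similar sketch for the dollar sign, again using Python’s re module purely as a stand-in. The “$100” case is left out on purpose: Python always treats an unescaped dollar sign as special, so it would not behave as described above.

```python
import re

ends = re.compile(r"the end$")

results = [
    ends.search(s) is not None
    for s in [
        "He reached the end",
        "He reached the end of the road.",
        "He reached the end.",  # the final period prevents a match
    ]
]

# ".\$$" : any one character, a literal dollar sign, then end of string.
dollar_last = re.compile(r".\$$")

last_two = dollar_last.search("Total: 5$").group(0)
dollar_first = dollar_last.search("$5")

print(results, last_two, dollar_first)
```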
Brackets
Brackets (“[” and “]”) are used to delimit a “set” of characters, any one of which may match a single character in the search string. Thus, “[Aa]” will match any occurrence of the letter “a”, either upper or lower case, and “[0123456789]” will match any single digit.
When the caret is used as the first character after a left square bracket, it reverses the meaning of the search. “[^0123456789]” will match any character except a digit. And “^[^0123456789]” will match the first character of a string as long as it is not a digit. Note that in this case, the caret is used with both its special meanings.
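The bracket examples can be checked with a short sketch in Python’s re module (used here only as a convenient stand-in for the Modify Links engine).

```python
import re

# A set matches exactly one character out of the listed choices.
either_a = re.findall(r"[Aa]", "A banana")

# A caret right after "[" reverses the set: anything except a digit.
non_digits = re.findall(r"[^0123456789]", "a1b22c")

# Both meanings of the caret together: first character, and not a digit.
first_non_digit = re.compile(r"^[^0123456789]")
letter_first = first_non_digit.search("x42").group(0)
digit_first = first_non_digit.search("42x")

print(either_a, non_digits, letter_first, digit_first)
```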
Hyphens
The hyphen (“-”) has special meaning when used within brackets. It indicates a range of characters and must have a character on each side of it to be valid. “[A-Z]” will match any upper case letter, “[a-z]” will match any lower case letter, and “[0-9]” will match any digit. “[a-zA-Z]” will match any letter, either upper or lower case.
When found outside of brackets, the hyphen is interpreted as a literal character.
To include the hyphen in a set of characters to search for, precede it by a backslash. Thus “[+\-]” matches either a plus or minus sign, whereas “[+-]” would be an invalid regular expression.
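Ranges and the escaped hyphen can be sketched as follows with Python’s re module (a stand-in for illustration). Be aware that Python is more forgiving than the rules above: it accepts “[+-]” and treats the trailing hyphen as a literal, so only the escaped form is shown.

```python
import re

upper = re.findall(r"[A-Z]", "Modify Links 2")   # upper case letters only
letters = re.findall(r"[a-zA-Z]", "R2-D2")       # any letter
digits = re.findall(r"[0-9]", "R2-D2")           # any digit

# Outside of brackets the hyphen is an ordinary literal.
plain_hyphen = re.findall(r"-", "R2-D2")

# Inside brackets, a backslash makes the hyphen a member of the set.
signs = re.findall(r"[+\-]", "+5 -3 x7")

print(upper, letters, digits, plain_hyphen, signs)
```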
Repetitions
The repetition codes are used to allow the same pattern to be matched multiple times. The symbols used are “{” and “}”.
The strict technical definition of a regular expression pattern is “a sequence of one or more special characters and/or literals that will match zero or more repetitions of a single character or set of characters”. Now that’s quite a mouthful, so it’s easiest to think of it from the other direction. Essentially it is “a matched string of characters.”
The formats are:
“p{x}” matches exactly x repetitions of pattern p.
“p{x,}” matches at least x repetitions of pattern p.
“p{x,y}” matches any number of repetitions of pattern p, from x to y, inclusive.
x and y must be non-negative integers less than 256.
Whenever a choice exists, as many occurrences of the pattern as possible will be matched.
Probably the most common usage of the repetition codes is in matching numbers, as in “[0-9]{1,}”, which will match “1”, “33” and “23496187” alike.
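The three repetition formats, and the “as many as possible” rule, can be sketched with Python’s re module (used as a stand-in; the limit of 256 mentioned above is specific to this product and is not something Python enforces).

```python
import re

def whole(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly = [whole(r"a{3}", "aaa"), whole(r"a{3}", "aa")]
at_least = [whole(r"a{2,}", "aaaaa"), whole(r"a{2,}", "a")]
between = [whole(r"a{2,4}", "aaaa"), whole(r"a{2,4}", "aaaaa")]

# When a choice exists, the longest possible run is taken.
longest_run = re.search(r"a{2,4}", "aaaaaa").group(0)
number = re.search(r"[0-9]{1,}", "order 23496187 shipped").group(0)

print(exactly, at_least, between, longest_run, number)
```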
Regular Expression Substitution Example
This is a very simple, straight substitution. This same thing can be done using Wildcard Substitution.
Search string: “Ave\.”
Replacement string: “Avenue”
Input string: “123 4th Ave.”
Result string: “123 4th Avenue”
The use of “\.” indicates that the period should be taken literally, as opposed to matching any single character. Without the backslash (as “Ave.”), it would match “Ave Maria” and change it to “AvenueMaria.” (Note that the space between “Ave” and “Maria” is matched by the period and is therefore replaced along with the “Ave”.)
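The same substitution, along with the “Ave Maria” pitfall, can be reproduced in a few lines. This sketch uses Python’s re.sub function as a stand-in for the Modify Links search and replace.

```python
import re

# Escaped period: only a real "Ave." is replaced.
fixed = re.sub(r"Ave\.", "Avenue", "123 4th Ave.")

# Unescaped period: the space after "Ave" is matched and swallowed.
pitfall = re.sub(r"Ave.", "Avenue", "Ave Maria")

# With the backslash in place, "Ave Maria" is left alone.
untouched = re.sub(r"Ave\.", "Avenue", "Ave Maria")

print(fixed)
print(pitfall)
print(untouched)
```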
Segments
One of the more powerful aspects of using regular expressions is the ability to record sections of an input string that matched particular sections of the regular expression.
The regular expression format is extended through the use of “segments” via the “(” and “)” operators which may be placed around any section of the regular expression.
( Denotes the beginning of a segment.
) Denotes the end of a segment.
There can be multiple segments per regular expression. They can be referenced later by number as “$#”: $1, $2, $3 and so on.
Use of the caret (^) and dollar sign ($) primitives to match line beginnings and endings respectively must occur outside of any segments. The following expression treats both the caret and dollar sign as literal characters, so it will not match the string “This expression”, but will match “^This expression$”:
“(^This expression$)”
The following expression treats both the caret and dollar sign as line anchors, so it will match the string “This expression”, but not “Is This expression correct?”:
“^(This expression)$”
Examples
Given a filename of “C:\Folder\File.gif”, the regular expression search string of “(.*)\.gif” and the replacement string of “$1.jpg”, the resulting filename would be “C:\Folder\File.jpg”. The #1 segment, indicated by “$1” in the replacement string, is all the text before “.gif”.
Given the input text of “I like cats. I love dogs.”, the regular expression string of “(.*)cats\. (.*)dogs\.” and the replacement string of “$2fish. $1mice.”, the output would be “I love fish. I like mice.” The #1 segment is all the text that comes before “cats”, and the #2 segment is all the text that comes after “cats.” and before “dogs.”. These two segments are reversed in order in the output string and merged with “fish” and “mice”.
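Both segment examples can be sketched in Python’s re module, with two caveats. First, Python writes segment references as “\g<1>” instead of “$1”, so the sketch includes a small helper, dollar_to_python, which is a hypothetical name invented for this illustration. Second, the first search string is written as “(.*)\.gif”, with the segment closed before the escaped period.

```python
import re

def dollar_to_python(replacement):
    # Hypothetical helper: rewrite $1, $2, ... in the form Python expects.
    # Literal backslashes are doubled first so re.sub leaves them alone.
    escaped = replacement.replace("\\", "\\\\")
    return re.sub(r"\$([0-9]+)", r"\\g<\1>", escaped)

def substitute(search, replacement, text):
    return re.sub(search, dollar_to_python(replacement), text)

# Segment 1 is everything before ".gif".
renamed = substitute(r"(.*)\.gif", "$1.jpg", r"C:\Folder\File.gif")

# Segments 1 and 2 swap places in the output.
swapped = substitute(
    r"(.*)cats\. (.*)dogs\.",
    "$2fish. $1mice.",
    "I like cats. I love dogs.",
)

print(renamed)
print(swapped)
```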
Tips and Tricks
The wildcard character “*” can hand out some surprises if one isn’t careful. The two things to remember are: 1) it does not work alone; it always applies to the character or pattern immediately preceding it, and 2) it will match regardless of whether the preceding pattern exists. Other than a repetition code with a minimum of zero, such as “{0,}”, this is the only code that will actually match the absence of something.
Likewise the “[” and “]” codes can be tricky. The thing to remember is that the entire pattern, from the opening through the closing brackets, represents just one character.
Combining those two tips, “[ ]*” (with a space between the brackets) is a useful construct when you’re not sure if there will be a space between two characters in a string, such as “100ft” and “100 ft”. You can also think of this as matching “zero or more spaces”.
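As a quick illustration of the “zero or more spaces” idea, the sketch below builds the pattern “100[ ]*ft” (a sample pattern made up for this illustration) and tries it with Python’s re module, which stands in for the Modify Links engine.

```python
import re

# "[ ]*" between "100" and "ft" allows any number of spaces, or none.
pattern = re.compile(r"100[ ]*ft")

samples = ["100ft", "100 ft", "100   ft", "100-ft"]
accepted = [s for s in samples if pattern.fullmatch(s)]

print(accepted)
```

The last sample is rejected because a hyphen is not a space.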
Numbers
Matching numbers can be frustrating because of the great variation in formats. So here are some examples of some common formats:
“[0-9]{1,}” — match whole numbers
This will match any contiguous sequence of digits.
“[0-9][0-9,]{3,}” — match whole numbers of 1,000 or more, with or without comma separators
This will match “1,000”, “1000” and “999,999,999,999”
It will not match “10”, “999.” or “100”
“[0-9][0-9,]{1,}\.[0-9]{1,}” — match only numbers having decimal fractions and at least two digits before the decimal point, with or without comma separators
This will match “10.0”, “100.001” and “1,234.567”
It will not match “123”, “123.” or “.001”
“[0-9]{1,}[ ][0-9]{1,}/[0-9]{1,}” — match only mixed numbers, without comma separators
This will match “3 1/2”, “8 7/16” and “2271 17/22”
It will not match “1,000 1/2” or “3/4”
“[0-9]{1,}˚[ ]*[0-9]{1,}΄[ ]*[0-9]{1,}˝” — match degree/minute/second bearings
This will match “10˚ 17΄ 33˝” (with spaces) and “10˚17΄33˝” (without spaces)
It will not match “10˚ 17΄” or “17΄ 33˝”.
“[\+\-][0-9]{1,}” — match only whole numbers, without comma separators, that are directly preceded by a plus or minus sign
This will match “-1”, “+1000” and “-100000”
It will not match “-1,000” or “1000”
“[xX]*[0-9A-Fa-f]{2,}” — match hexadecimal numbers of two or more digits optionally preceded by an upper or lower case “X.”
This will match “xF0”, “x00ff”, “8010”, “abc” and “0123456789aBcDeF”
It will not match “F”, “0” or “xyz”
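The number patterns above can be checked against their own examples with a short sketch. It uses Python’s re module as a stand-in and tests each example as a whole string, which is an assumption about how the examples are meant to be read. A search within a longer text would behave differently, for instance finding “-1” inside “-1,000”.

```python
import re

# pattern: (examples listed as matching, examples listed as not matching)
CASES = {
    r"[0-9][0-9,]{3,}": (
        ["1,000", "1000", "999,999,999,999"], ["10", "999.", "100"]),
    r"[0-9][0-9,]{1,}\.[0-9]{1,}": (
        ["10.0", "100.001", "1,234.567"], ["123", "123.", ".001"]),
    r"[0-9]{1,}[ ][0-9]{1,}/[0-9]{1,}": (
        ["3 1/2", "8 7/16", "2271 17/22"], ["1,000 1/2", "3/4"]),
    r"[0-9]{1,}˚[ ]*[0-9]{1,}΄[ ]*[0-9]{1,}˝": (
        ["10˚ 17΄ 33˝", "10˚17΄33˝"], ["10˚ 17΄", "17΄ 33˝"]),
    r"[\+\-][0-9]{1,}": (
        ["-1", "+1000", "-100000"], ["-1,000", "1000"]),
    r"[xX]*[0-9A-Fa-f]{2,}": (
        ["xF0", "x00ff", "8010", "abc", "0123456789aBcDeF"],
        ["F", "0", "xyz"]),
}

def whole(pattern, text):
    """Return True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    pattern: ([whole(pattern, s) for s in yes], [whole(pattern, s) for s in no])
    for pattern, (yes, no) in CASES.items()
}

for pattern, (yes, no) in results.items():
    print(pattern, "accepts all:", all(yes), "rejects all:", not any(no))
```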
Although it might seem tempting to use “[0-9,]{1,}” to match comma separated numbers, it is actually not of much use because it will match any comma, plus other things that are not valid numbers:
“ , “
“1,2,3,4,5”
“101,,,101,,,101,,,”
“,,,,,,,,,,”
“,1”
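The sketch below confirms that “[0-9,]{1,}” finds something in every one of these strings. It also tries a stricter pattern, “[0-9]{1,3}(,[0-9]{3})*”, which is only a suggestion and not part of this chapter: it accepts comma-grouped numbers but deliberately rejects plain numbers such as “1000”. Python’s re module is used as a stand-in for the Modify Links engine.

```python
import re

loose = re.compile(r"[0-9,]{1,}")

bad_inputs = [" , ", "1,2,3,4,5", "101,,,101,,,101,,,", ",,,,,,,,,,", ",1"]

# What the loose pattern picks out of each invalid "number".
found = [loose.search(s).group(0) for s in bad_inputs]

# Suggested alternative (an assumption): up to three digits, then any
# number of groups made of a comma followed by exactly three digits.
strict = re.compile(r"[0-9]{1,3}(,[0-9]{3})*")

strict_bad = [strict.fullmatch(s.strip()) is not None for s in bad_inputs]
strict_good = [
    strict.fullmatch(s) is not None
    for s in ["1,000", "999,999,999,999", "42"]
]

print(found)
print(strict_bad, strict_good)
```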