Brief Introduction to Regular Expressions

This is a web version, kept for reference, of a .docx file originally produced for my Mashcat 2012 session, How Big Is My Book. It has been resurrected to form the manual for Meret, a regular expressions tutorial based on MarcEdit examples.

Literals

Characters as you type them. E.g. i will look for a letter “i”. ii will look for two letter “i”s in a row. Eldorado will look for the exact string “Eldorado”, and 1234s will look for “1234s”.

Types of Character

There are a number of ways of looking for specific types of character:

. looks for any single character. It could be a letter, number, punctuation or anything.

[] looks for any one of the characters in square brackets, so m[ae]rc will match “marc” and “merc”. You can also specify ranges, e.g. [a-z] will find any letter from “a” to “z”, so [a-d]ad will match “aad”, “bad”, “cad”, and “dad”. Putting a ^ after the [ will look for any character that isn’t in the square brackets: u[^ks]marc will not match “ukmarc” or “usmarc” but will match “unmarc”.

\d a digit, same as [0-9]. Like all the following, counts as one character although written as two.

\D not a digit, same as [^0-9].

\w alphanumeric, including underscore, same as [A-Za-z0-9_]

\W non-alphanumeric, same as [^A-Za-z0-9_]

\s whitespace characters, e.g. spaces, tabs

\S non-whitespace characters

\b word boundary: the beginning or end of words (i.e. strings of alphanumeric characters), or the beginning or end of strings.

\ is also used before a special character so you can search for that character itself. E.g. searching on . will look for any character and will match “.”, “d”, or “5”. To look for a full stop only, put \ in front of it, giving \. as the pattern.
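The character types above can be tried out in JavaScript, the language used later in this document. This is a minimal sketch using RegExp.test, which returns true if the pattern is found anywhere in the string. The variable names and the sample sentence are made up for illustration.

```javascript
// test() returns true if the pattern matches anywhere in the string.
const classResults = {
  marc: /m[ae]rc/.test("marc"),        // true: "a" is in [ae]
  mirc: /m[ae]rc/.test("mirc"),        // false: "i" is not in [ae]
  cad: /[a-d]ad/.test("cad"),          // true: "c" is in the range a-d
  unmarc: /u[^ks]marc/.test("unmarc"), // true: "n" is neither "k" nor "s"
  ukmarc: /u[^ks]marc/.test("ukmarc"), // false: "k" is excluded
};

// \d is a single digit, so \d\d\d\d matches a run of four digits.
const year = "Published in 2012.".match(/\d\d\d\d/)[0]; // "2012"

// An unescaped . matches any character; \. matches only a full stop.
const anyChar = /a.c/.test("abc");       // true
const fullStopOnly = /a\.c/.test("abc"); // false
const fullStop = /a\.c/.test("a.c");     // true

console.log(classResults, year, anyChar, fullStopOnly, fullStop);
```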

Starts and Ends

^ matches the start of any string. So, in “marc must die” ^marc will match “marc” but ^must will match nothing.

$ matches the end of any string. So, in “marc must die” die$ will match “die” but must$ will match nothing.
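The two anchors can be checked against the same phrase in JavaScript. This is a small sketch; the variable names are illustrative only.

```javascript
const phrase = "marc must die";

const startsWithMarc = /^marc/.test(phrase); // true: "marc" is at the start
const startsWithMust = /^must/.test(phrase); // false: "must" is in the middle
const endsWithDie = /die$/.test(phrase);     // true: "die" is at the end
const endsWithMust = /must$/.test(phrase);   // false: "must" is not at the end

// By contrast, \b only needs a word boundary, so it finds "must" mid-string.
const wholeWordMust = /\bmust\b/.test(phrase); // true

console.log(startsWithMarc, startsWithMust, endsWithDie, endsWithMust, wholeWordMust);
```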

Numbers of Characters

* matches the preceding element zero or more times, e.g. catalogu*ing will match “cataloging”, “cataloguing”, as well as “cataloguuing” and “cataloguuuuuuuuuuing”.

? matches the preceding element zero or one times, e.g. catalogu?ing will match “cataloging” and “cataloguing” but not “cataloguuing”. See also ? below.

+ matches the preceding element one or more times, e.g. catalogu+ing will match “cataloguing”, “cataloguuing”, and “cataloguuuuuuuuuuing”, but not “cataloging”.

{n} matches the preceding element exactly n times, e.g. catalogu{10}ing will match “cataloguuuuuuuuuuing” but not “cataloging”, “cataloguing”, or “cataloguuing”.

{m,n} matches the preceding element at least m times and no more than n times, e.g. catalogu{1,2}ing will match “cataloguing” and “cataloguuing” but not “cataloging”.

? also has a special meaning when placed after *, +, ? or {}: it makes the match as short as possible (lazy) instead of as long as possible (greedy). E.g. looking for catalog.*ing in “cataloguing is ace. I love cataloguing” will greedily find the whole of “cataloguing is ace. I love cataloguing”, because the .* could match either “u” or “uing is ace. I love catalogu” and takes the longer of the two. Amending the regular expression to catalog.*?ing will find only the first “cataloguing”.
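The quantifiers above can be compared side by side in JavaScript. This is a sketch only: the matching helper and the list of spellings are made up for illustration, and {2} stands in for {10} to keep the test words short.

```javascript
const spellings = ["cataloging", "cataloguing", "cataloguuing"];

// Hypothetical helper: keep only the spellings that the pattern matches.
function matching(pattern) {
  return spellings.filter((word) => pattern.test(word));
}

const star = matching(/catalogu*ing/);         // all three spellings
const optional = matching(/catalogu?ing/);     // cataloging, cataloguing
const plus = matching(/catalogu+ing/);         // cataloguing, cataloguuing
const exactlyTwo = matching(/catalogu{2}ing/); // cataloguuing
const oneOrTwo = matching(/catalogu{1,2}ing/); // cataloguing, cataloguuing

// Greedy versus lazy matching on the sentence from the text.
const sentence = "cataloguing is ace. I love cataloguing";
const greedy = sentence.match(/catalog.*ing/)[0]; // the whole sentence
const lazy = sentence.match(/catalog.*?ing/)[0];  // "cataloguing"

console.log(star, optional, plus, exactlyTwo, oneOrTwo, greedy, lazy);
```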

Grouping

() groups characters together. This has a variety of uses. The group can be used as if it were a single character, e.g. (meta)* looks for the string “meta” zero or more times. It can also be used for capturing smaller parts of the expression for later use, e.g. catalog(.*) will match anything starting “catalog” but will also store what comes afterwards as $1.

| [pipe] allows alternatives either side of it, e.g. marc|rdf will match “marc” or “rdf”. Smaller alternatives can be matched with brackets, e.g. (uk|us)marc will match “ukmarc” or “usmarc” (and if there is a match will store “uk” or “us” as $1).
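In JavaScript, captured groups can be read straight from the result of match. This is a minimal sketch with illustrative variable names. It relies on match being called without the g flag, in which case the returned array holds the whole match at index 0 and each captured group after it.

```javascript
// Without the g flag, index 0 is the whole match and index 1 is the first group.
const found = "usmarc".match(/(uk|us)marc/);
const wholeMatch = found[0]; // "usmarc"
const prefix = found[1];     // "us", the group the text calls $1

// A quantifier placed after a group applies to the whole group.
const repeated = "metametadata".match(/(meta)*data/)[0]; // "metametadata"

// catalog(.*) captures whatever follows "catalog".
const rest = "cataloguing".match(/catalog(.*)/)[1]; // "uing"

console.log(wholeMatch, prefix, repeated, rest);
```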

Regular Expressions in Javascript

To get matches, use string.match(//). The regular expression goes between the forward slashes. Put a g after the second slash to search for all matches, rather than just the first one. Put an i after the second slash to do a case-insensitive search. String.match returns an array of matches, or null if it finds nothing.

var hits = "team".match(/i/g);

hits is null as there is no “i” in “team”.

var text = "Fox in socks in box on Knox";
var hits = text.match(/\w*ox\b/g);

hits is an array of three elements, each a word ending in “ox”: ["Fox", "box", "Knox"].
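The effect of the g and i flags, and the null result, can be seen by running variations on the same string. This is a sketch with made-up variable names. The last line shows one common way of guarding against null before counting matches, which is a suggestion and not something the original handout prescribes.

```javascript
const rhyme = "Fox in socks in box on Knox";

const firstOnly = rhyme.match(/in/);         // no g: only the first "in"
const allMatches = rhyme.match(/in/g);       // g: every "in"
const caseSensitive = rhyme.match(/fox/);    // null: "Fox" has a capital F
const caseInsensitive = rhyme.match(/fox/i); // i: finds "Fox"

// match() returns null when nothing is found, so fall back to an empty array.
const zCount = (rhyme.match(/z/g) || []).length;

console.log(firstOnly[0], firstOnly.index, allMatches, caseSensitive, caseInsensitive[0], zCount);
```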

To search and replace within a string, use string.replace(//, ""). The regular expression goes between the forward slashes. The g and i work in the same way. The string to replace matches with goes after the comma. You can insert subexpressions captured with round brackets by using $1 for the first, $2 for the second, and so on (see Grouping above and the example below). String.replace returns the string with replacements made:

var text = "I love MARC. I think MARC is the future.";
text = text.replace(/MARC/g, "linked data");

text is now “I love linked data. I think linked data is the future.”

var text = "UKMARC is better than USMARC";
text = text.replace(/(.*?MARC) is better than (.*?MARC)/gi, "$2 is better than $1");

Now, “USMARC is better than UKMARC”. Run the replacement again, and history is reset.

Examples

ISBN (from Thingology blog) ([0-9]{9}[0-9X]|(978|979)[0-9]{10})

UK Postcode (from Wikipedia) (GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})
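As a rough sketch of how these two patterns might be used, the code below wraps each one in ^ and $ so that the whole string has to match, not just part of it. The sample values and variable names are illustrative. Note that the patterns only check the shape of the value: the ISBN pattern does not verify the check digit or allow hyphens, and the postcode pattern expects capital letters and a single space.

```javascript
// Patterns copied from the examples above, anchored with ^ and $.
const isbn = /^([0-9]{9}[0-9X]|(978|979)[0-9]{10})$/;
const postcode = /^(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})$/;

const isbn10 = isbn.test("0306406152");          // true: ten digits
const isbn10WithX = isbn.test("080442957X");     // true: nine digits then X
const isbn13 = isbn.test("9780306406157");       // true: 978 then ten digits
const hyphenated = isbn.test("978-0-306-40615-7"); // false: hyphens not allowed

const longForm = postcode.test("SW1A 1AA");  // true
const shortForm = postcode.test("M1 1AE");   // true
const special = postcode.test("GIR 0AA");    // true: listed explicitly
const lowerCase = postcode.test("sw1a 1aa"); // false: no i flag

console.log(isbn10, isbn10WithX, isbn13, hyphenated, longForm, shortForm, special, lowerCase);
```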