Categories

A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.

Brief Introduction to Regular Expressions

This is a web version for reference of a docx file originally produced for the Mashcat 2012 session I did called How Big Is My Book. Resurrected to form the Manual for Meret, a regular expressions tutorial based on Marcedit examples.

Literals

Characters as you type them. E.g. i will look for a letter “i”. ii will look for two letter “i”s in a row. Eldorado will look for the exact string “Eldorado”, and 1234s will look for “1234s”.

Types of Character

There are a number of ways of looking for specific types of character:

. looks for any single character. It could be a letter, number, punctuation or anything.

[] looks for any one of the characters in square brackets, so m[ae]rc will match “marc” and “merc”. You can also specify ranges, e.g. [a-z] will find any letter from “a” to “z”, so [a-d]ad will match “aad”, “bad”, “cad”, and “dad”. Putting a ^ after the [ will look for any character that isn’t in the square brackets: u[^ks]marc will not match “ukmarc” or “usmarc” but will match “unmarc”.

\d a digit, same as [0-9]. Like all the following, counts as one character although written as two.

\D not a digit, same as [^0-9].

\w alphanumeric, including underscore, same as [A-Za-z0-9_]

\W non-alphanumeric, same as [^A-Za-z0-9_]

\s whitespace characters, e.g. spaces, tabs

\S non-whitespace characters

\b word boundary: the beginning or end of words (i.e. strings of alphanumeric characters), or the beginning or end of strings.

\ is also used before a special character so you can search on it. E.g, searching on . will look for any character and will match “.”, “d”, or “5”. To look for a full-stop, put \ in front: \..

Starts and Ends

^ matches the start of any string. So, in “marc must die” ^marc will match “marc” but ^must will match nothing.

$ matches the end of any string. So, in “marc must die” die$ will match “die” but must$ will match nothing.

Numbers of Characters

* matches the preceding element zero or more times, e.g. catalogu*ing will match “cataloging”, “cataloguing”, as well as “cataloguuing” and “cataloguuuuuuuuuuing”.

? matches the preceding element zero or one times, e.g. catalogu?ing will match “cataloging” and “cataloguing” but not “cataloguuing”. See also ? below.

+ matches the preceding element one or more times, e.g. catalogu+ing will match “cataloguing”, “cataloguuing”, and “cataloguuuuuuuuuuing”, but not “cataloging”.

{n} matches the preceding element exactly n times, e.g. catalogu{10}ing will match “cataloguuuuuuuuuuing” but not “cataloging”, “cataloguing”, or “cataloguuing”.

{m,n} matches the preceding element at least m times and no more than n times.

? also has a special meaning to restrict matches of multiple characters, e.g. looking for catalog.*ing in “cataloguing is ace. I love cataloguing” will greedily find “cataloguing is ace. I love cataloguing” as the .* matches both “uing is ace. I love catalogu” and “u”. Amending the regular expression to catalog.*?ing will find only “cataloguing”.

Grouping

() groups characters together. This has a variety of uses. The group can be used a single character, e.g. (meta)* looks for the string “meta” zero or more times. It can also be used for capturing smaller parts of the expression for later use, e.g. catalog(.*) will match anything starting “catalog” but will also store what comes afterwards as $1.

| [pipe] allows alternatives either side of it, e.g. marc|rdf will match “marc” or “rdf”. Smaller alternatives can be matched with brackets, e.g. (uk|us)marc will match “ukmarc” or “usmarc” (and if there is a match will store “uk” or “us” as $1).

Regular Expressions in Javascript

To get matches, use string.match(//). The regular expression goes between the forward slashes. Put a g after the second slash to search for all matches, rather than just the first one. Put an i after the second slash to do a case-insensitive search. String.match returns an array of matches, or null if it finds nothing.

var hits = “team”.match(/i/g);

hits is null as there is no “i” in “team”.

var text = “Fox in socks in box on Knox”;
 var hits = text.match(/\w*ox\b/g);

hits is an array of three elements, all a series of words ending in “ox”: [“Fox”, “box”, “Knox”].

To search and replace within string, use string.replace(//, ””). The regular expression goes between the forward slashes. The g and i work in the same way. The string to replace matches with goes after the comma. You can insert subexpressions captured with round brackets by using $1 for the first, $2 for the second, and so on (see Grouping above and the example below). String.replace returns the string with replacements made:

To search and replace within string, use string.replace(//, ””). The regular expression goes between the forward slashes. The g and i work in the same way. The string to replace matches with goes after the comma. You can insert subexpressions captured with round brackets by using $1 for the first, $2 for the second, and so on (see Grouping above and the example below). String.replace returns the string with replacements made:

var text = “I love MARC. I think MARC is the future.”;
 text = text.replace(/MARC/g, ”linked data”);

text is now “I love linked data. I think linked data is the future.”

var text = “UKMARC is better than USMARC”;
 text=text.replace(/(.*?MARC) is better than (.*?MARC)/gi, “$2 is better than $1”);

Now, “USMARC is better than UKMARC”. Run the replacement again, and history is reset.

Examples

ISBN (from Thingology blog) ([0-9]{9}[0-9X]|(978|979)[0-9]{10})

UK Postcode (from Wikipedia) (GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})

Comments are closed.