Wednesday 19 November 2014

Using Regular Expressions with PHP

PHP provides rich support for regular expressions. Regular expressions, or RegEx, can be used for pattern matching, for replacing a particular part of a string, or for extracting part of a string.

A RegEx is a string of characters that defines a particular pattern and follows its own rules.
There are two types of RegEx:
  • POSIX Regular Expressions
  • PERL Style Regular Expressions

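We will see only the basic pattern syntax in this tutorial, which is largely common to both styles. Note that the patterns below are written between delimiters, like "/abc/". That is the form PHP's Perl-style preg_ functions expect; the POSIX ereg functions take the pattern without delimiters.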

What is a Regular Expression

A RegEx is a string of characters. For example, a is a regex, \"([^\"]+)\" is also a regex, and so is [0-9]+([a-z]).* At first sight these look very weird, but as we go through the tutorial the patterns will become easy to understand.

Matching with literals

Literals match exactly the characters they specify. For example, "/abc/" is a regular expression. It matches any string that contains "abc" as a substring, like abcdef, xyzabc, xyabcef etc. The forward slashes at the beginning and end are called delimiters. They mark the start and end of the pattern. The opening and closing delimiter should be the same character, and it cannot be a backslash, whitespace or any alphanumeric character; that is, you can use /, |, : etc.
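As a minimal sketch of literal matching, the snippet below uses PHP's preg_match function, which accepts delimited patterns like the ones above. The helper name matches() is hypothetical and only there to keep the calls short.

```php
<?php
// Hypothetical helper: true when $subject contains a match for $pattern.
function matches(string $pattern, string $subject): bool
{
    // preg_match returns 1 on a match, 0 on no match and false on error.
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/abc/', 'abcdef'));  // bool(true)
var_dump(matches('/abc/', 'xyzabc'));  // bool(true)
var_dump(matches('/abc/', 'abxc'));    // bool(false)
var_dump(matches('#abc#', 'xyabcef')); // bool(true), '#' works as a delimiter too
```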

Matching start and end

Consider the pattern "/abc/". As we saw, it will match "abcdef" and "xyzabcdef". Suppose we want "abc" to appear only at the beginning, that is, we don't want to match "xyzabcdef". We can use ^ for this purpose. Anything that comes after ^ must appear at the beginning of the subject string. Thus "/^abc/" will match "abcdef" but not "xyzabcdef".
Just like ^ matches the start, $ matches the end of the string. So "/abc$/" will match neither "abcdef" nor "xyzabcdef", but it will match "defabc".
Some particular examples:
"/^$/" will match the empty string.
"/^abc$/" will match only "abc", i.e. none of "abcdef", "xyzabcdef" or "defabc" is matched; only "abc" gets matched.

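The anchors ^ and $ can be tried out with a short script. This is only a sketch: it assumes PHP's preg_grep function, which returns the elements of an array that match a pattern, and the variable names are made up for the illustration.

```php
<?php
$subjects = ['abc', 'abcdef', 'xyzabcdef', 'defabc', ''];
$patterns = ['/^abc/', '/abc$/', '/^abc$/', '/^$/'];

// $results[pattern] = list of subjects that the pattern matches.
$results = [];
foreach ($patterns as $pattern) {
    // preg_grep keeps the matching elements; array_values reindexes them.
    $results[$pattern] = array_values(preg_grep($pattern, $subjects));
}

print_r($results);
```

Here "/^abc/" keeps "abc" and "abcdef", "/abc$/" keeps "abc" and "defabc", "/^abc$/" keeps only "abc", and "/^$/" keeps only the empty string.
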
Giving Range with brackets

Brackets [] can be used in a regex to specify a range. For example, [0-9] matches a single digit from 0 to 9, and [a-z] matches any single lowercase letter. Consider the pattern "/^[a-z][a-z][0-9][0-9]/": it will match any string starting with two lowercase letters followed by two digits. So it will match "aa10" and "xy44", but not "12fv", "ddrt" or "1123". When ^ is used at the start of a pattern, it indicates the start of the subject string. But as the first character inside [] it has the special purpose of negation. For example, "/^[^0-9]/" will match any string whose first character is NOT a digit. Here the first ^ marks the beginning of the string, while the second one, inside the brackets, gives negation.
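The bracket examples can be checked with a few lines of PHP. This is a sketch that assumes the preg_match function; the variable names are illustrative only.

```php
<?php
// Two lowercase letters followed by two digits, anchored at the start.
$pattern = '/^[a-z][a-z][0-9][0-9]/';

$checks = [];
foreach (['aa10', 'xy44', '12fv', 'ddrt', '1123'] as $subject) {
    $checks[$subject] = preg_match($pattern, $subject) === 1;
}
var_dump($checks); // true for aa10 and xy44, false for the rest

// Inside brackets a leading ^ negates the class: first character is not a digit.
$not_digit_first = '/^[^0-9]/';
$letter_first = preg_match($not_digit_first, 'abc123'); // 1
$digit_first  = preg_match($not_digit_first, '9abc');   // 0
$empty        = preg_match($not_digit_first, '');       // 0, there is no first character
```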

Giving choices

Suppose we want to match a pattern where the first character is either a digit or a letter and is followed by two digits. From the above examples, a simple solution would be to first check "/^[a-z][0-9][0-9]/" and, if it does not match, to check "/^[0-9][0-9][0-9]/". But this is not a good solution, as you have to write a case for each possible choice. For example, consider a date-month-year pattern, where the date can be 0 followed by a digit, or 1 followed by a digit, or 2 followed by a digit, or 3 followed by either 0 or 1; the month can be 0 followed by a digit, or 1 followed by 0, 1 or 2; and the year is any two digits. Writing code for this with the above method would be really cumbersome and error-prone. Fortunately, RegEx provides the | symbol for making a choice. Consider our first example: if we want a letter OR a digit followed by two digits, our pattern would be "/^([a-z]|[0-9])[0-9][0-9]/". | serves as OR in patterns. Remember that | needs a pattern on both sides. Also note that | applies to everything on either side of it: the pattern "/a|bc/" matches a OR (b and then c), not (a OR b) and then c. To get the latter, group the choice with parentheses: "/(a|b)c/".
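The following sketch tries out alternation with PHP's preg_match. Keep in mind that | has the lowest precedence, so "/a|bc/" means a OR bc unless the choice is grouped with parentheses. The date pattern is only an assumption built from the description above: the hyphen separator and the ^ and $ anchors are not part of the original text.

```php
<?php
// Grouped choice: one lowercase letter OR one digit, then two digits.
$grouped = '/^([a-z]|[0-9])[0-9][0-9]/';
$letter_then_digits = preg_match($grouped, 'a12'); // 1
$three_digits       = preg_match($grouped, '345'); // 1
$two_letters        = preg_match($grouped, 'ab1'); // 0

// Without parentheses, | splits the whole pattern: "a" OR "bc".
$a_alone_loose   = preg_match('/a|bc/', 'a');    // 1, the "a" branch matches
$a_alone_grouped = preg_match('/(a|b)c/', 'a');  // 0, a "c" must follow
$ac_grouped      = preg_match('/(a|b)c/', 'ac'); // 1

// Assumed date-month-year layout "dd-mm-yy", following the rules in the text.
$date = '/^([0-2][0-9]|3[01])-(0[0-9]|1[0-2])-[0-9][0-9]$/';
$valid_date  = preg_match($date, '31-12-14'); // 1
$bad_day     = preg_match($date, '32-12-14'); // 0
$bad_month   = preg_match($date, '15-13-14'); // 0
```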

Using Quantifiers

Quantifiers are used to match long repeating runs of a pattern. For example, assume that we want to match a string containing only digits. It is not possible to do this directly using any of the above features. Quantifiers are provided for situations like this.
They are +, *, ? and {}. (The ^ and $ symbols we have already seen are anchors, not quantifiers.) Here is a short explanation of each.

Quantifier    Use
*             Matches zero or more occurrences of the preceding pattern.
+             Matches one or more occurrences of the preceding pattern.
?             Matches zero or one occurrence of the preceding pattern.
{min,max}     Matches the preceding pattern at least min and at most max times.
{min,}        Matches the preceding pattern at least min times.


Here are some examples:
Pattern       Matches
a*            The empty string, a, aa, aaa, aaaa, ...
a+            a, aa, aaa, aaaa, ... (can be thought of as aa*)
a?            The empty string or a.
a{2,5}        aa, aaa, aaaa or aaaaa.
a{2,}         aa, aaa, aaaa, ...
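To see the quantifiers in action, here is a small sketch using PHP's preg_match. The patterns are wrapped in ^ and $ so that the whole string has to fit the quantifier; the variable names are illustrative.

```php
<?php
// A string containing only digits: one or more digits between ^ and $.
$only_digits = '/^[0-9]+$/';
$all_digits  = preg_match($only_digits, '20141119'); // 1
$with_hyphen = preg_match($only_digits, '2014-11');  // 0
$empty_plus  = preg_match($only_digits, '');         // 0, + needs at least one digit

$empty_star   = preg_match('/^a*$/', '');           // 1, * allows zero occurrences
$two_optional = preg_match('/^a?$/', 'aa');         // 0, ? allows at most one
$three_in_2_5 = preg_match('/^a{2,5}$/', 'aaa');    // 1
$one_in_2_5   = preg_match('/^a{2,5}$/', 'a');      // 0, fewer than two
$six_in_2_5   = preg_match('/^a{2,5}$/', 'aaaaaa'); // 0, more than five
$six_in_2_up  = preg_match('/^a{2,}$/', 'aaaaaa');  // 1, no upper limit
```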


Escaping

Sometimes you want to match symbols like '[' or '/' in the subject string, but these symbols are also part of the pattern syntax. Thus, it is necessary to distinguish whether we want to use a particular symbol as a literal or as part of the RegEx. "\" (without quotes) is used for this. So if you want to match / you would actually write \/. This is called an escape sequence. The same goes for other symbols, like \[, \" etc.
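Here is a brief sketch of escaping in PHP. Besides escaping by hand, it uses preg_quote, a PHP function that is not mentioned above and that escapes regex special characters in a string for you. The sample strings are made up.

```php
<?php
// Escaping by hand: \/ and \[ \] are matched literally.
$manual = preg_match('/\/home\[1\]/', 'see /home[1] for details'); // 1

// preg_quote escapes the special characters for us; the second argument
// names the delimiter so that it gets escaped as well.
$literal = 'a/b[1]';
$quoted  = preg_quote($literal, '/');
echo $quoted, "\n"; // a\/b\[1\]

$found = preg_match('/' . $quoted . '/', 'value of a/b[1] is 3'); // 1
```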

Remembering with parenthesis

Suppose you want to extract an IP address from a line of text. IP addresses look like 127.0.0.1 or 192.168.2.6. They are four numbers separated by dots, and say we want all four numbers separately. With what we have learned so far this is hard to do, but with the help of parentheses it becomes really easy. The pattern "/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/" will match the IP address. Now, if you want to extract some part of the matched string, you can put parentheses around that part of the pattern and then use references to it. Thus, "/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/" can be used to remember the four numbers of the IP address so that they can be accessed later. We will see how to reference them when we look at the PHP functions that use RegEx. Also note that I have escaped the . because it has a special meaning, as we will see next.
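As a sketch of how the remembered parts can be read back, PHP's preg_match takes an optional third argument that it fills with the whole match followed by each parenthesised group. The sample line of text is invented for the illustration.

```php
<?php
$line    = 'Connection from 192.168.2.6 accepted'; // made-up sample text
$pattern = '/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/';

$matches = [];
if (preg_match($pattern, $line, $matches) === 1) {
    // $matches[0] is the whole match; $matches[1] to $matches[4] are the groups.
    echo $matches[0], "\n";                             // 192.168.2.6
    echo implode(', ', array_slice($matches, 1)), "\n"; // 192, 168, 2, 6
}
```

Note that this pattern only checks the shape of the address; it does not check that each number is in the range 0 to 255.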

The Dot

There is a special symbol "." which is used for matching any one character (by default, any character except a newline). You can use it when you don't know the actual characters in the pattern but you do know the text, pattern or symbols that surround the part you are after. Combined with the quantifiers above, . is really helpful in many cases. For example, suppose you want to find the text in a line between two hash symbols. You don't know what the text is or how long it is; it could even be empty. We use . in situations like this. So for the above scenario, "/#(.*)#/" will match #AnyText#, and using parentheses we can extract the text between the hashes.
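The hash example can be sketched with preg_match as follows; the sample strings are made up. The last case shows a point worth knowing: .* is greedy, so when a line contains more than two hashes it captures everything up to the last one.

```php
<?php
$pattern = '/#(.*)#/';

$matches = [];
preg_match($pattern, 'before #AnyText# after', $matches);
echo $matches[1], "\n"; // AnyText

// The text between the hashes may be empty.
$empty = [];
preg_match($pattern, '##', $empty);
var_dump($empty[1]); // string(0) ""

// Greedy matching: .* runs up to the LAST hash on the line.
$greedy = [];
preg_match($pattern, '#a# and #b#', $greedy);
echo $greedy[1], "\n"; // a# and #b
```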

You can practice regular expressions on the following site:
regex101.com
