The Spotless Developer Blog


Using PHP and regex to parse BBCodes

By Dustin Hendricks - January 2nd, 2011

Today we show how to use PHP and regular expressions to replace BBCodes found in text. BBCodes, or Bulletin Board Codes, are lightweight markup used mainly to format bulletin board posts. Most bulletin boards use BBCodes instead of allowing raw HTML in posts, because malformed HTML can make a mess of a bulletin board's layout and, more importantly, can let users inject malicious JavaScript (XSS) into the page, possibly harming visitors who view the post.

A BBCode is a markup tag typically consisting of a keyword wrapped in square brackets.

[b]Bolded text![/b]

Using a regular expression, we want to replace that with the following.

<strong>Bolded text!</strong>

A regular expression is an extremely versatile way of matching patterns within a string. PHP, like most languages, lets us perform very complex substring replacements within a given string using regular expressions. To do this we will use the PHP preg_replace() function.

<?php
$input_string = 'foo [b]bolded text[/b] bar';
$regex = '/\[b](.+?)\[\/b]/is';
$replacement_string = '<strong>$1</strong>';
echo preg_replace($regex, $replacement_string, $input_string);
?>

I will try to explain the regular expression used. Any character with a special meaning in the regular expression that we want to match literally must be escaped with a backslash. This is why you see a backslash before each literal '[' and '/' in our pattern: the '[' would otherwise start a character class, and the '/' would otherwise be read as the end of the pattern. The closing ']' needs no backslash here, because outside of a character class it is already treated as a literal.
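Escaping by hand gets error-prone as patterns grow. As a small sketch, PHP's preg_quote() function can do the escaping for us. The helper name bbcode_tag_pattern() is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: builds a regex for a simple paired BBCode tag.
function bbcode_tag_pattern($tag)
{
    // preg_quote() escapes regex special characters. The second
    // argument makes it escape our delimiter, the forward slash, too.
    $open  = preg_quote('[' . $tag . ']', '/');
    $close = preg_quote('[/' . $tag . ']', '/');

    return '/' . $open . '(.+?)' . $close . '/is';
}

echo bbcode_tag_pattern('b');
```

Note that preg_quote() also escapes the closing ']', which is harmless: the resulting pattern matches the same text as the hand-written one.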

We use non-escaped forward slashes at the beginning and end of the pattern. These delimiters encapsulate the regular expression, and any regular expression modifiers go after the closing slash.
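The forward slash is only one possible delimiter. As a quick sketch, the same pattern can be delimited with '#', in which case the slash in [/b] no longer needs a backslash.

```php
<?php
$input_string = 'foo [b]bolded text[/b] bar';

// Same pattern as before, but delimited with '#', so the forward
// slash in [/b] is an ordinary character and needs no escaping.
$regex = '#\[b](.+?)\[/b]#is';

$output_string = preg_replace($regex, '<strong>$1</strong>', $input_string);
echo $output_string; // foo <strong>bolded text</strong> bar
```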

The (.+?) is a capture group: a piece of the pattern whose match is handed back to preg_replace() for use within the replacement string. Each sub-pattern wrapped in parentheses captures a value, which can be accessed within the replacement string in sequential order using a dollar sign and an index ($1, $2, $3, etc). This is how our replacement string is able to contain whatever sits between the opening and closing BBCode tags.
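To see what each group captures, it can help to run the pattern through preg_match() first. The [color] tag and the inline style below are assumptions made for this sketch, not something the examples in this post rely on.

```php
<?php
$input_string = '[color=red]warning[/color]';

// Two capture groups: the first holds the color, the second the text.
$regex = '/\[color=([a-z]+)](.+?)\[\/color]/is';

// preg_match() fills $matches: index 0 is the whole match, and
// indexes 1, 2, ... line up with $1, $2, ... in a replacement string.
preg_match($regex, $input_string, $matches);
print_r($matches);

$output_string = preg_replace($regex, '<span style="color: $1">$2</span>', $input_string);
echo $output_string; // <span style="color: red">warning</span>
```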

Within the parentheses, the period matches any single character, the plus sign modifies the period so that it matches one or more characters instead of just one, and the question mark makes the plus sign lazy (non-greedy), so it matches as few characters as possible. Without the question mark the plus sign is greedy and grabs as many characters as it can while still letting the pattern match. So a regex of '/\[b](.+)\[\/b]/is' against a string of '[b]foo[/b][b]bar[/b]' would match the entire string as a single match, while a regex of '/\[b](.+?)\[\/b]/is' would match '[b]foo[/b]' first and then '[b]bar[/b]' as a separate match.
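The difference is easy to check with preg_match(), which returns only the first match. This is a small sketch using the same two patterns and the same input string.

```php
<?php
$input_string = '[b]foo[/b][b]bar[/b]';

// Greedy: .+ runs all the way to the last [/b] in the string.
preg_match('/\[b](.+)\[\/b]/is', $input_string, $greedy);

// Lazy: .+? stops at the first [/b] it reaches.
preg_match('/\[b](.+?)\[\/b]/is', $input_string, $lazy);

echo $greedy[1], "\n"; // foo[/b][b]bar
echo $lazy[1], "\n";   // foo
```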

The last two characters of the regular expression are modifiers. The 'i' makes the match case-insensitive, so [B] is treated the same as [b]. The 's' lets the period match newline characters as well, which it does not do by default, so a tag's contents can span multiple lines.
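Here is a short sketch of what each modifier changes, using an upper-case tag whose contents span two lines. The variable names are only for illustration.

```php
<?php
$input_string = "[B]line one\nline two[/B]";
$replacement_string = '<strong>$1</strong>';

// No modifiers: [b] does not match [B], so nothing is replaced.
$plain = preg_replace('/\[b](.+?)\[\/b]/', $replacement_string, $input_string);

// 'i' only: the tags now match, but the period cannot cross the
// newline, so the pattern as a whole still fails.
$case_only = preg_replace('/\[b](.+?)\[\/b]/i', $replacement_string, $input_string);

// 'i' and 's': the tags match and the period crosses the newline.
$both = preg_replace('/\[b](.+?)\[\/b]/is', $replacement_string, $input_string);

echo $both;
```

Only the last call changes the string. The first two return the input untouched.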

More complex BBCodes such as the [url] tag can have parameters, so we need to make a slightly more complex replacement.

<?php
$input_string = '[url=www.spotlesswebdesign.com]Spotless Web Design[/url]';
$regex = '/\[url=([^"]+?)](.+?)\[\/url]/is';
$replacement_string = '<a href="$1">$2</a>';
echo preg_replace($regex, $replacement_string, $input_string);
?>

Notice that in this example, we use the character class [^"] instead of a period. This character class matches any character except a double quote mark. It stops users from closing the href attribute early and slipping extra attributes or JavaScript into the anchor tag. Keep in mind that it does not, on its own, block javascript: URLs or escape HTML found elsewhere in the post.
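One way to tighten this further is to escape the whole post with htmlspecialchars() before replacing BBCodes, and to check the URL scheme in a callback. The sketch below is one possible approach rather than a complete sanitizer. The helper name bbcode_url_to_html() is hypothetical, and allowing only http and https links is an assumption of this example, which is also why the sample URL carries an explicit scheme.

```php
<?php
// Hypothetical helper: converts [url=...] tags into anchors.
function bbcode_url_to_html($input_string)
{
    // Escape HTML first so raw tags and quotes in the post are inert.
    $escaped = htmlspecialchars($input_string, ENT_QUOTES, 'UTF-8');

    // The URL may not contain ']' or whitespace.
    $regex = '/\[url=([^\]\s]+?)](.+?)\[\/url]/is';

    return preg_replace_callback($regex, function ($matches) {
        $url = $matches[1];
        // Assumption: only http and https links are allowed. Anything
        // else, such as a javascript: URL, is left as plain text.
        if (!preg_match('/^https?:\/\//i', $url)) {
            return $matches[0];
        }
        return '<a href="' . $url . '">' . $matches[2] . '</a>';
    }, $escaped);
}

echo bbcode_url_to_html('[url=http://www.spotlesswebdesign.com]Spotless Web Design[/url]');
```

Using preg_replace_callback() instead of preg_replace() lets us run ordinary PHP code on each match before deciding what to output.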

The last thing I would like to go over is how to use one preg_replace() call to replace multiple BBCodes at once. The preg_replace() function accepts multiple regexes and replacement strings if you pass them as arrays. This means you can do multiple replacements with a single call by doing the following.

<?php
$input_string = 'foo [b]bolded text[/b] bar foo [i]italicized text[/i] bar';
$regexs = array(
    '/\[b](.+?)\[\/b]/is',
    '/\[i](.+?)\[\/i]/is'
);
$replacement_strings = array(
    '<strong>$1</strong>',
    '<em>$1</em>'
);
echo preg_replace($regexs, $replacement_strings, $input_string);
?>

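In this example, the matches for each regex in the regex array are replaced using the replacement string at the same position in the replacement string array. The pairs are applied in order, each one working on the result of the one before it.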
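Keeping two parallel arrays in sync by hand can get fiddly as the list of tags grows. As a sketch, both arrays can be built from a single map. The $tag_map array and its [u] entry are assumptions added for illustration.

```php
<?php
// Hypothetical tag map: BBCode tag name => HTML element name.
$tag_map = array(
    'b' => 'strong',
    'i' => 'em',
    'u' => 'u',
);

$regexs = array();
$replacement_strings = array();
foreach ($tag_map as $bbcode_tag => $html_tag) {
    // Escape the tag name in case it ever contains a special character.
    $quoted = preg_quote($bbcode_tag, '/');
    $regexs[] = '/\[' . $quoted . '](.+?)\[\/' . $quoted . ']/is';
    $replacement_strings[] = '<' . $html_tag . '>$1</' . $html_tag . '>';
}

$input_string = 'foo [b]bolded text[/b] bar foo [i]italicized text[/i] bar';
$output_string = preg_replace($regexs, $replacement_strings, $input_string);
echo $output_string;
```

Because the pairs run in order, a nested input such as '[b][i]x[/i][/b]' is handled as well: the [b] pair is replaced first, and the [i] pair is then replaced inside the result.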

Keep in mind that PHP has its own BBCode parsing PECL extension, which may be faster and safer than writing your own regular expressions. I leave it up to you to decide which option is right for you, because in some cases you may not be able to install a PECL extension on your server.

Tags: #php #regex #regexp #regular-expressions #bbcodes #parsing
