Ashley Sheridan.co.uk

Remove Rubbish Microsoft Markup


This is an amendment to the function I wrote to fix the broken characters that Microsoft Office inserts into content that you paste into your CMS. The update adds an optional parameter to the function that can strip out the MS-specific styles that are only recognised by IE and cause display issues in other browsers.

The problems this Microsoft markup causes have been considerable, and have led to many extra hours of work repeatedly fixing, by hand, content inserted through a CMS rich-text editor. It has stopped content from displaying in Firefox, and even in the FCKEditor that was used to insert it. Here is a rough first effort to clean up the content before it's inserted into the database or file.

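function removeMSCrap($crap, $richText = false) {
    // chr() wraps values above 255, so the Unicode characters are written as
    // UTF-8 bytes. The chr() keys are the Windows-1252 codes for the same characters.
    $map = array(
        chr(128) => '&euro;',
        chr(133) => '&#133;',
        "\xE2\x80\xB3" => '&#8243;',                      // double prime
        "\xE2\x80\x98" => '&#039;', chr(145) => '&#039;', // left single quote
        "\xE2\x80\x99" => '&#039;', chr(146) => '&#039;', // right single quote
        "\xE2\x80\x9C" => '&#034;', chr(147) => '&#034;', // left double quote
        "\xE2\x80\x9D" => '&#034;', chr(148) => '&#034;', // right double quote
        "\xE2\x80\xA2" => '&#149;', chr(149) => '&#149;', // bullet
        "\xE2\x80\x93" => '&#150;', chr(150) => '&#150;', // en dash
        "\xE2\x80\x94" => '&#151;', chr(151) => '&#151;', // em dash
        chr(153) => '&#153;',
        chr(169) => '&copy;',
        chr(174) => '&reg;',
    );
    // strtr() tries the longest keys first and never rescans its own output.
    $roses = strtr($crap, $map);

    if ($richText) {
        // Non-greedy: each match stops at the first </meta> after a <meta tag.
        $roses = preg_replace('/<meta\s.+?<\/meta>/is', '', $roses);
        $roses = str_replace('</meta>', '', $roses);
    }
    return $roses;
}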

This is generally the same as the last function I wrote with the addition of the final preg_replace() call, which uses a non-greedy regular expression to remove any meta tags and their contents. For this reason, you should use this function only on excerpts of text, not the whole HTML page, as you don't want to remove all the content in your <head> tags!
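To show why a whole page is risky, here is a small sketch that runs a non-greedy version of the pattern over a made-up full document. The `$page` string and the `$pattern` variable are assumptions for illustration only. The `<meta>` tags in a normal `<head>` have no closing `</meta>`, so the match starts at the first one and carries on until it reaches the first `</meta>` in the pasted content.

```php
<?php
// Hypothetical full page: a normal <head> plus Word-style markup pasted into the body.
$page = '<html><head><meta charset="utf-8"><title>My page</title></head>'
      . '<body><p>Intro</p><meta name="Generator" content="Microsoft Word"></meta>'
      . '<p>Pasted text</p></body></html>';

// Non-greedy pattern, assumed here for illustration.
$pattern = '/<meta\s.+?<\/meta>/is';

$cleaned = preg_replace($pattern, '', $page);

echo $cleaned, "\n";
// prints: <html><head><p>Pasted text</p></body></html>
```

The title, the end of the head, the opening body tag and the intro paragraph all disappear along with the Word markup.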

Because the expression is non-greedy, each match stops at the first </meta> it finds, so there's a final str_replace() call to clean up any straggling </meta> tags left behind.
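As a rough illustration of the difference, the sketch below runs a greedy and a non-greedy pattern over the same made-up excerpt, then removes the leftover closing tag. The `$excerpt` string, including its unmatched `</meta>`, is a hypothetical example rather than real Word output.

```php
<?php
// Hypothetical excerpt pasted from Word, with one unmatched closing tag left over.
$excerpt = '<meta name="ProgId" content="Word.Document"></meta><p>First</p>'
         . '<meta name="Generator" content="Microsoft Word"></meta></meta><p>Second</p>';

// Greedy: runs from the first <meta to the LAST </meta>, taking <p>First</p> with it.
$greedy = preg_replace('/<meta\s.+<\/meta>/is', '', $excerpt);

// Non-greedy: each match stops at the nearest </meta>, so the paragraph survives.
$lazy = preg_replace('/<meta\s.+?<\/meta>/is', '', $excerpt);

// The unmatched </meta> is still there, hence the extra str_replace() pass.
$clean = str_replace('</meta>', '', $lazy);

echo $greedy, "\n"; // <p>Second</p>
echo $lazy, "\n";   // <p>First</p></meta><p>Second</p>
echo $clean, "\n";  // <p>First</p><p>Second</p>
```

The only change between the two patterns is the `?` after the `+`, which tells the quantifier to match as little as possible.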

The content is safe to remove simply because meta tags just don't belong in the <body>, which is probably why Microsoft decided it was a good idea to put them there for its own software to make use of.