Microsoft Word Characters Ruining Your Day?
Recently we were building some HTML files from content one of our clients sent over in an MS Word document. I bet a lot of you are familiar with the fact that Word uses characters that are not web-safe, such as the classic em dash or curly apostrophes and single quotes. Below is an easy-to-use script that cleans out some of the most common MS Word characters that do not behave well on the web. I am aware you can change some character encoding settings, but in most cases do you really need support for the crazy em dash? I prefer to strip them out.
[code]
<?php
$in_file = '2008_06_nl.htm';
$out_file = '2008_06_nl_clean.htm';
$map = array(
128 => '&euro;' // Euro symbol
,133 => '...' // ellipsis
,145 => "'" // single quote left
,146 => "'" // single quote right
,147 => '"' // double quote left
,148 => '"' // double quote right
,150 => '&ndash;' // en dash
,151 => '&mdash;' // em dash
,153 => '&trade;' // trademark
,169 => '&copy;' // copyright mark
,174 => '&reg;' // registered mark
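// Keys must be single byte values (0-255): chr() wraps larger numbers
// modulo 256, so code points such as 8211, 8212 or 8226 would match
// unrelated bytes (chr(8226) is a plain double quote).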
);
/*
// test to show what it will be changing
foreach($map as $k => $v) {
echo chr($k) . ' = ' . $v . "\n";
}
exit;
*/
$contents = file_get_contents($in_file)
or die('could not read input file');
foreach($map as $k => $v) {
$contents = str_replace(chr($k), $v, $contents);
}
file_put_contents($out_file,$contents)
or die('could not write output file');
?>
[/code]
You just set the input file and the output file, then let the script clean things up for you. No more will you be confronted with the little question mark of character uncertainty.
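One caveat: the script works byte by byte, so it assumes the file was saved in Word's usual Windows-1252 encoding. If your file was saved as UTF-8 instead, each of these characters is stored as a three-byte sequence, and chr() can only produce a single byte. Here is a minimal sketch for that case, assuming UTF-8 input. The function name clean_word_utf8 is my own invention for illustration, and the list of characters is just a starting point, not a complete one.

```php
<?php
// Hypothetical helper (not part of the script above): replaces the UTF-8
// byte sequences of common Word punctuation with plain characters or
// HTML entities. Assumes $text is UTF-8 encoded.
function clean_word_utf8($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",        // U+2018 left single quote
        "\xE2\x80\x99" => "'",        // U+2019 right single quote
        "\xE2\x80\x9C" => '"',        // U+201C left double quote
        "\xE2\x80\x9D" => '"',        // U+201D right double quote
        "\xE2\x80\x93" => '&ndash;',  // U+2013 en dash
        "\xE2\x80\x94" => '&mdash;',  // U+2014 em dash
        "\xE2\x80\xA6" => '...',      // U+2026 ellipsis
        "\xE2\x80\xA2" => '&bull;',   // U+2022 bullet
        "\xE2\x84\xA2" => '&trade;',  // U+2122 trademark
    );
    // strtr() with an array swaps every key for its value in one pass.
    return strtr($text, $map);
}

echo clean_word_utf8("It\xE2\x80\x99s 2007\xE2\x80\x932008"), "\n";
// prints: It's 2007&ndash;2008
```

If you are not sure which encoding you have, open the file in a hex viewer: a lone byte like 0x92 for an apostrophe points to Windows-1252, while the sequence E2 80 99 points to UTF-8.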