A lot of the websites we’re linking to are using smart quotes, both single and double. For example, this story on the website of the AFL-CIO has several of them in the headline. When one of our correspondents simply copies and pastes the headline, this is what appears:
Don’t Miss Labor Day’s ‘Escape to the Wild’ Marathon
But once it’s in our database, it’s being shown as:
Don’t Miss Labor Day’s ‘Escape to the Wild’ Marathon
What’s the best way to change all smart quotes, single and double, automatically into characters that will not break in our new database? Because this should be done as the records are entered, it would need to be in Perl, which is what we’re currently using for database entry.
Thanks!
3 September 2008 at 8:50 pm |
One important thing is that you need to make sure that your database is set to UTF-8 character encoding. We ran into this problem on the US LEAP site.
If it is set to UTF-8, email me and I’ll try to remember the other things that went wrong and how to fix them.
Because the links are being displayed in html by php print functions, the problem shouldn’t be an escape character, but rather a database encoding problem.
These are painful… brace yourself!
4 September 2008 at 1:21 am |
You need to select the correct character set to display the message correctly.
The Character Encoding option will do this. Most software has “Auto-detect”. Most documents used to be “Windows” (Windows-1252), but these days it’s Unicode (UTF-8).
Without looking at the source code, my guess is that a switch to Unicode could fix the problem.
5 September 2008 at 8:47 am |
The page is Unicode and the database is Unicode. But the stuff being put up on websites like that of the AFL-CIO is not, and simply copying and pasting from their web page into our Unicode database is what’s causing the problem.
What I was hoping someone would suggest is a couple of lines of Perl code to substitute Unicode-compliant characters for the ones that are breaking our page.
5 September 2008 at 2:45 pm |
Could you run the input through HTML::Tidy and specify utf8 as the output encoding? My Perl is EXTREMELY rusty but Tidy is a good Swiss Army knife for these kinds of things…
5 September 2008 at 2:47 pm |
i think the problem is m$ word. its ‘smart quotes’ aren’t smart when it comes to the web. especially when you’ve got a non-US audience with unknown character set/font abilities in their browsers, you don’t want to get into trying to replace those characters with UTF-8 correct ‘smart’ quotes. you just want to remove them.
http://www.fourmilab.ch/webtools/demoroniser/ probably includes all the character codes you need — you could probably just strip out part of the code.
5 September 2008 at 2:52 pm |
and PLEASE turn off the snap preview. there’s an option in wordpress. others have made the argument already… just google ‘snap preview dumb’, ‘snap preview irritating’, http://snapsucks.org/, etc.
[ feel free to delete this comment... after you've removed snap ]
5 September 2008 at 3:00 pm |
Hmm, spoke too soon about the quotes being from M$ Word. Missed the part about copying and pasting from the web… it used to be just a problem with Word, i guess now it’s.
I think the fix still applies; they’re still unencoded 8-bit characters. Of course, you could use a library to smarten the characters again on the way out (to HTML character entities, &lsquo; and the like).
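For what it's worth, here is a minimal sketch of the "smarten on the way out" idea in Perl, without any extra library. It assumes the text has already been decoded into a Perl character string, and both the mapping table `%entity_for` and the helper `quotes_to_entities` are made-up names for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical mapping from typographic characters to named HTML entities.
my %entity_for = (
    "\x{2018}" => '&lsquo;',    # left single quote
    "\x{2019}" => '&rsquo;',    # right single quote / apostrophe
    "\x{201C}" => '&ldquo;',    # left double quote
    "\x{201D}" => '&rdquo;',    # right double quote
    "\x{2013}" => '&ndash;',    # en dash
    "\x{2014}" => '&mdash;',    # em dash
    "\x{2026}" => '&hellip;',   # ellipsis
);

# Hypothetical helper: expects a decoded (character) string and
# returns the text with the characters above replaced by entities.
sub quotes_to_entities {
    my ($text) = @_;
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}\x{2014}\x{2026}])/$entity_for{$1}/g;
    return $text;
}

print quotes_to_entities(
    "Don\x{2019}t Miss Labor Day\x{2019}s \x{2018}Escape to the Wild\x{2019} Marathon"
), "\n";
```

The output is plain ASCII, so it should survive whatever character set the browser or database ends up using. Any other non-ASCII characters are passed through untouched.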
5 September 2008 at 3:20 pm |
No, David, your initial instinct is correct, it is MS Word that’s the problem — it’s just that someone cut from MS Word and pasted into a Web page, and the Web page is probably using one of the Windows character encodings that preserves Word’s junk characters.
It looks fine on the Web page, but when you try to pipe it into a system that doesn’t grok Microspeak (preferring instead a sensible system like UTF-8), you get garbage.
That’s why I recommended Tidy, it should be able to strip out all the MS-specific characters and replace them with UTF-8 equivalents.
5 September 2008 at 3:34 pm |
If you need to improve your website, I am asbolutely in
5 September 2008 at 3:34 pm |
Typed fast. absolutely I meant!
5 September 2008 at 3:41 pm |
Have you tried using mysql_real_escape_string before inserting into the database? That is:
$headline = $_POST['headline'];
$headline = mysql_real_escape_string($headline, $db);
mysql_query("INSERT INTO articles(headline) VALUES ('$headline')", $db);
Good luck!
5 September 2008 at 5:29 pm |
The easiest solution is to copy the article into Notepad – a simple text editor – first. When you copy it from Notepad it will be in the most simple character set.
5 September 2008 at 6:15 pm |
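MS smart quotes are, IIRC, \x93 and \x94, so you just do something like:
$string =~ s/\x93/"/g;
But there’s a lot of pollution around besides the smart quotes, like ellipses and em-dashes and who knows what. So you want something more like:
$s =~ s/\x82/,/g;      # low single quote
$s =~ s-\x83-f-g;      # florin sign
$s =~ s/\x84/,,/g;     # low double quote
$s =~ s/\x85/.../g;    # ellipsis
$s =~ s/\x88/^/g;      # circumflex
$s =~ s-\x89- 0/00-g;  # per mille sign, approximated in ASCII
$s =~ s/\x8B/</g;      # left single angle quote
$s =~ s/\x8C/Oe/g;     # OE ligature
$s =~ s/\x91/`/g;      # left single quote
$s =~ s/\x92/'/g;      # right single quote / apostrophe
$s =~ s/\x93/"/g;      # left double quote
$s =~ s/\x94/"/g;      # right double quote
$s =~ s/\x95/*/g;      # bullet
$s =~ s/\x96/-/g;      # en dash
$s =~ s/\x97/--/g;     # em dash
$s =~ s-\x98-~-g;      # small tilde
$s =~ s-\x99-TM-g;     # trademark sign
$s =~ s/\x9B/>/g;      # right single angle quote
$s =~ s/\x9C/oe/g;     # oe ligature
from http://www.fourmilab.ch/webtools/demoroniser/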
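If you would rather keep the curly quotes than flatten them to ASCII, a rough alternative is to let Perl's core Encode module do the translation. This is only a sketch, and it assumes the incoming bytes really are Windows-1252; the helper name `cp1252_to_utf8` is invented for the example. If the input is already UTF-8, running it through this will make things worse, not better.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw bytes ASSUMED to be Windows-1252
# and returns the same text as UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);   # e.g. \x92 becomes U+2019
    return encode('UTF-8', $chars);         # e.g. U+2019 becomes E2 80 99
}

my $headline = "Don\x92t Miss Labor Day\x92s \x91Escape to the Wild\x92 Marathon";
print cp1252_to_utf8($headline), "\n";
```

The design choice here is to convert once, at the point of entry, so everything stored in the database is UTF-8 and no per-character substitution table has to be maintained.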
7 September 2008 at 1:27 am |
Oh, I misread the same way as david.
The AFL page is utf8, but somewhere along the line you’re getting a multibyte character sequence for a curly quote translated into an 8 bit character set, and you end up with three unicode characters for every high order utf8 character you started with.
You can discover the hex codes for the resulting putrefied gobbledygook with
unpack("H*", $garbled_character)
Then build a regex replacement sequence from those, if you want to bruteforce it. When I cut and paste that curly quote (utf8 e28099) into a Western Mac ISO file and convert back to utf8 I end up with this sequence:
/\x{c3}\x{a2}\x{c2}\x{99}/
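To see the effect for yourself, something like the following should work. It is a sketch that simulates one particular mix-up (UTF-8 bytes misread as ISO-8859-1 and then encoded to UTF-8 again) using the core Encode module. The helpers `hex_of` and `garble` are hypothetical names, and the exact bytes you get in practice will depend on which 8-bit character set was involved in your case.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex dump of a byte string.
sub hex_of {
    my ($bytes) = @_;
    return unpack("H*", $bytes);
}

# Hypothetical helper: simulate the mistake by treating UTF-8 bytes as
# ISO-8859-1 characters and then encoding those characters as UTF-8 again.
sub garble {
    my ($utf8_bytes) = @_;
    return encode('UTF-8', decode('iso-8859-1', $utf8_bytes));
}

my $curly_quote = "\xE2\x80\x99";              # UTF-8 bytes for U+2019
print hex_of($curly_quote), "\n";              # the clean three-byte sequence
print hex_of(garble($curly_quote)), "\n";      # the longer, double-encoded sequence
```

Comparing the two dumps for the characters that break on your page should tell you which conversion is happening, and from there which step to fix.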
It would be better to just find out which step is not in UTF-8 and fix that. In Firefox you can go to the View->Text Encoding menu and check whether it’s defaulting to Latin-1 when it should be UTF-8. In the Perl code you can check whether the text you’re getting from the form is UTF-8. It’s also possible that DBI is munging it, or that your users don’t have UTF-8 compatible tools.
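For the check in the Perl code, here is a minimal sketch using the core Encode module. The helper name `is_valid_utf8` is made up, and it assumes you have the raw, undecoded bytes from the form. Note that a pass only means the bytes are well-formed UTF-8; text that has already been double-encoded will still pass.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $headline ("Don\xE2\x80\x99t Miss", "Don\x92t Miss") {
    printf "%s: %s\n", unpack("H*", $headline),
        is_valid_utf8($headline) ? "valid UTF-8" : "NOT valid UTF-8";
}
```

A headline that fails this check at the point of entry is a strong hint that the browser or the pasting tool is sending an 8-bit character set rather than UTF-8.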