6. Escaping smart quotes

By labourstart

A lot of the websites we’re linking to are using smart quotes, both single and double.  For example, this story on the website of the AFL-CIO has several of them in the headline.  When one of our correspondents simply copies and pastes the headline, this is what appears:

Don’t Miss Labor Day’s ‘Escape to the Wild’ Marathon

But once it’s in our database, it’s being shown as:

Don’t Miss Labor Day’s ‘Escape to the Wild’ Marathon

What’s the best way to change all smart quotes, single and double, automatically into characters that will not break in our new database?  Because this should be done as the records are entered, it would need to be in Perl, which is what we’re currently using for database entry.

Thanks!

14 Responses to “6. Escaping smart quotes”

  1. Jeremy John Says:

    One important thing is that you need to make sure that your database is set to UTF-8 character encoding. We ran into this problem on the US LEAP site.

    If it is set to UTF-8, email me and I’ll try to remember the other things that went wrong and how to fix them.

    Because the links are being displayed in html by php print functions, the problem shouldn’t be an escape character, but rather a database encoding problem.

    These are painful… brace yourself!

  2. Ivan Ransom Says:

    You need to select the correct character set to display the message correctly.
    The Character Encoding option will do this. Most software has “Auto-detect”. Most documents used to be “Windows” (Microsoft code page 1252), but these days it’s Unicode (UTF-8).
    Without looking at the source code, my guess is that a switch to Unicode could fix the problem.

  3. labourstart Says:

    The page is Unicode and the database is Unicode. But the stuff being put up on websites like that of the AFL-CIO is not, and simply copying and pasting from their web page into our Unicode database is what’s causing the problem.

    What I was hoping someone would suggest is a couple of lines of Perl code to substitute Unicode-compliant characters for the ones that are breaking our page.

  4. Jason Lefkowitz Says:

    Could you run the input through HTML::Tidy and specify utf8 as the output encoding? My Perl is EXTREMELY rusty but Tidy is a good Swiss Army knife for these kinds of things…

  5. david reese Says:

    i think the problem is m$ word. its ‘smart quotes’, aren’t, when it comes to the web. especially when you’ve got a non-US audience with unknown character set/font abilities in their browsers, you don’t want to get into trying to replace those characters with UTF-8 correct ‘smart’ quotes. you just want to remove them.

    http://www.fourmilab.ch/webtools/demoroniser/ probably includes all the character codes you need — you could probably just strip out part of the code.

  6. david reese Says:

    and PLEASE turn off the snap preview. there’s an option in wordpress. others have made the argument already… just google ‘snap preview dumb’, ‘snap preview irritating’, http://snapsucks.org/, etc.

    [ feel free to delete this comment... after you've removed snap :) ]

  7. david reese Says:

    Hmm, spoke too soon about the quotes being from M$ Word. Missed the part about copying and pasting from the web… it used to be just a problem with Word, i guess now it’s.

    I think the fix still applies, they’re still unencoded 8-bit characters. of course, you could use a library to smarten the characters again on the way out (to html special characters, &lsquo; and the like).

  8. Jason Lefkowitz Says:

    No, David, your initial instinct is correct, it is MS Word that’s the problem — it’s just that someone cut from MS Word and pasted into a Web page, and the Web page is probably using one of the Windows character encodings that preserves Word’s junk characters.

    It looks fine on the Web page, but when you try to pipe it into a system that doesn’t grok Microspeak (preferring instead a sensible system like UTF-8), you get garbage.

    That’s why I recommended Tidy, it should be able to strip out all the MS-specific characters and replace them with UTF-8 equivalents.
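
For comparison with the Tidy suggestion, here is a minimal sketch of the same idea using Perl's core Encode module rather than HTML::Tidy. The helper name to_utf8_bytes is hypothetical, and the sketch assumes that any input which is not valid UTF-8 is Windows-1252; that guess may be wrong for other sources.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw bytes from a form and return UTF-8 bytes.
# Assumption: input that is not valid UTF-8 is Windows-1252 (cp1252).
sub to_utf8_bytes {
    my ($bytes) = @_;

    # decode() with a CHECK value may modify its input, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Strict UTF-8 decoding failed, so fall back to cp1252.
    $text = decode('cp1252', $bytes) unless defined $text;

    return encode('UTF-8', $text);
}
```

Input that is already UTF-8 passes through unchanged, so this could run on every headline before it is stored.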

  9. Nima Darabi Says:

    If you need to improve your website, I am asbolutely in :)

  10. Nima Darabi Says:

    Typed fast. absolutely I meant! ;)

  11. shai Says:

    Have you tried using mysql_real_escape_string before inserting into the database? That is:

    $headline = $_POST['headline'];
    $headline = mysql_real_escape_string($headline, $db);
    mysql_query("INSERT INTO articles(headline) VALUES ('$headline')", $db);

    Good luck!

  12. Liz Says:

    The easiest solution is to copy the article into Notepad – a simple text editor – first. When you copy it from Notepad it will be in the most simple character set.

  13. buermann Says:

    MS smartquotes are, IIRC, \x93 and 94, so you just do something like:

    $string =~ s/\x93/"/g;

    But there’s a lot of pollution around besides the smart quotes, like ellipses and em-dashes and who knows what. So you want something more like:

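    $s =~ s/\x82/,/g;      # single low-9 quote
    $s =~ s-\x83-f-g;      # florin (f with hook)
    $s =~ s/\x84/,,/g;     # double low-9 quote
    $s =~ s/\x85/.../g;    # ellipsis
    $s =~ s/\x88/^/g;      # circumflex
    $s =~ s-\x89- /-g;     # per-mille sign
    $s =~ s/\x8B/</g;      # single left angle quote
    $s =~ s/\x8C/Oe/g;     # OE ligature
    $s =~ s/\x91/`/g;      # left single quote
    $s =~ s/\x92/'/g;      # right single quote
    $s =~ s/\x93/"/g;      # left double quote
    $s =~ s/\x94/"/g;      # right double quote
    $s =~ s/\x95/*/g;      # bullet
    $s =~ s/\x96/-/g;      # en dash
    $s =~ s/\x97/--/g;     # em dash
    $s =~ s-\x98-~-g;      # small tilde
    $s =~ s-\x99-TM-g;     # trademark sign
    $s =~ s/\x9B/>/g;      # single right angle quote
    $s =~ s/\x9C/oe/g;     # oe ligature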

    from http://www.fourmilab.ch/webtools/demoroniser/
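
The substitutions above match single Windows-1252 bytes. If the pasted text arrives as UTF-8, which the original poster says is the case for the page and the database, the same idea can be applied to Unicode code points instead. This is only a sketch: the function name plain_punctuation and the choice of ASCII replacements are assumptions, not part of demoroniser.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Unicode punctuation mapped to plain ASCII; extend the table as needed.
my %ascii_for = (
    "\x{2018}" => "'",      # left single quote
    "\x{2019}" => "'",      # right single quote / apostrophe
    "\x{201A}" => ",",      # single low-9 quote
    "\x{201C}" => '"',      # left double quote
    "\x{201D}" => '"',      # right double quote
    "\x{201E}" => '"',      # double low-9 quote
    "\x{2013}" => "-",      # en dash
    "\x{2014}" => "--",     # em dash
    "\x{2026}" => "...",    # ellipsis
);

# Character class built from the table keys (none need escaping).
my $class = join '', keys %ascii_for;

# Hypothetical helper: UTF-8 bytes in, UTF-8 bytes out, curly punctuation flattened.
sub plain_punctuation {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);
    $text =~ s/([$class])/$ascii_for{$1}/g;
    return encode('UTF-8', $text);
}
```

Decoding first matters here: without it a regex sees three separate bytes per curly quote rather than one character.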

  14. buermann Says:

    Oh, I misread the same way as david.

    The AFL page is utf8, but somewhere along the line you’re getting a multibyte character sequence for a curly quote translated into an 8 bit character set, and you end up with three unicode characters for every high order utf8 character you started with.

    You can discover the hex codes for the resulting putrefied gobbledygook with

    unpack("H*", $garbled_character)

    Then build a regex replacement sequence from those, if you want to bruteforce it. When I cut and paste that curly quote (utf8 e28099) into a Western Mac ISO file and convert back to utf8 I end up with this sequence:

    /\x{c3}\x{a2}\x{c2}\x{99}/

    It would be better to just find out what step is not in utf8 and fix that. In Firefox you can try going to the View->Text Encoding menu, and check if it’s defaulting to Latin1 when it should be UTF8. In the perl code you can check if the text you’re getting from the form is UTF8. It’s also probably not impossible that DBI could be munging it up, or your users might not have utf8 compatible tools.
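
As a stopgap while hunting for the step that is not in UTF-8, the brute-force approach can be sketched with the core Encode module. The names hex_bytes and undo_double_encoding are hypothetical, and the sketch assumes the damage came from UTF-8 bytes being read as Windows-1252 and then re-encoded as UTF-8; other 8-bit character sets (such as the Mac one mentioned above) would produce different byte sequences.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Show the bytes of a suspicious string as lowercase hex.
sub hex_bytes { return unpack('H*', $_[0]) }

# Hypothetical repair for one round of double encoding.
# Assumption: UTF-8 was misread as Windows-1252, then saved as UTF-8 again.
# Returns the input unchanged if it does not fit that pattern.
sub undo_double_encoding {
    my ($bytes) = @_;
    my $fixed = eval {
        my $copy  = $bytes;    # CHECK values may modify the source
        my $text  = decode('UTF-8', $copy, Encode::FB_CROAK);
        my $inner = encode('cp1252', $text, Encode::FB_CROAK);
        my $probe = $inner;
        decode('UTF-8', $probe, Encode::FB_CROAK);    # dies unless valid UTF-8
        $inner;
    };
    return defined $fixed ? $fixed : $bytes;
}
```

The final validity check is what keeps correctly encoded text from being mangled: a genuine curly quote maps to a lone cp1252 byte, which is not valid UTF-8, so the original is returned.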
