Saturday, May 9, 2015

Encoding odd HTML entities '&lstroke;'

I am having problems with some odd HTML entities that come from an XML file I have to parse in PHP 5.6.

Some of the HTML entities are:

&lstroke;
n&acute;
a&hook;
e&hook;

The XML comes from CAB Abstracts (http://ift.tt/1doDHIq) and its header is:

<?xml version="1.0" encoding="ISO-8859-1"?>

However, I have tried several encodings without success. I have also tried writing the entities directly into HTML files from PHP 5.6 using html_entity_decode, like this:

$strings = array('&Sacute;wia&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan&acute;ska', 'Mie&hook;tkiewski');

foreach ($strings as $s) {
    foreach (array(
            'ISO-8859-1', 'ISO-8859-5', 'ISO-8859-15', 'UTF-8',
            'cp866', 'cp1251', 'cp1252', 'KOI8-R', 'BIG5', 'GB2312',
            'BIG5-HKSCS', 'Shift_JIS', 'EUC-JP', 'MacRoman', '') as $l) {
        print $l . ' ==> ';
        print html_entity_decode($s, ENT_COMPAT | ENT_QUOTES | ENT_XML1 | ENT_XHTML | ENT_HTML5, $l) . '<br>';
    }
}

None of these combinations produces the expected characters.

I would like to avoid any solution that involves parsing the XML file and replacing these entities with the right UTF-8 characters by hand. I cannot foresee when odd HTML entities like these will appear, and the files are relatively big.

The strings should look like this:

Świątek
Kiełkiewicz
Zagdańska 
Miętkiewski
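
For reference, the manual replacement I would like to avoid could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: decode_odd_entities is a hypothetical helper name, and the mapping is inferred from the four sample strings above, so it is not a complete list of the entities the source may contain. The file must be saved as UTF-8 for the literal characters to work.

```php
<?php
// Hypothetical helper (not part of PHP): replaces the non-standard entity
// sequences by hand, then lets html_entity_decode() handle the standard ones.
function decode_odd_entities($s)
{
    // Assumed mapping, inferred only from the sample strings in this post.
    $map = array(
        '&lstroke;' => 'ł',
        'a&hook;'   => 'ą',
        'e&hook;'   => 'ę',
        'n&acute;'  => 'ń',
    );

    // Replace the odd sequences first; otherwise html_entity_decode() would
    // turn '&acute;' into a standalone acute accent instead of forming 'ń'.
    $s = strtr($s, $map);

    // Standard names such as '&Sacute;' are covered by the HTML5 entity table.
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$strings = array('&Sacute;wia&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan&acute;ska', 'Mie&hook;tkiewski');

foreach ($strings as $s) {
    print decode_odd_entities($s) . "\n";
}
```

The drawback is exactly the one described above: any entity that is missing from the hand-written map passes through undecoded.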

So, the question is:

How can I decode these odd HTML entities to UTF-8 in PHP?
