I am having problems with some odd HTML entities that come from an XML file that I have to parse in PHP 5.6.
Some of the HTML entities are:
&lstroke;
n´
a&hook;
e&hook;
The XML comes from CAB Abstracts (http://ift.tt/1doDHIq) and its header is:
<?xml version="1.0" encoding="ISO-8859-1"?>
However, I have tried several encodings without success. I have also tried writing the strings directly into HTML files from PHP 5.6 using html_entity_decode, like this:
$strings = array('Świa&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan´ska', 'Mie&hook;tkiewski');
foreach ($strings as $s) {
    foreach (array(
        'ISO-8859-1', 'ISO-8859-5', 'ISO-8859-15', 'UTF-8',
        'cp866', 'cp1251', 'cp1252', 'KOI8-R', 'BIG5', 'GB2312',
        'BIG5-HKSCS', 'Shift_JIS', 'EUC-JP', 'MacRoman', '') as $l) {
        print $l . ' ==> ';
        print html_entity_decode($s, ENT_COMPAT | ENT_QUOTES | ENT_XML1 | ENT_XHTML | ENT_HTML5, $l) . '<br>';
    }
}
None of these encodings produces the expected characters.
I would like to avoid any solution that involves parsing the XML file and replacing these entities with the right UTF-8 characters. I cannot foresee when odd HTML entities like these will be included, and the files are relatively big.
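For reference, the kind of replacement I would like to avoid would look roughly like the sketch below. The $map table and the replace_odd_entities() helper are hypothetical names made up for this example, and the table covers only the four sequences listed above, which is exactly why it does not scale to entities I have not seen yet. It also assumes the PHP source file is saved as UTF-8.

```php
<?php
// Hypothetical hand-made table: each odd sequence, written exactly as it
// appears in the examples above, mapped to the UTF-8 character it stands for.
$map = array(
    'a&hook;'   => 'ą',
    'e&hook;'   => 'ę',
    '&lstroke;' => 'ł',
    'n´'        => 'ń',
);

// Hypothetical helper: strtr() with an array tries the longest keys first
// and does not rescan text it has already replaced.
function replace_odd_entities($s, array $map)
{
    return strtr($s, $map);
}

$strings = array('Świa&hook;tek', 'Kie&lstroke;kiewicz', 'Zagdan´ska', 'Mie&hook;tkiewski');
foreach ($strings as $s) {
    print replace_odd_entities($s, $map) . "\n";
}
```

Every new entity would need a new row in the table, so this only works for names that are known in advance.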
The strings should look like this:
Świątek
Kiełkiewicz
Zagdańska
Miętkiewski
So, the question is:
How can I decode these odd HTML entities to UTF-8 in PHP?