php - The actual Unicode characters are automatically converted to numeric values using DOMDocument->saveHTML()


I am using the following function to get the inner HTML of an HTML string:

function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child) {
        $tmp_dom = new DOMDocument('1.0', 'UTF-8');
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}

My HTML string also contains Unicode characters. Here is an example of the HTML string:

$html = '<div>thats true. yes defined آپ مجھے تم کہہ کر پکاریں</div>'; 

When I use the above function

$output = DOMinnerHTML($html);

the output is as below:

$output = '<div>thats true. yes defined  &#1705;&#1746;&#1748;&#1587;&#1604;&#1591;&#1575</div>'; 

The actual Unicode characters are converted to their numeric values.

I have debugged the code and found that, in the DOMinnerHTML function, before the following line

$innerHTML .= trim($tmp_dom->saveHTML());

if I echo

echo $tmp_dom->textContent;

it shows the actual Unicode characters, but after saving to $innerHTML it outputs the numeric symbols. Why is it doing that?

Note: please don't suggest functions like html_entity_decode to convert the numeric symbols back to real Unicode characters, because I also have user-formatted data in my HTML string that I don't want to convert.

Note: I have also tried putting

<meta http-equiv="content-type" content="text/html; charset=utf-8"> 

before my HTML string, but it made no difference.

Good question, and you did an excellent job narrowing the problem down to the single line of code that caused things to go haywire! That allowed me to figure out what is going wrong.

The problem is with DOMDocument's saveHTML() function. It is doing exactly what it is supposed to do, but its design is not what you wanted.

saveHTML() converts the document into a string "using HTML formatting", which means that it does HTML entity encoding for you! Sadly, that is not what you wanted. Comments in the PHP docs also indicate that DOMDocument does not handle UTF-8 especially well and does not do very well with fragments (as it automatically adds html, doctype, etc.).

Check out this comment for a proposed solution using another class: alternative to DOMDocument
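One way to sidestep the entity encoding without html_entity_decode is to pass each child node to saveHTML() directly (supported since PHP 5.3.6) instead of copying it into a temporary document. This is a minimal sketch, not the asker's code: the helper name innerHtml is hypothetical, the XML-declaration prefix is a commonly used workaround to make loadHTML() read the input as UTF-8, and the raw-character output is the behavior commonly observed with node-level saveHTML(), which may vary between libxml builds.

```php
<?php
// Hypothetical helper: serialize each child node through its owner document
// rather than importing it into a fresh DOMDocument.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) dumps only that node; in common libxml builds it
        // leaves non-ASCII characters as they are.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$source = '<div>thats true. yes defined آپ مجھے تم کہہ کر پکاریں</div>';

$dom = new DOMDocument();
// Assumption: the prefix below hints the input encoding to libxml, which
// would otherwise treat the bytes as ISO-8859-1.
$dom->loadHTML('<?xml encoding="UTF-8">' . $source);

$div = $dom->getElementsByTagName('div')->item(0);
$inner = innerHtml($div);
echo $inner, PHP_EOL;
```

Because no decoding step runs afterwards, any entities the user typed into the markup on purpose are not touched by a separate conversion pass.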

After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <body>, and DOCTYPE, I decided that a better solution was needed.

So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/

Currently, the main highlights are:

  • SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

  • saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off). Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body>, and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem). SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding the extra garbage that DOMDocument does.

  • Encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output. SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.
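The quoted description says there are no flags to stop the automatic wrapper tags. On PHP 5.4 or later built against libxml 2.7.8 or later, the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options to loadHTML() cover much of the same ground without a third-party class. The sketch below is one possible approach and is not part of SmartDOMDocument: the variable names are made up, the XML-declaration prefix is a widely used workaround for the UTF-8 input problem, and it assumes the fragment has a single root element, since LIBXML_HTML_NOIMPLIED is known to behave oddly otherwise.

```php
<?php
$fragment = '<div>thats true. yes defined آپ مجھے تم کہہ کر پکاریں</div>';

$doc = new DOMDocument();
$doc->loadHTML(
    // Assumption: the prefix tells libxml the input is UTF-8.
    '<?xml encoding="UTF-8">' . $fragment,
    // Do not add implied <html>/<body> wrappers or a default DOCTYPE.
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Serializing the root element (rather than the whole document) skips the
// encoding prefix and avoids the document-level entity conversion.
$saved = $doc->saveHTML($doc->documentElement);
echo $saved, PHP_EOL;
```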

