DOMDocument::saveHTML() adds additional symbols before some UTF-8 characters in HTML #18878

Closed as not planned
@MurzNN

Description

The following code:

<?php
$html='<p>My Brand®</p>';
$dom = new \DOMDocument();
$dom->encoding = 'UTF-8';
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$result = $dom->saveHTML();
print($result);

Resulted in this output:

<p>My Brand&Acirc;&reg;</p>

But I expected this output instead:

<p>My Brand®</p>

Or at least this:

<p>My Brand&reg;</p>

The same issue occurs with other Unicode characters. More examples (input >>> output):

  • <p>My ☆ Brand</p> >>> <p>My &acirc;&#152;&#134; Brand</p>
  • <div>€ 100</div> >>> <div>&acirc;&#130;&not; 100</div>
  • <p>À bientôt!</p> >>> <p>&Atilde;&#128; bient&Atilde;&acute;t!</p>
  • <p>Hello 😕 there!</p> >>> <p>Hello &eth;&#159;&#152;&#149; there!</p>
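The mangled entities above are consistent with each byte of the UTF-8 input being read as a separate ISO-8859-1 character: "®" is the two bytes C2 AE in UTF-8, and in ISO-8859-1 those bytes are "Â" and "®". A minimal sketch that reproduces the pattern outside of DOMDocument, assuming the mbstring extension is available (`misread_as_latin1` is a hypothetical helper name, not part of PHP):

```php
<?php
// Requires the mbstring extension (an assumption about the environment).

/**
 * Hypothetical helper: treat each byte of a UTF-8 string as one
 * ISO-8859-1 character and return the result re-encoded as UTF-8.
 */
function misread_as_latin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

$reg = "\xC2\xAE";                  // "®" (U+00AE) encoded as UTF-8
echo bin2hex($reg), "\n";           // c2ae

$misread = misread_as_latin1($reg); // U+00C2 "Â" followed by U+00AE "®"
echo htmlentities($misread, ENT_QUOTES | ENT_HTML401, 'UTF-8'), "\n"; // &Acirc;&reg;
```

This only illustrates the byte-level pattern; it does not show what libxml2 does internally.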

Could you please explain why this happens and how to fix it, or suggest a workaround?
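One possible workaround with the legacy DOMDocument class is to escape every non-ASCII character as a numeric character reference before parsing, so the parser never has to guess the input encoding. This is a sketch under the assumption that mbstring is installed; `load_utf8_fragment` is a hypothetical helper name, and the output still contains entities rather than raw UTF-8 characters:

```php
<?php
// Requires the dom and mbstring extensions (assumptions about the environment).

/**
 * Hypothetical helper: turn every non-ASCII code point into a numeric
 * character reference, then parse the now pure-ASCII markup.
 */
function load_utf8_fragment(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    return $dom;
}

$dom = load_utf8_fragment("<p>My Brand\xC2\xAE</p>"); // "\xC2\xAE" is "®" in UTF-8
echo $dom->saveHTML();
```

Because the parser only ever sees ASCII, the stray "&Acirc;" should no longer appear in the result.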

Also, as far as I can see, saveHTML() forcibly converts all non-ASCII characters to HTML entities. Many people prefer to keep the original formatting (HTML entities stay entities, UTF-8 symbols stay symbols), so it would be good to fix this too.
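On PHP 8.4, the newer `Dom\HTMLDocument` class may be an option for keeping UTF-8 symbols as symbols. The sketch below assumes PHP 8.4 or later with the dom extension, and passes the encoding explicitly as the third argument of `createFromString()`. As far as I understand, entities in the input are still decoded during parsing, so this keeps literal characters literal but does not preserve entities as entities:

```php
<?php
// Requires PHP 8.4+ with the dom extension (an assumption about the environment).

use Dom\HTMLDocument;

$html = "<p>My Brand\xC2\xAE</p>"; // "\xC2\xAE" is "®" in UTF-8

// LIBXML_HTML_NOIMPLIED: do not wrap the fragment in <html>/<body>.
// LIBXML_NOERROR: silence parse warnings such as the missing doctype.
// Third argument: state the input encoding instead of letting it be sniffed.
$dom = HTMLDocument::createFromString(
    $html,
    LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR,
    'UTF-8'
);

echo $dom->saveHtml(), "\n";
```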

PHP Version

PHP 8.4.5 (cli) (built: Mar 17 2025 20:35:32) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.5, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.5, Copyright (c), by Zend Technologies

Operating System

Ubuntu 25.04
