
Repeated unescaping of HTML content leads to invalid HTML #5

@Joyfolk

$data = html_entity_decode($data);

This line produces invalid HTML for some documents (for example, /edsapi/rest/Retrieve?an=T115986&dbid=dmp) because the HTML content is decoded twice: &amp;lt; becomes < inside the HTML body.
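To make the double decoding concrete, here is a minimal sketch. The sample string is invented for illustration and is not taken from the actual API response; it only assumes that the payload carries text encoded twice, as described above.

```php
<?php
// Hypothetical sample: text meant to be displayed as the literal characters "<b>",
// stored double-encoded.
$stored = '&amp;lt;b&amp;gt;';

// First decode: the result is still safe to place inside an HTML body.
$first = html_entity_decode($stored);

// Second decode: the entities turn into real angle brackets, so a browser
// would now parse the text as markup instead of displaying it.
$second = html_entity_decode($first);

echo $first, "\n";
echo $second, "\n";
```

Each call to html_entity_decode() removes exactly one level of encoding, so calling it on content that has already been decoded once is what turns escaped text into live markup.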

There seems to be no reason to decode the HTML content here, since it is already decoded inside the SimpleXML object. The only thing left to decode is the content of the <ephtml> tags, which is double-encoded.
So this line should probably be something like:

$data = preg_replace_callback('/<ephtml>(.*?)<\/ephtml>/s', function ($escaped) {
    // Decode entities only inside <ephtml>...</ephtml>; the "s" modifier lets "." span newlines.
    return html_entity_decode($escaped[0]);
}, $data);
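A possible way to check the idea in isolation is sketched below. The helper name decodeEphtml and the sample payload are hypothetical, not part of the library or a real EDS API response. The sketch also assumes the "s" pattern modifier, so that <ephtml> content spanning several lines is still matched.

```php
<?php
// Hypothetical helper: decode entities only inside <ephtml>...</ephtml>
// and leave the rest of the document untouched.
function decodeEphtml(string $data): string
{
    $result = preg_replace_callback(
        '/<ephtml>(.*?)<\/ephtml>/s',
        function (array $m): string {
            // $m[1] is the text between the tags; decode it exactly once.
            return '<ephtml>' . html_entity_decode($m[1]) . '</ephtml>';
        },
        $data
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $data;
}

// Invented sample for illustration only.
$sample = "<title>A &amp;lt; B</title>\n"
        . "<ephtml>&lt;p&gt;A &amp;lt; B&lt;/p&gt;</ephtml>";

$result = decodeEphtml($sample);
echo $result, "\n";
```

In this sample the <title> element is left exactly as it was, while the <ephtml> body loses one level of encoding and still contains an escaped &lt; rather than a bare angle bracket.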
