Parsing MediaInfo fails on Chinese chars in XML #92

Closed
ghost opened this issue Dec 17, 2019 · 12 comments

ghost commented Dec 17, 2019

The XML below contains what look like Chinese characters inside the <Copyright> element. SimpleXML rejects them and the process crashes.

<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>

   ErrorException  : simplexml_load_string(): Entity: line 54: parser error : Char 0xFFFE out of allowed range

  at /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
    14|         if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
    15|             $xmlString = utf8_encode($xmlString);
    16|         }
    17|
  > 18|         $xml = simplexml_load_string($xmlString);
    19|         $json = json_encode($xml);
    20|
    21|         return json_decode($json, true);
    22|     }

  Exception trace:

  1   simplexml_load_string("<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="19.09">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Count_of_video_streams>1</Count_of_video_streams>
<Count_of_audio_streams>1</Count_of_audio_streams>
<Video_Format_List>VC-1</Video_Format_List>
<Video_Format_WithHint_List>VC-1 (WMV3)</Video_Format_WithHint_List>
<Codecs_Video>VC-1</Codecs_Video>
<Audio_Format_List>WMA</Audio_Format_List>
<Audio_Format_WithHint_List>WMA</Audio_Format_WithHint_List>
<Audio_codecs>WMA</Audio_codecs>
<Complete_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6/782247_ohrly-rh131aaso.wmv</Complete_name>
<Folder_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6</Folder_name>
<File_name_extension>782247_ohrly-rh131aaso.wmv</File_name_extension>
<File_name>782247_ohrly-rh131aaso</File_name>
<File_extension>wmv</File_extension>
<Format>Windows Media</Format>
<Format>Windows Media</Format>
<Format_Extensions_usually_used>asf dvr-ms wma wmv</Format_Extensions_usually_used>
<Commercial_name>Windows Media</Commercial_name>
<Internet_media_type>video/x-ms-wmv</Internet_media_type>
<File_size>760169</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742.4 KiB</File_size>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Overall_bit_rate>12435</Overall_bit_rate>
<Overall_bit_rate>12.4 kb/s</Overall_bit_rate>
<Maximum_Overall_bit_rate>5136894</Maximum_Overall_bit_rate>
<Maximum_Overall_bit_rate>5 137 kb/s</Maximum_Overall_bit_rate>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 FPS</Frame_rate>
<Frame_count>14657</Frame_count>
<HeaderSize>1046</HeaderSize>
<DataSize>759123</DataSize>
<Performer>Ron Harris</Performer>
<Encoded_date>UTC 2012-05-14 00:53:44.000</Encoded_date>
<File_last_modification_date>UTC 2019-12-17 17:20:55</File_last_modification_date>
<File_last_modification_date__local_>2019-12-17 18:20:55</File_last_modification_date__local_>
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
<Comment>HD Videos</Comment>
</track>
<track type="Video">
<Count>377</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Video</Kind_of_stream>
<Kind_of_stream>Video</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<ID>1</ID>
<Format>VC-1</Format>
<Format>VC-1</Format>
<Commercial_name>VC-1</Commercial_name>
<Format_profile>Main</Format_profile>
<Internet_media_type>video/vc1</Internet_media_type>
<Codec_ID>WMV3</Codec_ID>
<Codec_ID_Info>Windows Media Video 9</Codec_ID_Info>
<Codec_ID_Hint>WMV3</Codec_ID_Hint>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Video 9 - 2-pass VBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Bit_rate>5000000</Bit_rate>
<Bit_rate>5 000 kb/s</Bit_rate>
<Width>1920</Width>
<Width>1 920 pixels</Width>
<Height>1080</Height>
<Height>1 080 pixels</Height>
<Pixel_aspect_ratio>1.000</Pixel_aspect_ratio>
<Display_aspect_ratio>1.778</Display_aspect_ratio>
<Display_aspect_ratio>16:9</Display_aspect_ratio>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 (29970/1000) FPS</Frame_rate>
<FrameRate_Num>29970</FrameRate_Num>
<FrameRate_Den>1000</FrameRate_Den>
<Frame_count>14657</Frame_count>
<Color_space>YUV</Color_space>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Bit_depth>8</Bit_depth>
<Bit_depth>8 bits</Bit_depth>
<Scan_type>Progressive</Scan_type>
<Scan_type>Progressive</Scan_type>
<Compression_mode>Lossy</Compression_mode>
<Compression_mode>Lossy</Compression_mode>
<Bits__Pixel_Frame_>0.080</Bits__Pixel_Frame_>
<Stream_size>305660000</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>291.5 MiB</Stream_size>
</track>
<track type="Audio">
<Count>280</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Audio</Kind_of_stream>
<Kind_of_stream>Audio</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<ID>2</ID>
<Format>WMA</Format>
<Format>WMA</Format>
<Commercial_name>WMA</Commercial_name>
<Format_version>Version 2</Format_version>
<Codec_ID>161</Codec_ID>
<Codec_ID_Info>Windows Media Audio</Codec_ID_Info>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Audio 9 - 128 kbps, 44 kHz, stereo CBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09.056</Duration>
<Bit_rate>128000</Bit_rate>
<Bit_rate>128 kb/s</Bit_rate>
<Channel_s_>2</Channel_s_>
<Channel_s_>2 channels</Channel_s_>
<Sampling_rate>44100</Sampling_rate>
<Sampling_rate>44.1 kHz</Sampling_rate>
<Samples_count>21567370</Samples_count>
<Bit_depth>16</Bit_depth>
<Bit_depth>16 bits</Bit_depth>
<Stream_size>7824896</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7 MiB</Stream_size>
<Stream_size>7.5 MiB</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7.462 MiB</Stream_size>
</track>
</File>
</Mediainfo>

")
      /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18

  2   Mhor\MediaInfo\Parser\AbstractXmlOutputParser::transformXmlToArray("<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="19.09">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Count_of_video_streams>1</Count_of_video_streams>
<Count_of_audio_streams>1</Count_of_audio_streams>
<Video_Format_List>VC-1</Video_Format_List>
<Video_Format_WithHint_List>VC-1 (WMV3)</Video_Format_WithHint_List>
<Codecs_Video>VC-1</Codecs_Video>
<Audio_Format_List>WMA</Audio_Format_List>
<Audio_Format_WithHint_List>WMA</Audio_Format_WithHint_List>
<Audio_codecs>WMA</Audio_codecs>
<Complete_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6/782247_ohrly-rh131aaso.wmv</Complete_name>
<Folder_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6</Folder_name>
<File_name_extension>782247_ohrly-rh131aaso.wmv</File_name_extension>
<File_name>782247_ohrly-rh131aaso</File_name>
<File_extension>wmv</File_extension>
<Format>Windows Media</Format>
<Format>Windows Media</Format>
<Format_Extensions_usually_used>asf dvr-ms wma wmv</Format_Extensions_usually_used>
<Commercial_name>Windows Media</Commercial_name>
<Internet_media_type>video/x-ms-wmv</Internet_media_type>
<File_size>760169</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742.4 KiB</File_size>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Overall_bit_rate>12435</Overall_bit_rate>
<Overall_bit_rate>12.4 kb/s</Overall_bit_rate>
<Maximum_Overall_bit_rate>5136894</Maximum_Overall_bit_rate>
<Maximum_Overall_bit_rate>5 137 kb/s</Maximum_Overall_bit_rate>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 FPS</Frame_rate>
<Frame_count>14657</Frame_count>
<HeaderSize>1046</HeaderSize>
<DataSize>759123</DataSize>
<Performer>Ron Harris</Performer>
<Encoded_date>UTC 2012-05-14 00:53:44.000</Encoded_date>
<File_last_modification_date>UTC 2019-12-17 17:20:55</File_last_modification_date>
<File_last_modification_date__local_>2019-12-17 18:20:55</File_last_modification_date__local_>
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
<Comment>HD Videos</Comment>
</track>
<track type="Video">
<Count>377</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Video</Kind_of_stream>
<Kind_of_stream>Video</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<ID>1</ID>
<Format>VC-1</Format>
<Format>VC-1</Format>
<Commercial_name>VC-1</Commercial_name>
<Format_profile>Main</Format_profile>
<Internet_media_type>video/vc1</Internet_media_type>
<Codec_ID>WMV3</Codec_ID>
<Codec_ID_Info>Windows Media Video 9</Codec_ID_Info>
<Codec_ID_Hint>WMV3</Codec_ID_Hint>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Video 9 - 2-pass VBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Bit_rate>5000000</Bit_rate>
<Bit_rate>5 000 kb/s</Bit_rate>
<Width>1920</Width>
<Width>1 920 pixels</Width>
<Height>1080</Height>
<Height>1 080 pixels</Height>
<Pixel_aspect_ratio>1.000</Pixel_aspect_ratio>
<Display_aspect_ratio>1.778</Display_aspect_ratio>
<Display_aspect_ratio>16:9</Display_aspect_ratio>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 (29970/1000) FPS</Frame_rate>
<FrameRate_Num>29970</FrameRate_Num>
<FrameRate_Den>1000</FrameRate_Den>
<Frame_count>14657</Frame_count>
<Color_space>YUV</Color_space>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Bit_depth>8</Bit_depth>
<Bit_depth>8 bits</Bit_depth>
<Scan_type>Progressive</Scan_type>
<Scan_type>Progressive</Scan_type>
<Compression_mode>Lossy</Compression_mode>
<Compression_mode>Lossy</Compression_mode>
<Bits__Pixel_Frame_>0.080</Bits__Pixel_Frame_>
<Stream_size>305660000</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>291.5 MiB</Stream_size>
</track>
<track type="Audio">
<Count>280</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Audio</Kind_of_stream>
<Kind_of_stream>Audio</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<ID>2</ID>
<Format>WMA</Format>
<Format>WMA</Format>
<Commercial_name>WMA</Commercial_name>
<Format_version>Version 2</Format_version>
<Codec_ID>161</Codec_ID>
<Codec_ID_Info>Windows Media Audio</Codec_ID_Info>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Audio 9 - 128 kbps, 44 kHz, stereo CBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09.056</Duration>
<Bit_rate>128000</Bit_rate>
<Bit_rate>128 kb/s</Bit_rate>
<Channel_s_>2</Channel_s_>
<Channel_s_>2 channels</Channel_s_>
<Sampling_rate>44100</Sampling_rate>
<Sampling_rate>44.1 kHz</Sampling_rate>
<Samples_count>21567370</Samples_count>
<Bit_depth>16</Bit_depth>
<Bit_depth>16 bits</Bit_depth>
<Stream_size>7824896</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7 MiB</Stream_size>
<Stream_size>7.5 MiB</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7.462 MiB</Stream_size>
</track>
</File>
</Mediainfo>

")
      /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/MediaInfoOutputParser.php:22

  Please use the argument -v to see more details.
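
For context, the characters in the <Copyright> element look like byte-swapped UTF-16 rather than real Chinese text: 刀漀渀 is U+5200 U+6F00 U+6E00, which is "Ron" with the two bytes of each code unit reversed, and the leading character would then be U+FFFE, a byte-swapped byte order mark. U+FFFE is a noncharacter that XML 1.0 does not allow, which matches the parser error. The following is a minimal reproduction sketch, not code from php-mediainfo; the sample XML and variable names are made up for illustration. It also shows one way to collect the libxml errors instead of letting them surface as a PHP warning.

```php
<?php
// Illustrative input: "\xEF\xBF\xBE" is the UTF-8 encoding of U+FFFE.
$badXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        . "<track><Copyright>\xEF\xBF\xBE Ron Harris</Copyright></track>";

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$xml = simplexml_load_string($badXml);   // returns false when parsing fails
$errors = libxml_get_errors();           // array of LibXMLError objects

// Restore the previous libxml error-handling state.
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($xml === false) {
    foreach ($errors as $error) {
        // Print each parser message with the line it was reported on.
        echo trim($error->message), " (line {$error->line})\n";
    }
}
```

Buffering the errors this way only makes the failure easier to inspect; the document is still rejected, so the invalid bytes have to be removed or replaced before parsing.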
Owner

mhor commented Dec 18, 2019

@Fossil01 Thanks for reporting this issue. An old pull request proposed removing utf8_encode to solve a "bug".
Could you try removing this call and see if that solves the problem? If not, I will try to fix this issue this weekend.

mhor added the bug label Dec 18, 2019
Author

ghost commented Dec 18, 2019

@mhor Nope, the same thing happens if I remove those 3 lines.

Owner

mhor commented Dec 18, 2019

Thanks for your quick answer. So it's definitely related to the XML string returned by mediainfo.
This looks like an acceptable solution to me. I will try to implement it as soon as possible, but if you want, feel free to open a pull request with your solution; I will be happy to review it.

Author

ghost commented Dec 25, 2019

I'll have a crack at it after Christmas. Cheers.

Owner

mhor commented Jan 23, 2020

@Fossil01 Have you tested the fix I made (PR #93)?

Author

ghost commented Apr 12, 2020

Completely forgot about this. It seems to work now.

Author

ghost commented Oct 13, 2021

Looks like I am still having this issue.

ErrorException

  simplexml_load_string(): Entity: line 10: parser error : Char 0xFFFE out of allowed range

  at vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
    14|         if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
    15|             $xmlString = utf8_encode($xmlString);
    16|         }
    17|
  > 18|         $xml = simplexml_load_string($xmlString);
    19|         $json = json_encode($xml);
    20|
    21|         return json_decode($json, true);

Maybe we can use a function like this to strip out invalid chars:
https://stackoverflow.com/a/3466049

ghost reopened this Oct 13, 2021
Author

ghost commented Oct 13, 2021

Aha. It looks like aca1198 never made it into the master branch, and thus never into a release.

When I add these lines it seems to fix the issue too:

$xmlString = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $xmlString
);
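
As a rough sketch of how that replacement could be wrapped up and reused, the helper below applies the same pattern and annotates what each alternative matches. The function name `sanitizeXmlString` is hypothetical and is not part of php-mediainfo; the pattern and replacement bytes are the ones quoted above. The pattern works on raw bytes (no `/u` modifier), so it assumes the input is meant to be UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of php-mediainfo): replace byte sequences
 * that XML 1.0 forbids with U+FFFD, the Unicode replacement character.
 */
function sanitizeXmlString(string $xmlString): string
{
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F]' // C0 controls except tab, LF and CR
             . '|\xED[\xA0-\xBF].'             // UTF-8 encoded surrogates (U+D800..U+DFFF)
             . '|\xEF\xBF[\xBE\xBF]/';         // noncharacters U+FFFE and U+FFFF

    // "\xEF\xBF\xBD" is the UTF-8 encoding of U+FFFD.
    $clean = preg_replace($pattern, "\xEF\xBF\xBD", $xmlString);

    // preg_replace() returns null on failure; fall back to the untouched input.
    return $clean ?? $xmlString;
}

// Usage: the U+FFFE bytes are replaced, so SimpleXML can parse the string.
$xml = simplexml_load_string(sanitizeXmlString("<a>\xEF\xBF\xBE ok</a>"));
var_dump($xml !== false);
```

Replacing rather than deleting the offending bytes keeps a visible marker in the parsed value, which may make it easier to notice later that the source metadata was damaged.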

Author

ghost commented Oct 13, 2021

XML it currently fails on:

<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="20.03">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Complete_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151/160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</Complete_name>
<Folder_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151</Folder_name>
<File_name_extension>160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</File_name_extension>
<File_name>160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D]</File_name>
<File_extension>mkv</File_extension>
<File_size>1048394</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 023.8 KiB</File_size>
<Stream_size>1048394</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 023.8 KiB</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Proportion_of_this_stream>1.00000</Proportion_of_this_stream>
<File_last_modification_date>UTC 2021-10-13 08:47:31</File_last_modification_date>
<File_last_modification_date__local_>2021-10-13 10:47:31</File_last_modification_date__local_>
</track>
</File>
</Mediainfo>

Owner

mhor commented Oct 24, 2021

@Fossil01 Oops, I don't know why I never merged #93. I've opened a new PR (#128). Could you check whether it fixes the bug?

Author

ghost commented Oct 25, 2021

I'll have a look this week, thanks. In the meantime I manually edited the file in the vendor dir and added the preg_replace I pasted above as an ugly temp fix :-)

Owner

mhor commented Jun 20, 2023

Closed for now, due to inactivity.

mhor closed this as completed Jun 20, 2023