Not correctly recognizing all .docx files #103

gpeddle-teal · 2017-09-08T22:02:40Z

When I extract the file type for the sample .docx file in the unit test, it works fine for me. However, when I try extracting the file type for a test file I created on my Mac, the library thinks it's a zip file.

UofTCSCoop.docx

kevva · 2017-09-11T10:45:29Z

In which program did you create the file? This check only recognizes files created in Microsoft Office.

kevva · 2017-09-11T10:56:51Z

Looks like it works in file though:

❯ file UofTCSCoop.docx
UofTCSCoop.docx: Microsoft Word 2007+

It fails on https://github.com/sindresorhus/file-type/blob/master/index.js#L126. @forivall do you have any idea why?

forivall · 2017-09-12T00:08:58Z

The first file in the zip archive is word/numbering.xml, and the detection here relies on the first entry being [Content_Types].xml or _rels/.rels. I think the comments that I followed along in https://github.com/file/file/blob/master/magic/Magdir/msooxml are out of date and incorrect, so I think file is actually scanning the whole file for the [Content_Types].xml or _rels/.rels file entry. Not sure.
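To make the first-entry check concrete, here is a minimal sketch of reading the first entry's name from a ZIP local file header. It is not the actual file-type code; the function names `firstZipEntryName` and `looksLikeOoxmlByFirstEntry` are made up for illustration. The offsets come from the ZIP format: the name length sits at byte 26 and the name itself starts at byte 30 (0x1E).

```javascript
// Illustrative sketch only; function names are hypothetical, not file-type's API.
// ZIP local file header (little-endian):
//   0  signature "PK\x03\x04" (0x04034b50)
//   26 file name length (2 bytes)
//   28 extra field length (2 bytes)
//   30 file name bytes (offset 0x1E)
function firstZipEntryName(buf) {
  if (buf.length < 30 || buf.readUInt32LE(0) !== 0x04034b50) {
    return null; // does not start with a local file header
  }
  const nameLength = buf.readUInt16LE(26);
  return buf.toString('utf8', 30, 30 + nameLength);
}

// A check that only accepts the two marker names as the FIRST entry.
function looksLikeOoxmlByFirstEntry(buf) {
  const name = firstZipEntryName(buf);
  return name === '[Content_Types].xml' || name === '_rels/.rels';
}
```

With a check of this shape, an archive whose first entry is `word/numbering.xml` is rejected even though both marker entries appear later in the archive.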

To see for yourself, use zipinfo to compare your docx with the fixture:

» zipinfo UofTCSCoop.docx
Archive:  UofTCSCoop.docx
Zip file size: 5897 bytes, number of entries: 8
-rw----     2.0 fat     1211 bl defN 17-Jan-30 15:28 word/numbering.xml
-rw----     2.0 fat     1416 bl defN 17-Jan-30 15:28 word/settings.xml
-rw----     2.0 fat     1240 bl defN 17-Jan-30 15:28 word/fontTable.xml
-rw----     2.0 fat     4999 bl defN 17-Jan-30 15:28 word/styles.xml
-rw----     2.0 fat     8006 bl defN 17-Jan-30 15:28 word/document.xml
-rw----     2.0 fat      680 bl defN 17-Jan-30 15:28 word/_rels/document.xml.rels
-rw----     2.0 fat      298 bl defN 17-Jan-30 15:28 _rels/.rels
-rw----     2.0 fat      954 bl defN 17-Jan-30 15:28 [Content_Types].xml
8 files, 18804 bytes uncompressed, 4853 bytes compressed:  74.2%

~pubrepos/file-type/fixture ooxml-redux* ⇣
» zipinfo fixture.docx
Archive:  fixture.docx
Zip file size: 22873 bytes, number of entries: 12
-rw----     4.5 fat     1364 b- defS 80-Jan-01 00:00 [Content_Types].xml
-rw----     4.5 fat      735 b- defS 80-Jan-01 00:00 _rels/.rels
-rw----     4.5 fat      817 b- defS 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw----     4.5 fat     1797 b- defS 80-Jan-01 00:00 word/document.xml
-rw----     4.5 fat     6797 b- defS 80-Jan-01 00:00 word/theme/theme1.xml
-rw----     4.5 fat    11340 b- stor 80-Jan-01 00:00 docProps/thumbnail.jpeg
-rw----     4.5 fat     2525 b- defS 80-Jan-01 00:00 word/settings.xml
-rw----     4.5 fat     1525 b- defS 80-Jan-01 00:00 word/fontTable.xml
-rw----     4.5 fat      529 b- defS 80-Jan-01 00:00 word/webSettings.xml
-rw----     4.5 fat      755 b- defS 80-Jan-01 00:00 docProps/core.xml
-rw----     4.5 fat    29435 b- defS 80-Jan-01 00:00 word/styles.xml
-rw----     4.5 fat      693 b- defS 80-Jan-01 00:00 docProps/app.xml
12 files, 58312 bytes uncompressed, 19663 bytes compressed:  66.3%

I'm guessing that the UofT doc was created with an older version of OpenOffice or LibreOffice that doesn't replicate the MS Office zip file order. Or no version of OpenOffice or LibreOffice does that. Or it came from some other generator.

Using pandoc to generate a docx yields a properly identifiable file. I don't have open/libreoffice on my current computer to test if that is the case.

gpeddle-teal · 2017-09-12T13:21:52Z

You are correct that I didn't create the file with a standard desktop version of Microsoft Word. It's been a while since the file was created. Unfortunately I don't remember the tool but most likely some sort of open office. I have a number of other files that exhibit the same problem though, I don't think it's unique to this particular file.

forivall · 2017-09-12T19:26:35Z

The main thing is that I'd like to learn how file actually detects it, since I'd like feature parity on that.

Otherwise, we could add alternate logic: if the first filename starts with word/, xl/, or ppt/, continue reading the file to see if a [Content_Types].xml or _rels/.rels entry exists.

However, we probably want to keep the limit of only needing 4100 bytes to detect.
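The alternate logic could be sketched roughly as follows, under some assumptions: `SAMPLE_SIZE`, `localEntryNames`, and `looksLikeOoxml` are hypothetical names, and the walk relies on the compressed size stored in each local header. When an entry uses a trailing data descriptor (general-purpose flag bit 3), that size is not reliable, so the sketch simply stops there.

```javascript
// Illustrative sketch only; names are hypothetical, not file-type's API.
const SAMPLE_SIZE = 4100; // the detection budget mentioned above

// Yield entry names by hopping from one local file header to the next,
// never looking past the first SAMPLE_SIZE bytes.
function* localEntryNames(buf) {
  const end = Math.min(buf.length, SAMPLE_SIZE);
  let offset = 0;
  while (offset + 30 <= end && buf.readUInt32LE(offset) === 0x04034b50) {
    const flags = buf.readUInt16LE(offset + 6);
    const compressedSize = buf.readUInt32LE(offset + 18);
    const nameLength = buf.readUInt16LE(offset + 26);
    const extraLength = buf.readUInt16LE(offset + 28);
    const nameEnd = offset + 30 + nameLength;
    if (nameEnd > end) return; // name runs past the sample
    yield buf.toString('utf8', offset + 30, nameEnd);
    // Bit 3 set: sizes live in a data descriptor after the data,
    // so the next header cannot be located from this one.
    if (flags & 0x08) return;
    offset = nameEnd + extraLength + compressedSize;
  }
}

function looksLikeOoxml(buf) {
  let first = true;
  for (const name of localEntryNames(buf)) {
    if (name === '[Content_Types].xml' || name === '_rels/.rels') return true;
    // Only keep scanning when the first entry hints at an Office document.
    if (first && !/^(word|xl|ppt)\//.test(name)) return false;
    first = false;
  }
  return false;
}
```

Whether the marker entry falls inside the 4100-byte window depends on the file. If the entries before it compress to more than the budget, this approach still reports a plain zip.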

reviewher · 2017-10-25T23:11:21Z

@forivall the magic doesn't do what the comments suggest.

# make sure the first file is correct
>0x1E		regex		\\[Content_Types\\]\\.xml|_rels/\\.rels

The regex actually tests the entire file starting from offset 0x1E, not just the first entry. To prove this, here's a plain ZIP file that breaks file: bad.zip

$ unzip -l bad.zip 
Archive:  bad.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1022  10-25-2017 18:59   word/t1.xml
        4  10-25-2017 18:58   word/t2.xml
        4  10-25-2017 18:58   word/t3.xml
        4  10-25-2017 19:04   word/t5.xml
        4  10-25-2017 19:05   WTF/_rels/.rels
---------                     -------
     1038                     5 files
$ xxd bad.zip | head
00000000: 504b 0304 0a00 0000 0800 7897 594b e1c1  PK........x.YK..
00000010: 2659 6b01 0000 fe03 0000 0b00 0000 776f  &Yk...........wo
00000020: 7264 2f74 312e 786d 6ceb 29e5 e267 6160  rd/t1.xml.)..ga`
00000030: 6068 658a 63c0 065c 9cc2 3dfd 4283 0db0  `he.c..\..=.B...
00000040: 4a42 4172 4662 9121 1e15 ce5c a8fc bcd2  JBArFb.!...\....
00000050: dcd4 a2cc 6443 8378 23ac eafd b898 50f8  ....dC.x#.....P.
00000060: b9a9 b9f9 f8ec f745 333f 293f 3f07 9f7a  .......E3?)??..z
00000070: 1f46 547e 4a62 492a 3ef5 2e1c a8fc b49c  .FT~JbI*>.......
00000080: fcc4 129c ae67 6070 4373 bfbf 8f2b 3ee3  .....g`pCs...+>.
00000090: 19dc d1dd 9f99 9758 5489 5bbd 139a fa9c  .......XT.[.....
$ file bad.zip 
bad.zip: Microsoft Word 2007+

This file is matched because the string _rels/.rels appears somewhere in the file, even though here it is prefixed with WTF/ and is not a top-level entry.
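The effect of an unanchored search can be reproduced with a small sketch. This is an approximation of the behaviour described here, not libmagic's implementation: `matchesUnanchoredRegex` is a hypothetical name, and the sketch ignores any search-length limits the real tool may apply.

```javascript
// Illustrative sketch only; approximates "regex from offset 0x1E onward".
function matchesUnanchoredRegex(buf) {
  // latin1 keeps a one-to-one byte-to-character mapping for binary data.
  const tail = buf.toString('latin1', 0x1e);
  return /\[Content_Types\]\.xml|_rels\/\.rels/.test(tail);
}

// A stand-in for an archive whose only ".rels" entry lives under WTF/.
const fakeArchive = Buffer.concat([
  Buffer.from('PK\x03\x04', 'latin1'),
  Buffer.alloc(26),               // rest of the 30-byte local header
  Buffer.from('word/t1.xml'),     // first entry name at offset 0x1E
  Buffer.alloc(64),               // placeholder for compressed data
  Buffer.from('WTF/_rels/.rels')  // a later, non-top-level entry name
]);

console.log(matchesUnanchoredRegex(fakeArchive)); // true
```

Because the pattern is not anchored to an entry boundary, any occurrence of the marker text is enough to match, including inside another path.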

@sindresorhus is it necessary to replicate the file Magic logic, or would it make more sense to jump to the central directory for zip files?
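For comparison, a central-directory approach might look like the sketch below. It assumes the whole archive (or at least its tail and the central directory) is available in memory, does not handle Zip64, and uses hypothetical names (`centralDirectoryNames`, `hasOoxmlMarker`). The record layouts follow the ZIP format: the End of Central Directory record is 22 bytes plus an optional comment, and each central directory header is 46 bytes followed by name, extra field, and comment.

```javascript
// Illustrative sketch only; names are hypothetical, not file-type's API.
function centralDirectoryNames(buf) {
  // Scan backwards for the End of Central Directory signature (0x06054b50).
  // It is followed by at most a 65535-byte archive comment.
  const minEocd = 22;
  const lowest = Math.max(0, buf.length - minEocd - 0xffff);
  let eocd = -1;
  for (let i = buf.length - minEocd; i >= lowest; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) {
      eocd = i;
      break;
    }
  }
  if (eocd < 0) return null;

  const totalEntries = buf.readUInt16LE(eocd + 10);
  let offset = buf.readUInt32LE(eocd + 16); // start of the central directory
  const names = [];
  for (let n = 0; n < totalEntries; n++) {
    if (offset + 46 > buf.length || buf.readUInt32LE(offset) !== 0x02014b50) {
      return null; // truncated or malformed central directory
    }
    const nameLength = buf.readUInt16LE(offset + 28);
    const extraLength = buf.readUInt16LE(offset + 30);
    const commentLength = buf.readUInt16LE(offset + 32);
    names.push(buf.toString('utf8', offset + 46, offset + 46 + nameLength));
    offset += 46 + nameLength + extraLength + commentLength;
  }
  return names;
}

// Exact name comparison, so "WTF/_rels/.rels" does not count as a marker.
function hasOoxmlMarker(buf) {
  const names = centralDirectoryNames(buf);
  return names !== null &&
    (names.includes('[Content_Types].xml') || names.includes('_rels/.rels'));
}
```

The trade-off is that the central directory sits at the end of the archive, so this cannot work from only the first few kilobytes of the file.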

andrei-yanovich · 2018-07-23T10:42:03Z

I see this issue with a docx exported from Google Docs. Version 8.1.0.

BradleyDHobbs · 2020-01-20T18:47:34Z

I am still getting this issue for Word files created in the Word for Office 365 desktop application.

Any ideas?

Borewit · 2020-01-20T18:52:07Z

@BradleyDHobbs, please open a new issue, and refer to this one.

In that issue, can you zip and attach a small sample file?
Please double check it does not contain any personal information.

tusbar mentioned this issue Sep 16, 2017

Complete API rewrite jdesboeufs/plunger#7

Merged

25 tasks

sindresorhus added bug help wanted labels Oct 10, 2017

reviewher mentioned this issue Jan 8, 2018

file-type treat libreoffice saved .pptx as .zip #119

Closed

sindresorhus mentioned this issue Jul 10, 2018

.xlsx from linux is accepting the file type as application/zip #155

Closed

jacor84 mentioned this issue Jul 25, 2018

Fix type detection for docx, xlsx and pptx #159

Merged

sindresorhus closed this as completed in #159 Aug 10, 2018

BradleyDHobbs mentioned this issue Jan 20, 2020

Docx files being detected as a zip #312

Closed

Borewit mentioned this issue Jan 20, 2020

End-Of-Stream error thrown parsing docx #313

Closed

npalmius mentioned this issue Mar 17, 2020

Word files (.docx) identified as application/zip #339

Closed
