
Not correctly recognizing all .docx files #103

Closed
gpeddle-teal opened this issue Sep 8, 2017 · 9 comments · Fixed by #159

Comments

@gpeddle-teal

When I extract the file type for the sample .docx file in the unit test, it works fine for me. However, when I try extracting the file type for a test file I created on my Mac, the library thinks it's a zip file.

UofTCSCoop.docx

@kevva
Contributor

kevva commented Sep 11, 2017

In which program did you create the file? This check only recognizes files created in Microsoft Office.

@kevva
Contributor

kevva commented Sep 11, 2017

Looks like it works in file though:

❯ file UofTCSCoop.docx
UofTCSCoop.docx: Microsoft Word 2007+

It fails on https://github.com/sindresorhus/file-type/blob/master/index.js#L126. @forivall do you have any idea why?

@forivall
Contributor

forivall commented Sep 12, 2017

The first file in the zip archive is word/numbering.xml, and the detection here relies on the first entry being [Content_Types].xml or _rels/.rels. I think the comments that I followed along in https://github.com/file/file/blob/master/magic/Magdir/msooxml are out of date and incorrect, so I think file is actually scanning the whole file for the [Content_Types].xml or _rels/.rels file entry. Not sure.

To see for yourself, use zipinfo to compare your docx with the fixture:

» zipinfo UofTCSCoop.docx
Archive:  UofTCSCoop.docx
Zip file size: 5897 bytes, number of entries: 8
-rw----     2.0 fat     1211 bl defN 17-Jan-30 15:28 word/numbering.xml
-rw----     2.0 fat     1416 bl defN 17-Jan-30 15:28 word/settings.xml
-rw----     2.0 fat     1240 bl defN 17-Jan-30 15:28 word/fontTable.xml
-rw----     2.0 fat     4999 bl defN 17-Jan-30 15:28 word/styles.xml
-rw----     2.0 fat     8006 bl defN 17-Jan-30 15:28 word/document.xml
-rw----     2.0 fat      680 bl defN 17-Jan-30 15:28 word/_rels/document.xml.rels
-rw----     2.0 fat      298 bl defN 17-Jan-30 15:28 _rels/.rels
-rw----     2.0 fat      954 bl defN 17-Jan-30 15:28 [Content_Types].xml
8 files, 18804 bytes uncompressed, 4853 bytes compressed:  74.2%

~pubrepos/file-type/fixture ooxml-redux* ⇣
» zipinfo fixture.docx
Archive:  fixture.docx
Zip file size: 22873 bytes, number of entries: 12
-rw----     4.5 fat     1364 b- defS 80-Jan-01 00:00 [Content_Types].xml
-rw----     4.5 fat      735 b- defS 80-Jan-01 00:00 _rels/.rels
-rw----     4.5 fat      817 b- defS 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw----     4.5 fat     1797 b- defS 80-Jan-01 00:00 word/document.xml
-rw----     4.5 fat     6797 b- defS 80-Jan-01 00:00 word/theme/theme1.xml
-rw----     4.5 fat    11340 b- stor 80-Jan-01 00:00 docProps/thumbnail.jpeg
-rw----     4.5 fat     2525 b- defS 80-Jan-01 00:00 word/settings.xml
-rw----     4.5 fat     1525 b- defS 80-Jan-01 00:00 word/fontTable.xml
-rw----     4.5 fat      529 b- defS 80-Jan-01 00:00 word/webSettings.xml
-rw----     4.5 fat      755 b- defS 80-Jan-01 00:00 docProps/core.xml
-rw----     4.5 fat    29435 b- defS 80-Jan-01 00:00 word/styles.xml
-rw----     4.5 fat      693 b- defS 80-Jan-01 00:00 docProps/app.xml
12 files, 58312 bytes uncompressed, 19663 bytes compressed:  66.3%

I'm guessing that the UofT doc was created with an older version of OpenOffice or LibreOffice that doesn't replicate the MS Office zip file order. Or perhaps no version of OpenOffice or LibreOffice does that, or the file came from some other generator.

Using pandoc to generate a docx yields a properly identifiable file. I don't have OpenOffice or LibreOffice on my current computer to test whether that is the case.
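
For readers following along, the first-entry check described above can be sketched roughly as follows. This is a minimal illustration in Node.js, not the actual code in index.js: `firstZipEntryName`, `looksLikeOoxmlByFirstEntry` and `fakeLocalHeader` are hypothetical names, and the offsets (name length at byte 26, name at byte 30) come from the ZIP local file header layout.

```javascript
// Hypothetical helpers for illustration; not part of file-type's API.

// Return the name stored in the first local file header, or undefined.
function firstZipEntryName(buf) {
  // A local file header starts with the signature "PK\x03\x04".
  if (buf.length < 30 || buf.readUInt32LE(0) !== 0x04034b50) {
    return undefined;
  }
  const nameLength = buf.readUInt16LE(26); // file name length field
  const nameEnd = 30 + nameLength;         // the name starts at offset 30 (0x1E)
  if (nameEnd > buf.length) {
    return undefined;
  }
  return buf.toString('utf8', 30, nameEnd);
}

// True only when the very first entry is one of the two OOXML marker files.
function looksLikeOoxmlByFirstEntry(buf) {
  const name = firstZipEntryName(buf);
  return name === '[Content_Types].xml' || name === '_rels/.rels';
}

// Build a bare local file header (no file data) purely for demonstration.
function fakeLocalHeader(name) {
  const header = Buffer.alloc(30);
  header.writeUInt32LE(0x04034b50, 0);
  header.writeUInt16LE(Buffer.byteLength(name), 26);
  return Buffer.concat([header, Buffer.from(name)]);
}

console.log(looksLikeOoxmlByFirstEntry(fakeLocalHeader('[Content_Types].xml'))); // true
console.log(looksLikeOoxmlByFirstEntry(fakeLocalHeader('word/numbering.xml')));  // false
```

The second call mirrors the UofTCSCoop.docx case: the archive is a valid ZIP, but its first entry is word/numbering.xml, so a check of this shape falls through to plain zip.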

@gpeddle-teal
Author

You are correct that I didn't create the file with a standard desktop version of Microsoft Word. It's been a while since the file was created, and unfortunately I don't remember the tool, but it was most likely some sort of OpenOffice. I have a number of other files that exhibit the same problem, though, so I don't think it's unique to this particular file.

@forivall
Contributor

The main thing is that I'd like to learn how file actually detects it, since I'd like feature parity on that.

Otherwise, we could add alternate logic: if the first filename starts with word/, xl/, or ppt/, continue reading the file to see whether [Content_Types].xml or _rels/.rels exists.

However, we probably want to keep the limit of only needing 4100 bytes to detect.
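
A rough sketch of that alternate logic, assuming the detector only ever sees the first 4100 bytes. Everything here is hypothetical (`detectOoxmlWithFallback`, `fakeEntry`, the constant names), and the substring search is a deliberate simplification: it would miss markers that sit past the 4100-byte window, and it cannot tell a real entry name from the same text embedded in a longer path.

```javascript
// Hypothetical sketch; names and structure are assumptions, not file-type code.
const SAMPLE_SIZE = 4100; // detection budget mentioned in the thread
const OOXML_DIRS = ['word/', 'xl/', 'ppt/'];
const MARKERS = ['[Content_Types].xml', '_rels/.rels'];

function detectOoxmlWithFallback(buf) {
  // Must start with a ZIP local file header ("PK\x03\x04").
  if (buf.length < 30 || buf.readUInt32LE(0) !== 0x04034b50) {
    return false;
  }
  const sample = buf.subarray(0, SAMPLE_SIZE);
  const nameLength = sample.readUInt16LE(26);
  const nameEnd = 30 + nameLength;
  const firstName = sample.toString('latin1', 30, nameEnd);

  // Fast path: the first entry is already one of the marker files.
  if (MARKERS.includes(firstName)) {
    return true;
  }
  // Only keep looking when the first entry sits in an OOXML-style directory.
  if (!OOXML_DIRS.some(dir => firstName.startsWith(dir))) {
    return false;
  }
  // Fallback: look for a marker name anywhere in the rest of the sample.
  return MARKERS.some(marker => sample.includes(marker, nameEnd));
}

// Demo-only helper: a local file header followed by an uncompressed payload.
function fakeEntry(name, payload = '') {
  const header = Buffer.alloc(30);
  header.writeUInt32LE(0x04034b50, 0);
  header.writeUInt16LE(Buffer.byteLength(name), 26);
  return Buffer.concat([header, Buffer.from(name), Buffer.from(payload)]);
}

const reordered = Buffer.concat([
  fakeEntry('word/numbering.xml', 'x'.repeat(100)),
  fakeEntry('[Content_Types].xml'),
]);
console.log(detectOoxmlWithFallback(reordered)); // true

const plainZip = Buffer.concat([fakeEntry('notes.txt'), fakeEntry('_rels/.rels')]);
console.log(detectOoxmlWithFallback(plainZip)); // false
```

The directory-prefix gate is what keeps ordinary zips from being scanned at all; whether that trade-off is acceptable depends on how often the marker entries land inside the sampled window in real files.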

@reviewher

@forivall the magic doesn't do what the comments suggest.

# make sure the first file is correct
>0x1E		regex		\\[Content_Types\\]\\.xml|_rels/\\.rels

The regex actually tests the entire file starting from offset 0x1E, not just the first entry. To prove this, here's a plain ZIP file that breaks file: bad.zip

$ unzip -l bad.zip 
Archive:  bad.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1022  10-25-2017 18:59   word/t1.xml
        4  10-25-2017 18:58   word/t2.xml
        4  10-25-2017 18:58   word/t3.xml
        4  10-25-2017 19:04   word/t5.xml
        4  10-25-2017 19:05   WTF/_rels/.rels
---------                     -------
     1038                     5 files
$ xxd bad.zip | head
00000000: 504b 0304 0a00 0000 0800 7897 594b e1c1  PK........x.YK..
00000010: 2659 6b01 0000 fe03 0000 0b00 0000 776f  &Yk...........wo
00000020: 7264 2f74 312e 786d 6ceb 29e5 e267 6160  rd/t1.xml.)..ga`
00000030: 6068 658a 63c0 065c 9cc2 3dfd 4283 0db0  `he.c..\..=.B...
00000040: 4a42 4172 4662 9121 1e15 ce5c a8fc bcd2  JBArFb.!...\....
00000050: dcd4 a2cc 6443 8378 23ac eafd b898 50f8  ....dC.x#.....P.
00000060: b9a9 b9f9 f8ec f745 333f 293f 3f07 9f7a  .......E3?)??..z
00000070: 1f46 547e 4a62 492a 3ef5 2e1c a8fc b49c  .FT~JbI*>.......
00000080: fcc4 129c ae67 6070 4373 bfbf 8f2b 3ee3  .....g`pCs...+>.
00000090: 19dc d1dd 9f99 9758 5489 5bbd 139a fa9c  .......XT.[.....
$ file bad.zip 
bad.zip: Microsoft Word 2007+

This file is matched because the string _rels/.rels appears in the file, even though it is prefixed with WTF/.
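
To make the difference concrete, here is a small Node.js illustration of the behaviour described above. It is not file's implementation; `unanchoredMatch`, `firstEntryMatch` and `zipEntry` are hypothetical names, and the buffer is a hand-built stand-in for bad.zip rather than the real attachment.

```javascript
// Hypothetical illustration of an unanchored search versus a first-entry check.
const OOXML_MARKER = /\[Content_Types\]\.xml|_rels\/\.rels/;

// Mimics the described magic rule: search everything from offset 0x1E onward.
function unanchoredMatch(buf) {
  if (buf.length < 4 || buf.readUInt32LE(0) !== 0x04034b50) {
    return false;
  }
  // latin1 maps every byte to exactly one character, so binary data is safe here.
  return OOXML_MARKER.test(buf.toString('latin1', 0x1e));
}

// What the magic file's comment claims to do: inspect only the first entry name.
function firstEntryMatch(buf) {
  if (buf.length < 30 || buf.readUInt32LE(0) !== 0x04034b50) {
    return false;
  }
  const nameLength = buf.readUInt16LE(26);
  const name = buf.toString('latin1', 0x1e, 0x1e + nameLength);
  return name === '[Content_Types].xml' || name === '_rels/.rels';
}

// Demo-only helper: a local file header followed by an uncompressed payload.
function zipEntry(name, payload) {
  const header = Buffer.alloc(30);
  header.writeUInt32LE(0x04034b50, 0);
  header.writeUInt16LE(Buffer.byteLength(name), 26);
  return Buffer.concat([header, Buffer.from(name), Buffer.from(payload)]);
}

// Same shape as bad.zip: an ordinary first entry, then WTF/_rels/.rels.
const badZip = Buffer.concat([
  zipEntry('word/t1.xml', 'data'),
  zipEntry('WTF/_rels/.rels', 'data'),
]);

console.log(unanchoredMatch(badZip)); // true
console.log(firstEntryMatch(badZip)); // false
```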

@sindresorhus is it necessary to replicate the file Magic logic, or would it make more sense to jump to the central directory for zip files?
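
For context, jumping to the central directory could look roughly like the sketch below. It is an assumption-laden outline rather than a proposal for the library's code: `listCentralDirectoryNames`, `hasOoxmlMarkers` and `fakeArchive` are hypothetical names, ZIP64 and multi-disk archives are ignored, and it needs the tail of the file, so it does not fit a detector that only sees the first bytes.

```javascript
// Hypothetical sketch of reading entry names from the ZIP central directory.
const EOCD_SIG = 0x06054b50; // end of central directory record
const CDH_SIG = 0x02014b50;  // central directory file header

function listCentralDirectoryNames(buf) {
  // The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes,
  // so search backwards from the end of the buffer for its signature.
  const minStart = Math.max(0, buf.length - 22 - 0xffff);
  let eocd = -1;
  for (let i = buf.length - 22; i >= minStart; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIG) {
      eocd = i;
      break;
    }
  }
  if (eocd === -1) {
    return [];
  }
  const totalEntries = buf.readUInt16LE(eocd + 10);
  let pos = buf.readUInt32LE(eocd + 16); // offset of the central directory
  const names = [];
  for (let n = 0; n < totalEntries; n++) {
    if (pos + 46 > buf.length || buf.readUInt32LE(pos) !== CDH_SIG) {
      break;
    }
    const nameLength = buf.readUInt16LE(pos + 28);
    const extraLength = buf.readUInt16LE(pos + 30);
    const commentLength = buf.readUInt16LE(pos + 32);
    names.push(buf.toString('utf8', pos + 46, pos + 46 + nameLength));
    pos += 46 + nameLength + extraLength + commentLength;
  }
  return names;
}

// Exact-name comparison, so WTF/_rels/.rels does not count as a marker.
function hasOoxmlMarkers(buf) {
  const names = listCentralDirectoryNames(buf);
  return names.includes('[Content_Types].xml') || names.includes('_rels/.rels');
}

// Demo-only helper: placeholder local data, a central directory, and an EOCD.
function fakeArchive(names) {
  const local = Buffer.from('local entries would go here');
  const cd = Buffer.concat(names.map(name => {
    const header = Buffer.alloc(46);
    header.writeUInt32LE(CDH_SIG, 0);
    header.writeUInt16LE(Buffer.byteLength(name), 28);
    return Buffer.concat([header, Buffer.from(name)]);
  }));
  const eocd = Buffer.alloc(22);
  eocd.writeUInt32LE(EOCD_SIG, 0);
  eocd.writeUInt16LE(names.length, 8);
  eocd.writeUInt16LE(names.length, 10);
  eocd.writeUInt32LE(cd.length, 12);
  eocd.writeUInt32LE(local.length, 16);
  return Buffer.concat([local, cd, eocd]);
}

console.log(hasOoxmlMarkers(fakeArchive(['word/numbering.xml', '_rels/.rels', '[Content_Types].xml']))); // true
console.log(hasOoxmlMarkers(fakeArchive(['word/t1.xml', 'WTF/_rels/.rels']))); // false
```

Because names are compared exactly, entry order no longer matters and the bad.zip style of false positive goes away, at the cost of reading from the end of the file.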

@andrei-yanovich

I see this issue with a .docx exported from Google Docs. Version 8.1.0.

@BradleyDHobbs

I am still getting this issue for word files created in Word for Office 365 desktop application.

image

Any ideas?

@Borewit
Collaborator

Borewit commented Jan 20, 2020

@BradleyDHobbs, please open a new issue, and refer to this one.

In that issue, can you zip and attach a small sample file?
Please double check it does not contain any personal information.
