Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows} (see arrow#39399).
For each {engine, compression codec} pair:
- Engine: `pyarrow`, `fastparquet`
- Compression: `snappy`, `gzip`, `brotli`, `lz4`, `zstd`

`parquet-diff-test` writes a simple Parquet file:
import pandas as pd

# `engine` and `compression` take each value from the lists above
df = pd.DataFrame([{'a': 111}])
empty_df = df.iloc[:0]  # keep the schema but drop every row
out_dir = f'out/{engine}/{compression}'
parquet_path = f'{out_dir}/empty.parquet'
empty_df.to_parquet(parquet_path, engine=engine, compression=compression)
In the same directory, it also writes:
- `metadata.json`, which includes:
  - the `pyarrow.ParquetFile.metadata` dictionary
  - file size
  - file sha256 hash
- `xxd.txt`: ASCII representation of every byte in `empty.parquet`
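A minimal sketch of how the size, SHA-256 hash, and hex dump might be produced with the standard library alone. The helper names `describe_bytes`, `describe_file`, and `hex_dump` are hypothetical, and the dump only approximates the default `xxd` layout; the actual script may well shell out to `xxd` instead.

```python
import hashlib

def describe_bytes(data):
    """Return the size and SHA-256 hex digest of a byte string."""
    return {'size': len(data), 'sha256': hashlib.sha256(data).hexdigest()}

def describe_file(path):
    """Read a file and describe its contents (size and hash)."""
    with open(path, 'rb') as f:
        return describe_bytes(f.read())

def hex_dump(data, width=16):
    """Approximate xxd's default output: offset, hex byte pairs, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = chunk.hex()
        # Group the hex digits two bytes (four digits) at a time, like xxd.
        groups = ' '.join(hex_bytes[i:i + 4] for i in range(0, len(hex_bytes), 4))
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append(f'{offset:08x}: {groups:<39}  {text}')
    return '\n'.join(lines)

# First 16 bytes of a Parquet file, starting with the "PAR1" magic number.
sample = bytes.fromhex('504152311504150015284c1500150012')
print(hex_dump(sample))
```

Hashing the raw bytes makes a cross-OS mismatch easy to detect, and the hex dump makes it easy to see where the mismatch is.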
The `test.yml` workflow runs `parquet-diff-test` on Ubuntu, macOS, and Windows, and pushes the results of each to a branch.
Here are the `macos` and `windows` branches compared to `ubuntu`:
- ✅ In all cases, Parquet files generated by `fastparquet` are identical across OSes.
- 🤔 In many cases, those generated by `pyarrow` differ from each other.

`pyarrow` (✅ = identical to the Ubuntu file):

| | Ubuntu | Windows | macOS |
|---|---|---|---|
| brotli | ✅ | ✅ | ❌ |
| gzip | ✅ | ❌ | ❌ |
| lz4 | ✅ | ✅ | ❌ |
| snappy | ✅ | ✅ | ❌ |
| zstd | ✅ | ✅ | ❌ |

`fastparquet`:

| | Ubuntu | Windows | macOS |
|---|---|---|---|
| brotli | ✅ | ✅ | ✅ |
| gzip | ✅ | ✅ | ✅ |
| lz4 | ✅ | ✅ | ✅ |
| snappy | ✅ | ✅ | ✅ |
| zstd | ✅ | ✅ | ✅ |
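
Rows like the ones in these tables could be derived from the per-branch hashes along the following lines. This is only a sketch: `comparison_rows` is a hypothetical helper, `hashes` is a made-up mapping of OS to codec to digest (the values are placeholders, not real SHA-256 hashes), and `ubuntu` is assumed to be the baseline.

```python
def comparison_rows(hashes, baseline='ubuntu'):
    """Yield one table row per codec, marking each OS against the baseline hash."""
    oses = list(hashes)
    for codec in sorted(hashes[baseline]):
        expected = hashes[baseline][codec]
        marks = ['✅' if hashes[os_name].get(codec) == expected else '❌'
                 for os_name in oses]
        yield f'| {codec} | ' + ' | '.join(marks) + ' |'

# Placeholder digests for illustration only.
hashes = {
    'ubuntu': {'gzip': 'aaa', 'snappy': 'bbb'},
    'windows': {'gzip': 'ccc', 'snappy': 'bbb'},
    'macos': {'gzip': 'ddd', 'snappy': 'eee'},
}
rows = list(comparison_rows(hashes))
print('\n'.join(rows))
```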

Comparing `macos` to `ubuntu`:
- All `fastparquet` Parquet files are identical.
- All `pyarrow` Parquet files differ.

For example, here's the diff for {`pyarrow`, `snappy`}:
git diff ubuntu..macos -- out/pyarrow/snappy/xxd.txt
00000280: 7741 4141 4145 4141 6741 4367 4141 414e wAAAAEAAgACgAAAN
00000290: 7742 4141 4145 4141 4141 4151 4141 4141 wBAAAEAAAAAQAAAA
000002a0: 7741 4141 4149 4141 7741 4241 4149 4141 wAAAAIAAwABAAIAA
-000002b0: 6741 4141 4149 4141 4141 4541 4141 4141 gAAAAIAAAAEAAAAA
-000002c0: 5941 4141 4277 5957 356b 5958 4d41 414b YAAABwYW5kYXMAAK
-000002d0: 5942 4141 4237 496d 6c75 5a47 5634 5832 YBAAB7ImluZGV4X2
-000002e0: 4e76 6248 5674 626e 4d69 4f69 4262 6579 NvbHVtbnMiOiBbey
-000002f0: 4a72 6157 356b 496a 6f67 496e 4a68 626d JraW5kIjogInJhbm
-00000300: 646c 4969 7767 496d 3568 6257 5569 4f69 dlIiwgIm5hbWUiOi
-00000310: 4275 6457 7873 4c43 4169 6333 5268 636e BudWxsLCAic3Rhcn
-00000320: 5169 4f69 4177 4c43 4169 6333 5276 6343 QiOiAwLCAic3RvcC
-00000330: 4936 4944 4173 4943 4a7a 6447 5677 496a I6IDAsICJzdGVwIj
-00000340: 6f67 4d58 3164 4c43 4169 5932 3973 6457 ogMX1dLCAiY29sdW
-00000350: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69 1uX2luZGV4ZXMiOi
-00000360: 4262 6579 4a75 5957 316c 496a 6f67 626e BbeyJuYW1lIjogbn
-00000370: 5673 6243 7767 496d 5a70 5a57 786b 5832 VsbCwgImZpZWxkX2
-00000380: 3568 6257 5569 4f69 4275 6457 7873 4c43 5hbWUiOiBudWxsLC
-00000390: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
-000003a0: 5569 4f69 4169 6457 3570 5932 396b 5a53 UiOiAidW5pY29kZS
-000003b0: 4973 4943 4a75 6457 3177 6556 3930 6558 IsICJudW1weV90eX
-000003c0: 426c 496a 6f67 496d 3969 616d 566a 6443 BlIjogIm9iamVjdC
-000003d0: 4973 4943 4a74 5a58 5268 5a47 4630 5953 IsICJtZXRhZGF0YS
-000003e0: 4936 4948 7369 5a57 356a 6232 5270 626d I6IHsiZW5jb2Rpbm
-000003f0: 6369 4f69 4169 5656 5247 4c54 6769 6658 ciOiAiVVRGLTgifX
-00000400: 3164 4c43 4169 5932 3973 6457 3175 6379 1dLCAiY29sdW1ucy
-00000410: 4936 4946 7437 496d 3568 6257 5569 4f69 I6IFt7Im5hbWUiOi
-00000420: 4169 5953 4973 4943 4a6d 6157 5673 5a46 AiYSIsICJmaWVsZF
-00000430: 3975 5957 316c 496a 6f67 496d 4569 4c43 9uYW1lIjogImEiLC
-00000440: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
-00000450: 5569 4f69 4169 6157 3530 4e6a 5169 4c43 UiOiAiaW50NjQiLC
-00000460: 4169 626e 5674 6348 6c66 6448 6c77 5a53 AibnVtcHlfdHlwZS
-00000470: 4936 4943 4a70 626e 5132 4e43 4973 4943 I6ICJpbnQ2NCIsIC
-00000480: 4a74 5a58 5268 5a47 4630 5953 4936 4947 JtZXRhZGF0YSI6IG
-00000490: 3531 6247 7839 5853 7767 496d 4e79 5a57 51bGx9XSwgImNyZW
-000004a0: 4630 6233 4969 4f69 4237 496d 7870 596e F0b3IiOiB7ImxpYn
-000004b0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e JhcnkiOiAicHlhcn
-000004c0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157 JvdyIsICJ2ZXJzaW
-000004d0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69 9uIjogIjE0LjAuMi
-000004e0: 4a39 4c43 4169 6347 4675 5a47 467a 5833 J9LCAicGFuZGFzX3
-000004f0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69 ZlcnNpb24iOiAiMi
-00000500: 3478 4c6a 5169 6651 4141 4151 4141 4142 4xLjQifQAAAQAAAB
+000002b0: 6741 4141 4330 4151 4141 4241 4141 414b gAAAC0AQAABAAAAK
+000002c0: 5942 4141 4237 496d 6c75 5a47 5634 5832 YBAAB7ImluZGV4X2
+000002d0: 4e76 6248 5674 626e 4d69 4f69 4262 6579 NvbHVtbnMiOiBbey
+000002e0: 4a72 6157 356b 496a 6f67 496e 4a68 626d JraW5kIjogInJhbm
+000002f0: 646c 4969 7767 496d 3568 6257 5569 4f69 dlIiwgIm5hbWUiOi
+00000300: 4275 6457 7873 4c43 4169 6333 5268 636e BudWxsLCAic3Rhcn
+00000310: 5169 4f69 4177 4c43 4169 6333 5276 6343 QiOiAwLCAic3RvcC
+00000320: 4936 4944 4173 4943 4a7a 6447 5677 496a I6IDAsICJzdGVwIj
+00000330: 6f67 4d58 3164 4c43 4169 5932 3973 6457 ogMX1dLCAiY29sdW
+00000340: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69 1uX2luZGV4ZXMiOi
+00000350: 4262 6579 4a75 5957 316c 496a 6f67 626e BbeyJuYW1lIjogbn
+00000360: 5673 6243 7767 496d 5a70 5a57 786b 5832 VsbCwgImZpZWxkX2
+00000370: 3568 6257 5569 4f69 4275 6457 7873 4c43 5hbWUiOiBudWxsLC
+00000380: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
+00000390: 5569 4f69 4169 6457 3570 5932 396b 5a53 UiOiAidW5pY29kZS
+000003a0: 4973 4943 4a75 6457 3177 6556 3930 6558 IsICJudW1weV90eX
+000003b0: 426c 496a 6f67 496d 3969 616d 566a 6443 BlIjogIm9iamVjdC
+000003c0: 4973 4943 4a74 5a58 5268 5a47 4630 5953 IsICJtZXRhZGF0YS
+000003d0: 4936 4948 7369 5a57 356a 6232 5270 626d I6IHsiZW5jb2Rpbm
+000003e0: 6369 4f69 4169 5656 5247 4c54 6769 6658 ciOiAiVVRGLTgifX
+000003f0: 3164 4c43 4169 5932 3973 6457 3175 6379 1dLCAiY29sdW1ucy
+00000400: 4936 4946 7437 496d 3568 6257 5569 4f69 I6IFt7Im5hbWUiOi
+00000410: 4169 5953 4973 4943 4a6d 6157 5673 5a46 AiYSIsICJmaWVsZF
+00000420: 3975 5957 316c 496a 6f67 496d 4569 4c43 9uYW1lIjogImEiLC
+00000430: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
+00000440: 5569 4f69 4169 6157 3530 4e6a 5169 4c43 UiOiAiaW50NjQiLC
+00000450: 4169 626e 5674 6348 6c66 6448 6c77 5a53 AibnVtcHlfdHlwZS
+00000460: 4936 4943 4a70 626e 5132 4e43 4973 4943 I6ICJpbnQ2NCIsIC
+00000470: 4a74 5a58 5268 5a47 4630 5953 4936 4947 JtZXRhZGF0YSI6IG
+00000480: 3531 6247 7839 5853 7767 496d 4e79 5a57 51bGx9XSwgImNyZW
+00000490: 4630 6233 4969 4f69 4237 496d 7870 596e F0b3IiOiB7ImxpYn
+000004a0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e JhcnkiOiAicHlhcn
+000004b0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157 JvdyIsICJ2ZXJzaW
+000004c0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69 9uIjogIjE0LjAuMi
+000004d0: 4a39 4c43 4169 6347 4675 5a47 467a 5833 J9LCAicGFuZGFzX3
+000004e0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69 ZlcnNpb24iOiAiMi
+000004f0: 3478 4c6a 5169 6651 4141 4267 4141 4148 4xLjQifQAABgAAAH
+00000500: 4268 626d 5268 6377 4141 4151 4141 4142 BhbmRhcwAAAQAAAB
00000510: 5141 4141 4151 4142 5141 4341 4147 4141 QAAAAQABQACAAGAA
00000520: 6341 4441 4141 4142 4141 4541 4141 4141 cADAAAABAAEAAAAA
00000530: 4141 4151 4951 4141 4141 4841 4141 4141 AAAQIQAAAAHAAAAA
The `pyarrow` metadata is the same for both; I can't tell what explains the difference.
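
One way to dig further, offered as a hedged sketch rather than a diagnosis: the differing region is printable text that looks like base64 (pyarrow stores a base64-encoded Arrow schema under the `ARROW:schema` metadata key, and this is assumed to be that value). Decoding excerpts at each of the four possible base64 alignments finds the string `pandas` in both dumps: just before the JSON value in the Ubuntu dump, and just after it in the macOS dump. That would be consistent with the two builds laying out the same strings in a different order inside the serialized schema, though the excerpts alone do not prove it. The helper `find_in_base64` is hypothetical; the excerpts are copied from the ASCII column of the diff above.

```python
import base64

def find_in_base64(text, needle):
    """Decode `text` at each of the 4 possible base64 alignments.

    Returns a list of (alignment, position) pairs where `needle` occurs
    in the decoded bytes.
    """
    hits = []
    for shift in range(4):
        chunk = text[shift:]
        chunk = chunk[:len(chunk) - len(chunk) % 4]  # base64 decodes 4 characters at a time
        decoded = base64.b64decode(chunk)
        pos = decoded.find(needle)
        if pos != -1:
            hits.append((shift, pos))
    return hits

# ASCII column of rows 0x2b0 and 0x2c0 in the Ubuntu dump.
ubuntu_excerpt = 'gAAAAIAAAAEAAAAA' 'YAAABwYW5kYXMAAK'
# ASCII column of rows 0x4f0 and 0x500 in the macOS dump.
macos_excerpt = '4xLjQifQAABgAAAH' 'BhbmRhcwAAAQAAAB'

print(find_in_base64(ubuntu_excerpt, b'pandas'))
print(find_in_base64(macos_excerpt, b'pandas'))
```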

Comparing `windows` to `ubuntu`:
- All `fastparquet` Parquet files are identical.
- `pyarrow` Parquet files are identical except under the `gzip` codec, where one header byte differs.
git diff ubuntu..windows -- out/pyarrow/gzip/xxd.txt
00000000: 5041 5231 1504 1500 1528 4c15 0015 0012 PAR1.....(L.....
-00000010: 0000 1f8b 0800 0000 0000 0003 0300 0000 ................
+00000010: 0000 1f8b 0800 0000 0000 000a 0300 0000 ................
00000020: 0000 0000 0000 264c 1c15 0419 2500 0619 ......&L....%...
00000030: 1801 6115 0416 0016 1c16 4426 0026 0829 ..a.......D&.&.)
00000040: 1c15 0415 0015 0200 0000 1504 192c 3500 .............,5.
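
The differing byte (offset `0x1b`) lines up with the OS field of the gzip member header defined in RFC 1952: it is the tenth byte of the gzip stream that starts at offset `0x12`, after the magic number, compression method, flags, mtime, and extra-flags fields. The sketch below parses the ten header bytes copied from each dump. `parse_gzip_header` is a hypothetical helper. RFC 1952 assigns the value 3 to Unix; reading `0x0a` as the code a Windows build of the compression library writes is an assumption, not something the diff itself confirms.

```python
import struct

def parse_gzip_header(header):
    """Unpack the fixed 10-byte gzip member header (RFC 1952)."""
    magic, method, flags, mtime, extra_flags, os_code = struct.unpack(
        '<2sBBIBB', header[:10])
    if magic != b'\x1f\x8b':
        raise ValueError('not a gzip member')
    return {'method': method, 'flags': flags, 'mtime': mtime,
            'extra_flags': extra_flags, 'os': os_code}

# Bytes 0x12 through 0x1b of each dump.
ubuntu_header = bytes.fromhex('1f8b0800000000000003')
windows_header = bytes.fromhex('1f8b080000000000000a')

print(parse_gzip_header(ubuntu_header)['os'])
print(parse_gzip_header(windows_header)['os'])
```

If that reading is right, the compressed payload itself is the same on both platforms and only the header's platform tag differs.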
The discrepancy between macOS and Ubuntu has made some tests inconvenient; it would be nice to understand why it occurs.
Interestingly, I see the same macOS diffs when running `run.sh` in an `ubuntu` Docker image on a macOS host machine.