Write encoding of type DELTA_BINARY_PACKED corrupts file #190

Closed
jaredmessenger opened this issue Nov 26, 2019 · 1 comment

@jaredmessenger

Trying to use DELTA_BINARY_PACKED for timestamps corrupts the file.

This is the local flat example, modified to write a timestamp column:

type Delta struct {
	Timestamp int64 `parquet:"name=timestamp, type=TIMESTAMP_MILLIS, encoding=DELTA_BINARY_PACKED"`
	Value     string `parquet:"name=value, type=UTF8, encoding=DELTA_LENGTH_BYTE_ARRAY"`
}

func main() {
	var err error
	fw, err := NewLocalFile("flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	//write
	pw, err := writer.NewParquetWriter(fw, new(Delta), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 100
	for i := 0; i < num; i++ {
		ts := time.Now().UnixNano() / 1e6
		fmt.Println(ts)
		d := Delta{
			Timestamp: ts,
			Value: "SomeString",
		}
		if err = pw.Write(d); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()
}

Reading the file with parquet-tools:

./parquet-tools -cmd cat -file flat.parquet 
panic: runtime error: integer divide by zero

Reading the same file with Apache parquet-tools installed via Homebrew:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file
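
For readers unfamiliar with the encoding named in the title, here is a minimal sketch of the idea behind DELTA_BINARY_PACKED: store the first value, then the differences between neighbours, shifted by the smallest difference so they can be bit-packed. It is a simplification covering a single block, with no miniblocks, varints, or actual bit packing, and it is not parquet-go's implementation. The names `deltaSummary` and `summarize` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// deltaSummary holds the pieces a delta encoder would keep for one block.
// The type and its field names are hypothetical, for illustration only.
type deltaSummary struct {
	First    int64    // first value, stored as-is
	MinDelta int64    // smallest difference between neighbours
	Adjusted []uint64 // each difference minus MinDelta
	BitWidth int      // bits needed for the largest adjusted difference
}

// summarize computes the delta view of values for a single block.
func summarize(values []int64) deltaSummary {
	var s deltaSummary
	if len(values) == 0 {
		return s
	}
	s.First = values[0]
	if len(values) == 1 {
		return s
	}

	// Differences between consecutive values.
	deltas := make([]int64, len(values)-1)
	for i := 1; i < len(values); i++ {
		deltas[i-1] = values[i] - values[i-1]
	}

	// The smallest difference is subtracted from all of them.
	s.MinDelta = deltas[0]
	for _, d := range deltas[1:] {
		if d < s.MinDelta {
			s.MinDelta = d
		}
	}

	var maxAdj uint64
	s.Adjusted = make([]uint64, len(deltas))
	for i, d := range deltas {
		adj := uint64(d - s.MinDelta)
		s.Adjusted[i] = adj
		if adj > maxAdj {
			maxAdj = adj
		}
	}
	// bits.Len64(0) is 0, so identical inputs need zero bits per value.
	s.BitWidth = bits.Len64(maxAdj)
	return s
}

func main() {
	steady := summarize([]int64{1576549433357, 1576549433358, 1576549433360})
	fmt.Println(steady.First, steady.MinDelta, steady.Adjusted, steady.BitWidth)
	// prints: 1576549433357 1 [0 1] 1

	same := summarize([]int64{1576549433357, 1576549433357, 1576549433357})
	fmt.Println(same.MinDelta, same.Adjusted, same.BitWidth)
	// prints: 0 [0 0] 0
}
```

Millisecond timestamps taken in a tight loop are often identical, so every difference is zero and the bit width is 0. Whether that edge case is related to the panic above is not stated in this thread.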

@xitongsys (Owner)

Hi @jaredmessenger,
Sorry for the late response; I have been very busy recently.

This was a bug and I have fixed it. Please test again with the latest code on the master branch.
My test code:

package main

import (
	"log"
	"time"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/writer"
	"github.com/xitongsys/parquet-go/parquet"
)

type Delta struct {
	Timestamp int64 `parquet:"name=timestamp, type=INT64, encoding=DELTA_BINARY_PACKED"`
	Value     string `parquet:"name=value, type=UTF8, encoding=DELTA_LENGTH_BYTE_ARRAY"`
}

func main() {
	var err error
	fw, err := local.NewLocalFileWriter("flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	//write
	pw, err := writer.NewParquetWriter(fw, new(Delta), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 2 
	for i := 0; i < num; i++ {
		stu := Delta{
			Timestamp: time.Now().UnixNano() / 1e6,
			Value: "SomeString",
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	//read
	fr, err := local.NewLocalFileReader("flat.parquet")
	if err != nil {
		log.Println("Can't open file", err)
		return
	}

	pr, err := reader.NewParquetReader(fr, new(Delta), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	stus := make([]Delta, num)
	if err = pr.Read(&stus); err != nil {
		log.Println("Read error", err)
	}
	log.Println(stus)

	pr.ReadStop()
	fr.Close()

}

Result:

 go run b.go
2019/12/17 10:23:53 Write Finished
2019/12/17 10:23:53 [{1576549433357 SomeString} {1576549433357 SomeString}]

Using Apache parquet-tools.jar

 java -jar parquet-tools-1.10.1.jar cat .\flat.parquet

timestamp = 1576549433357
value = SomeString

timestamp = 1576549433357
value = SomeString
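
The test above checks the round trip by printing the rows. A small helper could make that comparison automatic when retesting with more rows. This is only a sketch: `firstMismatch` is a hypothetical helper, not part of parquet-go, and the `Delta` struct is repeated here without its parquet tags so the block compiles on its own.

```go
package main

import "fmt"

// Delta mirrors the struct from the test code; the parquet tags are
// omitted here because the comparison does not need them.
type Delta struct {
	Timestamp int64
	Value     string
}

// firstMismatch returns the index of the first row where want and got
// differ, or -1 when both slices hold the same rows in the same order.
// If one slice is a prefix of the other, the shorter length is returned.
func firstMismatch(want, got []Delta) int {
	n := len(want)
	if len(got) < n {
		n = len(got)
	}
	for i := 0; i < n; i++ {
		if want[i] != got[i] {
			return i
		}
	}
	if len(want) != len(got) {
		return n
	}
	return -1
}

func main() {
	written := []Delta{
		{Timestamp: 1576549433357, Value: "SomeString"},
		{Timestamp: 1576549433357, Value: "SomeString"},
	}
	// In a real test, readBack would be the slice filled by pr.Read.
	readBack := []Delta{
		{Timestamp: 1576549433357, Value: "SomeString"},
		{Timestamp: 1576549433357, Value: "SomeString"},
	}
	if i := firstMismatch(written, readBack); i >= 0 {
		fmt.Println("round trip differs at row", i)
		return
	}
	fmt.Println("round trip OK")
}
```

To use it with the test code, keep the rows passed to pw.Write in a slice and compare that slice with the one filled by pr.Read.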

zolstein pushed a commit to zolstein/parquet-go that referenced this issue Jun 23, 2023