Write performance #105

Closed · cbrake opened this issue Apr 24, 2020 · 9 comments

cbrake commented Apr 24, 2020

I've been using bolthold on an embedded Linux system (eMMC storage). I'm noticing that as the DB grows, the write performance falls off linearly.

[chart: bolthold sample count vs. insert time]

I'm using an increasing timestamp for the key, so I would expect sequential rather than random access.

Below is the insert code:

// Sample represents a value in time and should include data that may be
// graphed.
type Sample struct {
	// Type of sample (voltage, current, key, etc)
	Type string `json:"type,omitempty" boltholdIndex:"Type" influx:"type,tag"`

	// ID of the device that provided the sample
	ID string `json:"id,omitempty" influx:"id,tag"`

	// Average OR
	// Instantaneous analog or digital value of the sample.
	// 0 and 1 are used to represent digital values
	Value float64 `json:"value,omitempty" influx:"value"`

	// statistical values that may be calculated
	Min float64 `json:"min,omitempty" influx:"min"`
	Max float64 `json:"max,omitempty" influx:"max"`

	// Time the sample was taken
	Time time.Time `json:"time,omitempty" boltholdKey:"Time" gob:"-" influx:"time"`

	// Duration over which the sample was taken
	Duration time.Duration `json:"duration,omitempty" influx:"duration"`

	// Tags are additional attributes used to describe the sample
	// You might add things like friendly name, etc.
	Tags map[string]string `json:"tags,omitempty" influx:"-"`

	// Attributes are additional numerical values
	Attributes map[string]float64 `json:"attributes,omitempty" influx:"-"`
}

// DataMeta is used to store meta information about data in the database
type DataMeta struct {
	SampleCount int
}

// WriteSample writes a sample to the database
// Samples are flow, pressure, amount, etc.
func (db *IsDb) WriteSample(sample data.Sample) error {
	dataMeta := DataMeta{}
	err := db.store.Get(0, &dataMeta)
	if err != nil {
		// attempt to init metadata
		_, err = db.GetSampleCount()
		if err != nil {
			return err
		}
	}
	err = db.store.Insert(sample.Time, sample)
	if err != nil {
		return err
	}

	dataMeta.SampleCount++
	return db.store.Upsert(0, &dataMeta)
}

Once I get to 100,000 samples or so, the performance is really slow (2+ seconds to insert a sample). I'm thinking something is not quite right, as I read about people using multi-TB bolt databases, but it seems there is no way this could work for my use case.

I tried setting FreelistType to FreelistMapType -- that did not seem to make any difference.

I'd appreciate any thoughts: is this normal, or can it be optimized?

Cliff

timshannon self-assigned this Apr 24, 2020
timshannon (Owner) commented:

Do you get the same performance drop off if you don't use an index? If there is no index handling in the insert, then the performance should be the exact same as encoding time + normal bolt insert time.
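
For reference, a minimal sketch of that comparison -- hypothetical cut-down structs, not the full Sample above; the only difference between the two is the boltholdIndex tag:

package samples

import "time"

// IndexedSample keeps the index: the boltholdIndex tag on Type makes
// bolthold maintain a "Type" index entry on every insert.
type IndexedSample struct {
	Type  string    `boltholdIndex:"Type"`
	Time  time.Time `boltholdKey:"Time"`
	Value float64
}

// PlainSample is the control case: no boltholdIndex tag, so an insert
// should just be encoding time plus a normal bolt put.
type PlainSample struct {
	Type  string
	Time  time.Time `boltholdKey:"Time"`
	Value float64
}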

cbrake (Author) commented Apr 24, 2020

Much flatter without an index:

[chart: bolthold sample count vs. insert time, without index]

So I guess with time series data you don't really want to use an index, because the index gets huge.

Another way to do this might be to put each sample type in its own bucket.

Or there may be a more efficient way to implement an index -- perhaps a separate bucket for each sample Type, with each index entry stored as a separate record in the bucket -- then adding records would be fast? Databases are fun to think about -- lots of tradeoffs to be made.
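
A rough sketch of that bucket-per-type idea, going straight to bbolt rather than through bolthold (the function and layout here are hypothetical, not anything bolthold does today):

package samples

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// writeSampleByType puts each sample type in its own bucket, keyed by
// a big-endian nanosecond timestamp, so no shared index record has to
// be re-encoded on every insert.
func writeSampleByType(db *bolt.DB, sampleType string, tsNano int64, encoded []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(sampleType))
		if err != nil {
			return err
		}
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, uint64(tsNano))
		return b.Put(key, encoded)
	})
}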

Thanks for the help!

cbrake closed this as completed Apr 24, 2020
timshannon (Owner) commented:

You should still be able to use indexes on time series data, but what I'm guessing is happening is that your index on "tag" might not be very unique. It's usually a good idea to have fairly unique values in indexes; however, in a regular database it shouldn't impact insert performance that drastically.

However, what I do with indexes in bolthold is a pretty naive implementation. I simply store the entire index under one key value, so the less unique the index, the more gets stored (and thus decoded and encoded) on each insert. I'm guessing that's what's happening in your scenario here.

I can make my index handling more like a "real" database, and split the values across multiple keys, but it'll take quite a bit of reworking.
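
For illustration only (this is not bolthold's actual code), that split-key layout could look something like this on top of bbolt -- one small entry per record instead of one growing record per index value:

package samples

import bolt "go.etcd.io/bbolt"

// addIndexEntry stores one index entry per record: the key is the
// index value concatenated with the primary key, and the value is
// empty. An insert then writes one small key instead of re-encoding
// an ever-growing list, and looking up everything with a given index
// value becomes a cursor prefix scan.
func addIndexEntry(tx *bolt.Tx, indexBucket, indexValue, primaryKey []byte) error {
	b, err := tx.CreateBucketIfNotExists(indexBucket)
	if err != nil {
		return err
	}
	key := append(append([]byte{}, indexValue...), primaryKey...)
	return b.Put(key, []byte{})
}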

I'll open an issue for that. I appreciate you bringing this up.

cbrake (Author) commented Apr 24, 2020

Yes, I'm using a small number of Types relative to the number of samples -- maybe 6 or so -- so they are not very unique.

cbrake (Author) commented Apr 24, 2020

One more note -- without an index, and with 500,000 samples in the DB, the insert time is still ~50 ms/sample. This is great -- it means I can use bolthold to record just about any amount of time series data on this device. I'm currently using around 715 bytes/sample, and would like to experiment with protobuf to see if that would be faster/more efficient.

nicewook commented:

Your discussion helped me a lot.
Do you think the number of indexed fields also affects performance? I need to query up to 1,000,000 logs by start/end date, so I need an index. Can I ask for any suggestions?

timshannon (Owner) commented:

Having many indexes will definitely impact performance of inserts and updates, because those indexes will need to be maintained on every insert and update.

I wouldn't recommend putting an index on a date/time if you can help it. Go Time structs are very accurate, so you'll end up with a very non-unique index.

If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date.
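
A minimal sketch of that, assuming a hypothetical Log type keyed on its start time, and using bolthold's Key field queries so the range seek happens on the key rather than on a separate index:

package logs

import (
	"time"

	"github.com/timshannon/bolthold"
)

// Log is a hypothetical record keyed by its start time.
type Log struct {
	Start   time.Time `boltholdKey:"Start"`
	Message string
}

// findRange queries on the key itself rather than an index, so
// bolthold can seek straight to the start of the range.
func findRange(store *bolthold.Store, from, to time.Time) ([]Log, error) {
	var result []Log
	err := store.Find(&result,
		bolthold.Where(bolthold.Key).Ge(from).And(bolthold.Key).Le(to))
	return result, err
}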

cbrake (Author) commented Jul 18, 2020

One problem I ran into using the Go Time type as a key is that the gob-encoded data of a Go Time is not always monotonic with time, so seeks to a date would not always work. When I converted the timestamps to int64 and inserted the bytes into the key in big-endian format, seeks were very fast and reliable. I may be missing something, but since Go Time is a struct, its encoded data will likely not always be monotonic.
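
A sketch of that conversion (assuming post-1970 timestamps, since casting a negative int64 to uint64 would break the ordering):

package samples

import (
	"encoding/binary"
	"time"
)

// timeKey converts a timestamp into an 8-byte big-endian key. Unlike
// the gob encoding of a time.Time struct, these bytes sort in the
// same order as the timestamps themselves, so range seeks work.
func timeKey(t time.Time) []byte {
	key := make([]byte, 8)
	binary.BigEndian.PutUint64(key, uint64(t.UnixNano()))
	return key
}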

nicewook commented Jul 20, 2020

@timshannon Thank you for your advice.

  1. Use the minimum number of indexes I can afford!
  2. Do I need to use both the key tag and an index, or will just the key work?
  3. I tested with badgerhold. It works much better, but it requires a lot of disk space.
  4. I didn't get this one. Could you explain a bit more?

> If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date.

@cbrake Thanks. I will try to use int64 (unix time): I'll take the query start/end dates as RFC3339, convert them to uint64, and then query bolthold.
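
A small sketch of that conversion (unix seconds drop sub-second precision, which should be fine for date-level bounds):

package logs

import "time"

// queryBound parses an RFC3339 timestamp and converts it to a
// uint64 unix-time key for the range query.
func queryBound(s string) (uint64, error) {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		return 0, err
	}
	return uint64(t.Unix()), nil
}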
