Write performance #105

Closed · cbrake opened this issue Apr 24, 2020 · 9 comments

cbrake commented Apr 24, 2020

I've been using bolthold on an embedded Linux system (eMMC storage). I'm noticing that as the DB grows, the write performance falls off linearly.

[chart: bolthold sample count vs. insert time]

I'm using an increasing timestamp for the key, so I would expect sequential rather than random access.

Below is the insert code:

// Sample represents a value in time and should include data that may be
// graphed.
type Sample struct {
	// Type of sample (voltage, current, key, etc)
	Type string `json:"type,omitempty" boltholdIndex:"Type" influx:"type,tag"`

	// ID of the device that provided the sample
	ID string `json:"id,omitempty" influx:"id,tag"`

	// Average OR
	// Instantaneous analog or digital value of the sample.
	// 0 and 1 are used to represent digital values
	Value float64 `json:"value,omitempty" influx:"value"`

	// statistical values that may be calculated
	Min float64 `json:"min,omitempty" influx:"min"`
	Max float64 `json:"max,omitempty" influx:"max"`

	// Time the sample was taken
	Time time.Time `json:"time,omitempty" boltholdKey:"Time" gob:"-" influx:"time"`

	// Duration over which the sample was taken
	Duration time.Duration `json:"duration,omitempty" influx:"duration"`

	// Tags are additional attributes used to describe the sample
	// You might add things like friendly name, etc.
	Tags map[string]string `json:"tags,omitempty" influx:"-"`

	// Attributes are additional numerical values
	Attributes map[string]float64 `json:"attributes,omitempty" influx:"-"`
}

// DataMeta is used to store meta information about data in the database
type DataMeta struct {
	SampleCount int
}

// WriteSample writes a sample to the database
// Samples are flow, pressure, amount, etc.
func (db *IsDb) WriteSample(sample data.Sample) error {
	dataMeta := DataMeta{}
	err := db.store.Get(0, &dataMeta)
	if err != nil {
		// attempt to init metadata
		_, err = db.GetSampleCount()
		if err != nil {
			return err
		}
	}
	err = db.store.Insert(sample.Time, sample)
	if err != nil {
		return err
	}

	dataMeta.SampleCount++
	return db.store.Upsert(0, &dataMeta)
}

Once I get to 100,000 samples or so, the performance is really slow (2+ seconds to insert a sample). I'm thinking something is not quite right, as I read about people using multi-TB bolt databases, but it seems there is no way this could work for my use case.

I tried setting FreelistType to FreelistMapType -- that did not seem to make any difference.

I'd appreciate any thoughts: is this normal, or can it be optimized?

Cliff

timshannon self-assigned this Apr 24, 2020
timshannon (Owner) commented:

Do you get the same performance drop off if you don't use an index? If there is no index handling in the insert, then the performance should be the exact same as encoding time + normal bolt insert time.
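
For reference, a minimal sketch of that comparison -- hypothetical cut-down structs, not the full Sample above; the only difference between the two is the boltholdIndex tag:

package samples

import "time"

// IndexedSample keeps the index: the boltholdIndex tag on Type makes
// bolthold maintain a "Type" index entry on every insert.
type IndexedSample struct {
	Type  string    `boltholdIndex:"Type"`
	Time  time.Time `boltholdKey:"Time"`
	Value float64
}

// PlainSample is the control case: no boltholdIndex tag, so an insert
// should just be encoding time plus a normal bolt put.
type PlainSample struct {
	Type  string
	Time  time.Time `boltholdKey:"Time"`
	Value float64
}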

cbrake (Author) commented Apr 24, 2020

Much flatter without an index:

[chart: bolthold sample count vs. insert time, without index]

So I guess with time series data you don't really want to use an index, because the index gets huge.

Another way to do this might be to put each sample type in its own bucket.

Or there may be a more efficient way to implement an index -- perhaps a separate bucket for each sample Type, with each index entry stored as a separate record in the bucket -- then adding records would be fast? Databases are fun to think about -- lots of tradeoffs to be made.
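
A rough sketch of that bucket-per-type idea, going straight to bbolt rather than through bolthold (the function and layout here are hypothetical, not anything bolthold does today):

package samples

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// writeSampleByType puts each sample type in its own bucket, keyed by
// a big-endian nanosecond timestamp, so no shared index record has to
// be re-encoded on every insert.
func writeSampleByType(db *bolt.DB, sampleType string, tsNano int64, encoded []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(sampleType))
		if err != nil {
			return err
		}
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, uint64(tsNano))
		return b.Put(key, encoded)
	})
}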

Thanks for the help!

cbrake closed this as completed Apr 24, 2020
timshannon (Owner) commented:

You should still be able to use indexes on time series data, but what I'm guessing is happening is that your index on "tag" might not be very unique. It's usually a good idea to have fairly unique values in indexes; however, in a regular database it shouldn't impact insert performance that drastically.

However, what I do with indexes in bolthold is a pretty naive implementation. I simply store the entire index under one key value, so the less unique the index, the more gets stored (and thus decoded and encoded) on each insert. I'm guessing that's what's happening in your scenario here.

I can make my index handling more like a "real" database, and split the values across multiple keys, but it'll take quite a bit of reworking.
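
For illustration only (this is not bolthold's actual code), that split-key layout could look something like this on top of bbolt -- one small entry per record instead of one growing record per index value:

package samples

import bolt "go.etcd.io/bbolt"

// addIndexEntry stores one index entry per record: the key is the
// index value concatenated with the primary key, and the value is
// empty. An insert then writes one small key instead of re-encoding
// an ever-growing list, and looking up everything with a given index
// value becomes a cursor prefix scan.
func addIndexEntry(tx *bolt.Tx, indexBucket, indexValue, primaryKey []byte) error {
	b, err := tx.CreateBucketIfNotExists(indexBucket)
	if err != nil {
		return err
	}
	key := append(append([]byte{}, indexValue...), primaryKey...)
	return b.Put(key, []byte{})
}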

I'll open an issue for that. I appreciate you bringing this up.

cbrake (Author) commented Apr 24, 2020

Yes, I'm using a small number of Types relative to the number of samples -- maybe 6 or so -- so they are not very unique.

cbrake (Author) commented Apr 24, 2020

One more note -- without an index, and with 500,000 samples in the DB, the insert time is still ~50 ms/sample. This is great -- it means I can use bolthold to record just about any amount of time series data on this device. I'm currently using around 715 bytes/sample, and would like to experiment with protobuf to see if that would be faster/more efficient.

nicewook commented:

Your discussion helped me a lot.
Do you think the number of indexed fields also affects performance? I need to query up to 1,000,000 logs by start/end date, so I need an index. Can I ask for any suggestions?

timshannon (Owner) commented:

Having many indexes will definitely impact performance of inserts and updates, because those indexes will need to be maintained on every insert and update.

I wouldn't recommend putting an index on a date/time if you can help it. Go Time structs are very accurate, so you'll end up with a very non-unique index.

If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date.
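
A minimal sketch of that, assuming a hypothetical Log type keyed on its start time, and using bolthold's Key field queries so the range seek happens on the key rather than on a separate index:

package logs

import (
	"time"

	"github.com/timshannon/bolthold"
)

// Log is a hypothetical record keyed by its start time.
type Log struct {
	Start   time.Time `boltholdKey:"Start"`
	Message string
}

// findRange queries on the key itself rather than an index, so
// bolthold can seek straight to the start of the range.
func findRange(store *bolthold.Store, from, to time.Time) ([]Log, error) {
	var result []Log
	err := store.Find(&result,
		bolthold.Where(bolthold.Key).Ge(from).And(bolthold.Key).Le(to))
	return result, err
}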

cbrake (Author) commented Jul 18, 2020

One problem I ran into using the Go Time type as a key is that the gob-encoded data of a Go Time is not always monotonic with time, so seeks to a date would not always work. When I converted the timestamps to int64 and inserted the bytes into the key in big-endian format, seeks were very fast and reliable. I may be missing something, but since Go Time is a struct, its encoded data will likely not always be monotonic.
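
A sketch of that conversion (assuming post-1970 timestamps, since casting a negative int64 to uint64 would break the ordering):

package samples

import (
	"encoding/binary"
	"time"
)

// timeKey converts a timestamp into an 8-byte big-endian key. Unlike
// the gob encoding of a time.Time struct, these bytes sort in the
// same order as the timestamps themselves, so range seeks work.
func timeKey(t time.Time) []byte {
	key := make([]byte, 8)
	binary.BigEndian.PutUint64(key, uint64(t.UnixNano()))
	return key
}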

nicewook commented Jul 20, 2020

@timshannon Thank you for your advice.

  1. Use the minimum number of indexes I can afford!
  2. Do I need to use both the key tag and an index, or will just the key work?
  3. I tested with badgerhold. It works much better, but it requires a lot of disk space.
  4. I didn't get this one. Could you explain a bit more?

> If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date.

@cbrake Thanks. I will try to use int64 (unix time): I'll take the query start/end dates as RFC3339, convert them to uint64, and then query bolthold.
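
A small sketch of that conversion (unix seconds drop sub-second precision, which should be fine for date-level bounds):

package logs

import "time"

// queryBound parses an RFC3339 timestamp and converts it to a
// uint64 unix-time key for the range query.
func queryBound(s string) (uint64, error) {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		return 0, err
	}
	return uint64(t.Unix()), nil
}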
