Bug: Race condition causing query results to be incomplete on cold UID caches #1760
Comments
muffix added a commit to muffix/opentsdb that referenced this issue on Aug 27, 2019
Synchronises the list that holds the KeyValues that have been produced by the scanner callbacks. The list is accessed from multiple threads at a time and wasn't thread-safe, causing inconsistent results and partial loss of data in the response. Relates to: OpenTSDB#1753 Resolves: OpenTSDB#1760
johann8384 pushed a commit that referenced this issue on Feb 24, 2021
#1458)
* For branch next, add an expression function named FirstDifference, which calculates the first difference of a time series. I noticed there is MovingAverage calculation, so I thought maybe I can enrich the mathematics functions into that.
* add some unit tests for FirstDifference

Bump version to 2.5.0-SNAPSHOT.

Fix a compilation error about missing FirstDifference (#1471)
Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Bugfix of FsckOptions. (#1464)
Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

CORE: (#1472)
- Add RpcResponder for handling callbacks asynchronously
UTILS:
- Add two convenient methods in Config
Signed-off-by: Chris Larsen <clarsen@verizonmedia.com>

fix #1581 by correcting an edge case in TsdbQuery.getScanEndTimeSeconds() (#1582)

Dockerfile that works without a script. (#1739)

replace FOREVER with a valid value in table creation (#1967)
Co-authored-by: Ion DULGHERU <ion.dulgheru@gmail.com>

Jackson has a serious security problem in 2.9.5, which will cause RCE (#2034)
* Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295
* Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295
Co-authored-by: chi-chi weng <949409306@qq.com>

Pr 1663 (#1966)
* Make UniqueIdRpc aware of the mode
* Update javadoc on new method and rename test methods to be more descriptive
Co-authored-by: Simon Matic Langford <simon@exemel.co.uk>

Re-introduce query timeouts. (#2035)
Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org>

Updating maven central urls and versions to match what is available now (#2039)
Fixes #1899
Fixes #1941

always write cli tools to stdout (#1488)
Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Fix #1632 (#1634)

Add "check_tsd_v2" script (#1567)
Enhanced check_tsd script evaluates each individual metric group separately when given a filter

Collect stats from meta cache plugin if configured (#1649)

Fix SaltScanner race condition on spans maps (#1651)
* Fix SaltScanner race condition on spans maps
* Fix 1.6 compatibility

Synchronise the KVs list for scanner results
Synchronises the list that holds the KeyValues that have been produced by the scanner callbacks. The list is accessed from multiple threads at a time and wasn't thread-safe, causing inconsistent results and partial loss of data in the response.
Relates to: #1753
Resolves: #1760

Allow rollup downsample and series aggregator to be different

Fix TestSaltScannerHistogram, looks like the method was renamed and the UTs were not adjusted.

ExplicitTags filtering with FuzzyFilters

Fix PR 1896 with the fuzzy filter list so that it will honor the regex filter and properly ignore rows that don't match the explicit filter. Also sort the fuzzy filter list in ascending order and implement a static comparator instead of instantiating one on each call.

Test rollup filter fix for #1083

Fix concurrent result reporting from scanners
Fixes a concurrency bug where scanners report their results into a map and would overwrite each other's results
Resolves: #1753

Update Maven jars URLs with HTTPS access

Remove excess param in javadoc for RpcHandler

Fix check_tsd_v2 (#1937)
* renamed instancename of logger
  The previous name was copied from another script, cosmetic change only
* Change behaviour of --ignore-recent option
  Previous option would fetch data from opentsdb from --duration seconds ago to time.now(), and then try to remove timestamps that was inside the --ignore-recent seconds ago, however the logic was flawed and it actually only included these seconds. Furthermore opentsdb supports setting an "end" parameter, so we use this to only get the data we want. for example -d 180 -I 80, would render a query parameter that looks like `?start=180s-ago&end=80s-ago`. Keeps it simple. Also added debuglogging to output the actual query sent to OpenTSDB if --debug option is enabled.
* fixed logic of --percent-over parameter
  Previous behaviour didn't work due to wrong logic, would set "crit" or "warn" to True regardless. This change fixes that.
* better output from logging
  Add logmessages to be consistent across alerting-scenarios, and changed format of some floats. Fixed a log messaged that displayed "crit" value where it should have been "warn" value.
* Fixed bug in logic that parses results
  Removed an if statement that `continue`:ed the for-loop if a result was neither a `crit` or `warn` already, however this check also made the logic skip the test to see if no values were returned by opentsdb and -A flag was specified to alert in such scenarios.
* changed check for timestamps type
  Previous behaviour was to check if a timestamp could be cast as a float, which is a bit weird, because opentsdb will return integers. I do doubt that opentsdb would return a timestamp that is not an integer to begin with, so i suspect this check is redundant, but leaving it in for now regardless, as per discussion in PR.

Rename maxScannerUidtoStringTime into maxScannerUidToStringTime (#1875)

Fix the missing index from #1754 in the salt scanner.

Force Sunday as first day of week.

Tweak TestTsdbQueryQueries to pass in older java versions.

Fix the min case for doubles in AggregationIterator.

Fix the Screw Driver config.

Fix UT for JDK8 PR for SD config.

Fix: Rollup queries with count aggregator produce unexpected results (#1895)
Co-authored-by: Tony Di Nucci <tony.dinucci@skyscanner.net>

Fixed function description Fixes #841 (#2040)

Added tracking of metrics which are null due to auto_metric being disabled Fixes #786 (#2042)

Add support for splitting rollup queries (#1853)
* Add an SLA config flag for rollup intervals
  Adds a configuration option for rollup intervals to specify their maximum acceptable delay. Queries that cover a time between now and that maximum delay will need to query other tables for that time interval.
* Add global config flag to enable splitting queries
  Adds a global config flag to enable splitting queries that would hit the rollup table, but the rollup table has a delay SLA configured. In that case, this feature allows splitting a query into to; one that gets the data from the rollups table until the time where it's guaranteed to be available, and the rest from the raw table.
* Add a new SplitRollupQuery
  Adds a SplitRollupQuery class that suports splitting a rollup query into two separate queries. This is useful for when a rollup table is filled by e.g. a batch job that processes the data from the previous day on a daily basis. Rollup data for yesterday will then only be available some time today. This delay SLA can be configured on a per-table basis. The delay would specify by how much time the table can be behind real time. If a query comes in that would query data from that blackout period where data is only available in the raw table, but not yet guaranteed to be in the rollup table, the incoming query can be split into two using the SplitRollupQuery class. It wraps a query that queries the rollup table until the last guaranteed to be available timestamp based on the SLA; and one that gets the remaining data from the raw table.
* Extract an AbstractQuery
  Extracts an AbstractQuery from the TsdbQuery implementation since we'd like to reuse some parts of it in other Query classes (in this case SplitRollupQuery)
* Extract an AbstractSpanGroup
* Avoid NullPointerException when setting start time
  Avoids a NullPointerException that happened when we were trying to set the start time on a query that would be eligible to split, but due to the SLA config only hit the raw table anyway.
* Scale timestamps to milliseconds for split queries
  Scales all timestamps for split queries to milliseconds. It's important to maintain consistent units between all the partial queries that make up the bigger one.
* Fix starting time error for split queries
  Fixes a bug that would happen when the start time of a query aligns perfectly with the time configured in the SLA for the delay of a rollup table. For a defined SLA, e.g. 1 day, if the start time of the query was exactly 1 day ago, the end time of the rollups part of the query would be updated and then be equal to its start time. That isn't allowed and causes a query exception.
bshakur8 pushed a commit to bshakur8/opentsdb that referenced this issue on Oct 27, 2021
This is a similar, but different bug from #1753.
Whenever the scanners finish fetching rows from HBase, they compact the cells and add the new points to a list. The list is not thread-safe, but accessed by multiple threads at a time. This can lead to concurrent modifications and inconsistent query results.
We observed unstable query results when OpenTSDB had to resolve many UIDs to their string values due to the use of regex filters (code starting here). Our query was looking at high-cardinality metrics, forcing it to resolve ~65000 UIDs.
On fresh/empty UID caches, these resolutions trigger the resolution and match callbacks asynchronously and from different threads, which means that `processRow()` is also called concurrently from the callback on different threads, modifying the non-thread-safe `kvs` list.
We were able to reliably reproduce the issue with queries for high-cardinality metrics. Symptoms include `ConcurrentModificationException`s and `Found a key value item that was null` in the logs.
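For illustration, here is a minimal, self-contained sketch of the pattern described above and of one way to make it safe. It is not OpenTSDB's actual code: `KvsCollector`, `collect()` and `snapshot()` are hypothetical names, plain strings stand in for the compacted KeyValues, and only `processRow()` and `kvs` borrow their names from the issue. The sketch assumes that wrapping the list with `Collections.synchronizedList` is an acceptable fix; with a plain `ArrayList` in its place, the same concurrent calls could lose entries or leave null slots, which would match the symptoms reported here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the scanner callback path; not OpenTSDB's classes.
final class KvsCollector {
  // The synchronized wrapper makes each add() atomic with respect to other
  // callers. A bare ArrayList here would not be safe for concurrent adds.
  private final List<String> kvs =
      Collections.synchronizedList(new ArrayList<String>());

  // Stand-in for processRow(): may be invoked from many callback threads.
  void processRow(String compactedCell) {
    kvs.add(compactedCell);
  }

  // Copy the contents while holding the list's lock so the snapshot is consistent.
  List<String> snapshot() {
    synchronized (kvs) {
      return new ArrayList<String>(kvs);
    }
  }

  // Simulates `threads` callbacks, each reporting `rowsPerThread` rows concurrently.
  static List<String> collect(int threads, int rowsPerThread) {
    final KvsCollector collector = new KvsCollector();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.execute(() -> {
        for (int r = 0; r < rowsPerThread; r++) {
          collector.processRow("row-" + id + "-" + r);
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return collector.snapshot();
  }

  public static void main(String[] args) {
    List<String> rows = collect(8, 10_000);
    System.out.println(rows.size() + " rows, contains null: " + rows.contains(null));
  }
}
```

With the synchronized wrapper, every row reported by every thread ends up in the list, so the count is always `threads * rowsPerThread` and no entry is null. A lock-free collection such as `ConcurrentLinkedQueue` might be an alternative if contention on the list's lock turned out to matter, but that is a design choice this sketch does not evaluate.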