-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ingest: correctly measure chained pipeline stats #33912
ingest: correctly measure chained pipeline stats #33912
Conversation
Prior to this change when a pipeline processor called another pipeline, only the stats for the first processor were recorded. The stats for the subsequent pipelines were ignored. This change properly accounts for pipelines irregardless if they are the first or subsequently called pipelines. This change moves the state of the stats from the IngestService to the pipeline itself. Cluster updates are safe since the pipelines map is now a ConcurrentHashMap, and if a cluster update happens while iterating over stats (now read directly from the pipeline) a slightly stale view of stats may shown.
Pinging @elastic/es-core-infra |
Jenkins test this please. |
@@ -77,10 +76,9 @@ | |||
// We know of all the processor factories when a node with all its plugin have been initialized. Also some | |||
// processor factories rely on other node services. Custom metadata is statically registered when classes | |||
// are loaded, so in the cluster state we just save the pipeline config and here we keep the actual pipelines around. | |||
private volatile Map<String, Pipeline> pipelines = new HashMap<>(); | |||
private volatile Map<String, Pipeline> pipelines = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the whole reason we have a volatile
and just use a normal non-concurrent HashMap
is that we're only updating this field's reference but never update a specific instance of the map => I don't think you need to move to CHM here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah.. you are right updates are rebuilt local to a method then atomically swapped to an unmodified map. => no need for CHM
will put back to normal HM and update commit message.
for (Map.Entry<String, StatsHolder> entry : statsHolderPerPipeline.entrySet()) { | ||
statsPerPipeline.put(entry.getKey(), entry.getValue().createStats()); | ||
} | ||
Map<String, IngestStats.Stats> statsPerPipeline = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I think even in 2018
JDKs the old version with the manual iteration is faster. I don't think it matters much here, but probably not worth the noise to change to the functional iteration here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable name is the same, but the object and how it iterates is different, hence the update. (e.g. more then noise)
new TestProcessor(ingestDocument -> { | ||
ingestDocument.setFieldValue(key1, randomInt()); | ||
try { | ||
Thread.sleep(100); //force the stat time to be non-zero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it technically suffice to sleep for 2
here (or even 1
, not sure about the potential rounding/accuracy issues with that and nanotime()
, but 2
should be fine for getting a visible change in the metric?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, 100 is bit overkill. I will drop this down.
@@ -257,10 +255,19 @@ public IngestInfo info() { | |||
@Override | |||
public void applyClusterState(final ClusterChangedEvent event) { | |||
ClusterState state = event.state(); | |||
int beforeHashCode = pipelines.hashCode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I think you can just keep a reference here to the pipelines
field and then use straight instance inequality instead of hash codes below to check if the pipelines changed. That would be a little clearer since we specifically use the volatile
pipelines in a way so that we only ever replace the pipelines
instance as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. now that i see the update is is an atomic swap I can clean this up by holding onto the original reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a few nits :)
If @rjernst is happy I'm happy
@original-brownbear thanks for the review. changes added on 526ae4b |
@rjernst - ping |
Jenkins test this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple questions
new TestProcessor(ingestDocument -> { | ||
ingestDocument.setFieldValue(key1, randomInt()); | ||
try { | ||
Thread.sleep(2); //force the stat time to be non-zero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you mock the clock instead? time can go backwards due to correction, thus this is not guaranteed to be non-zero. Mocking would also allow better control to test the behavior of adding/joining the metrics objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
class IngestMetric { | ||
|
||
private final MeanMetric ingestMetric = new MeanMetric(); | ||
private final CounterMetric ingestCurrent = new CounterMetric(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "current"? I see this existed before, but the purpose is unclear. Some java docs here would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current
is briefly incremented while the ingest thing that is getting measured (total, pipeline, (soon) processor) is working on a document. It makes the most sense for total to see how many documents are currently being processed.
I will add some Javadoc.
} | ||
|
||
void add(IngestMetric metrics){ | ||
ingestCount.inc(metrics.ingestCount.count()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this missing "current"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but I see why it looks that.
add
is used to carry forward metrics between cluster state changes (when the pipeline changes). When the cluster state changes for the pipelines, the pipeline instances are rebuilt, and now (with this change) the metrics get wiped out too. Add is used to add the old metrics to the metrics on the newly created pipeline instances (pre-populate/carry forward the old). Since the old metrics are a snapshot at time of cluster state change, we don't want to carry forward current
since it will never be decremented. current
may incorrectly show zero value during pipelines cluster state change until the existing bulk request is finished processing. This should be fairly rare, fairly short time period where current drops to zero.
I can change the name to carryForward, prePopulate ? other suggestions to make this more obvious ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully the added javadoc sufficiently addresses the missing addition of current
.
@rjernst - changes implemented, can you take another look ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
*/ | ||
private final MeanMetric ingestTime = new MeanMetric(); | ||
/** | ||
* The current count of things being measure. Should mostly like ever be 0 or 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wording is confusing here: most likely?
} | ||
|
||
/** | ||
* Add two sets of metrics together. *Important* does NOT add the current count intentionally, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this important section down to its own paragraph, and please use proper javadoc formatting for emphasis and paragraph breaks.
@rjernst thanks! I have addressed the minor javadoc comments and will merge on green. |
Prior to this change when a pipeline processor called another pipeline, only the stats for the first processor were recorded. The stats for the subsequent pipelines were ignored. This change properly accounts for pipelines irregardless if they are the first or subsequently called pipelines. This change moves the state of the stats from the IngestService to the pipeline itself. Cluster updates are safe since the pipelines map is atomically swapped, and if a cluster update happens while iterating over stats (now read directly from the pipeline) a slightly stale view of stats may be shown.
Prior to this change when a pipeline processor called another pipeline, only the stats for the first processor were recorded. The stats for the subsequent pipelines were ignored. This change properly accounts for pipelines irregardless if they are the first or subsequently called pipelines. This change moves the state of the stats from the IngestService to the pipeline itself. Cluster updates are safe since the pipelines map is atomically swapped, and if a cluster update happens while iterating over stats (now read directly from the pipeline) a slightly stale view of stats may be shown.
6.5 backport: 8e72a68 |
Prior to this change when a pipeline processor called another pipeline, only the stats for the first processor were recorded. The stats for the subsequent pipelines were ignored. This change properly accounts for pipelines irregardless if they are the first or subsequently called pipelines. This change moves the state of the stats from the IngestService to the pipeline itself. Cluster updates are safe since the pipelines map is atomically swapped, and if a cluster update happens while iterating over stats (now read directly from the pipeline) a slightly stale view of stats may be shown.
Prior to this change when a pipeline processor called another
pipeline, only the stats for the first processor were recorded.
The stats for the subsequent pipelines were ignored. This change
properly accounts for pipelines irregardless if they are the first
or subsequently called pipelines.
This change moves the state of the stats from the IngestService
to the pipeline itself. Cluster updates are safe since the pipelines
map is atomically swapped, and if a cluster update happens
while iterating over stats (now read directly from the pipeline)
a slightly stale view of stats may be shown.
With the following setup:
Before this change:
^^ Notice that pipeline2/3 were called but the stats do not reflect that they are.
After this change:
Note, before and after this change the time_in_millis includes the children pipelines. It is difficult to emulate outside of unit testing ... but if pipeline3 takes 3ms, and pipeline2 takes 2 milliseconds, and pipeline1 takes 1 ms, (assuming pipeline1->pipelin2->pipeline3) then pipeline1 = 6ms (1 + 2 + 3) , pipeline 2 = 5ms (2 + 3), and pipeline 3ms.
Also apologizes for the noise around the imports, but if understand correctly the updated versions is the preferred order. (and it is kinda painful to manage them manually or continually change my IDE settings)