Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity enable #530

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

TheFixer
Copy link
Contributor

@TheFixer TheFixer commented May 29, 2020

This PR implements delayed enabling of participants, publishers, subscribers, readers and writers using dds_enable() call. Delayed enabling allows qos changes to entities that before they are enabled, and that would result in DDS_RETCODE_IMMUTABLE_POLICY after the entity has been enabled.

In particular, this PR contains the following

  1. The entity_factory.autoenable_created_entities QoS is implemented for participants, publishers and subscribers. This QoS indicates whether child entities for these must be automatically enabled (default) or disabled when created.
  2. Implementation of the API call dds_enable() to enable entities that have been created disabled. This operation is a noop when applied to an already enabled entity
  3. Delayed creation of readers and writers in ddsi so that disabled readers and writers are not discovered
  4. A bunch of tests cases to validate the implementation of this feature

The following issues are still present:

  1. Since there is no factory for participants in cyclone that has the autoenable qos it is currently impossible to create a disabled participant. Ofcourse this functionality can be built, but I just did not do it (yet)
  2. And perhaps worse, there is one situation were I still encounter problems. This has to do with deleting a participant while there are disabled entities, this gives a timeout. There are test cases added to demonstrate this problem.

@eboasson
Copy link
Contributor

@TheFixer I'm sorry it is taking so long to review. It hasn't slipped through the cracks, it is just that it needs and deserves a careful look.

@TheFixer
Copy link
Contributor Author

No problem, @eboasson . I am still struggling a bit with properly cleaning up entities that have not been enabled yet. I suggest I keep struggling for a while while you spent your valuable time on more urgent issues.

@TheFixer
Copy link
Contributor Author

@eboasson This PR implements delayed enabling of entities. A more detailed description of what has been done can be found in https://github.com/TheFixer/cyclonedds/commit/18f39fca458f365981b34e5928e6768148e0fe64. Would it be possible to review this PR?

@TheFixer
Copy link
Contributor Author

TheFixer commented Feb 4, 2021

It has been a few months ago that I spent a considerable amount of time to attempt to implement the dds_enable() and get it running on the (then current) master. So far I have not received any comments. Therefore, I am about to close this pull request.

@TheFixer TheFixer closed this Feb 4, 2021
@eboasson
Copy link
Contributor

eboasson commented Feb 5, 2021

@TheFixer I am very sorry that it got ignored for so long. I don't know why I missed this:

@eboasson This PR implements delayed enabling of entities. A more detailed description of what has been done can be found in TheFixer/cyclonedds-old@18f39fc. Would it be possible to review this PR?

but it sure hasn't helped that it got lost somewhere in either my mailbox or my memory. As to the relevance of the PR, it is a rather important bit considering that content filtering is something that should happen at the reader level (rather than via "content filtered topics") and that the "enable" functionality is a key part of making it possible to set a content a filter on a reader before it starts receiving data. (The alternative would be setting it via a QoS, but that has some other downsides, at least at this time.)

So allow me to reopen the PR and make a priority of doing the review it deserves. We'll see how large the impact is of the changes that have been merged into master since then, but as they don't involve major changes to the basics of creating or deleting entities, I have good hopes it won't be too bad.

And next time, if you don't hear anything and are wondering what's going on, please don't wait so long and just ping me!

@eboasson eboasson reopened this Feb 5, 2021
Signed-off-by: TheFixer <thefixer@iteazz.com>

This pull request implements delayed enabling of entities, in particular
of publishers, subscribers, readers and writers. Delayed enabling allows
immutable qos changes to be changed as long as the entity is not enabled,
but once the entity is enabled immutable qos changes are not allowed anymore.

Delayed enabling is implemented by deferring the creation of ddsi entities as
long as the entity is not enabled. The application can call dds_enable() on a
disabled entity to enable it. Calling dds_enable() on an already enabled entity
is considered a noop for the application.

This pull request implements the following behaviour:

- Participants are always enabled.
- Created entities will be automatically enabled when their parent entity is enabled,
  and their parent has autoenable set to true. This is the default.
- Calling dds_create_writer() on a disabled publisher which has autoenable=true will
  return a dds writer without a corresponding ddsi writer (and similar for readers)
- Disabled entities can be enabled by calling dds_enable() on the entity.
  When the parent of the entity is not enabled, enabling will fail and return with
  DDS_RETCODE_PRECONDITION_NOT_MET. When the entity cannot be enabled because of qos
  mismatches or invalid security credentials, an error code will be returned.
  The call dds_enable() will return DDS_RETCODE_OK if is has enabled the entity
  and alls its disabled children that have been created while their parent has
  autoenable=true. If not all children are successfully enabled then dds_enable()
  will return with the error code of the first child that that could not be enabled.
- Calling dds_enable() on an already enabled entity is a noop.
- The m_iid is pre-allocated when a dds_entity is created, and is reused when the
  corresponding ddsi entity gets enabled.
- Immutable QoS settings can be changed on an entity as long as the entity is not enabled.
  As a consequence the creation of some qos dependent structures (like whc) have been
  moved from the dds entity to the ddsi entity.
- The whc is created when entity is enabled, not prior to that. The same holds for the
  xpack creation.
- When readers that have no predefined rhc, the rhc will be created when the reader gets
  enabled (i.e., the call to dds_rhc_default_new() is done when the entity is enabled)
- Readers that do have a predefined rhc will use this rhc when becoming enabled.
- thread_state_awake() and thread_state_sleep() functions have moved to the location where
  entities are enabled.
- The check for the availability of the security credentials has been moved to location
  where entities are enabled. This implies that a disabled dds writer or reader can be
  created before checking its security credentials. To not change legacy behaviour the
  creation of a dds reader or writer fails with DDS_RETCODE_NOT_ALLOWED_BY_SECURITY
  when the reader or writer is automatically enabled when created.
  Note that this can only happen for writers whose parent is enabled and have autoenabled
  set to true.
- If an entity (e.g., subscriber) is enabled which causes recursive calls to dds_enable()
  on multiple children (e.g., readers), and some of these child entities are successfully
  enabled and others are not (e.g., because the some of the readers have the correct security
  credentials and others have not) then the dds_enable() on the parent will fail. It is
  the responsibility of the application to find out which of the entities have been
  enabled, and which have not.
- When a reader is created with an implicit subscriber, then the implicit subscriber is
  enabled if and only if the autoenable setting of the participant is set to true.
  Similarly for the creation of a writer with an implicit publisher.
- Various tests have been implemented.

Notes for the reviewer.
- If a participant has autoenable=false and an reader (or writer) with an implicit
  subscriber (or publisher) is created, I don't think it is possible to enable the
  implicit subscriber (or publisher)( anymore. I therefore think it would be wise
  to prevent the creation of a reader (or writer) with an implicit subscriber
  (or publisher) when the participant has autoenable=false, and return
  DDS_RETCODE_PRECONDITION_NOT_MET in that case. Do you agree?
- Currently, I don't use locking when recursively enabling children. Please check.
- I am not sure, but I have the impression that some security tests take much time.
  Not sure if it is caused by this pull request.
@eboasson
Copy link
Contributor

eboasson commented Feb 8, 2021

@TheFixer the set of conflicts appeared small enough to make it worthwhile rebasing it before doing the review (the advantage is that I have keep track of fewer versions in my head), and so I did just that. I've made my rebased version available at https://github.com/eboasson/cyclonedds/tree/thefixer-entity-enable. The tests still pass locally.

A quick look at the changes I made suggests that it won't affect the review in a major way, so I can give review comments on the current version of this PR without complicating things. However, I think it would make the most sense to update the PR first, provided you agree with the way I resolved the conflicts in the rebased version.

@TheFixer
Copy link
Contributor Author

TheFixer commented Feb 9, 2021

Thanks @eboasson for taking the effort. I'll have a look at the rebase but don't expect any issues. I let you know when the updated pull request is available.

@TheFixer
Copy link
Contributor Author

@eboasson I inspected the changes caused by the rebase and did not see any problems. The PR is now updated.

@eboasson
Copy link
Contributor

@TheFixer thanks, that makes doing the review a bit easier. (And leaves me no excuse for procrastination!)

Copy link
Contributor

@eboasson eboasson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheFixer as was to be expected, the review results in some comments. Some are really minor matters, some are more substantial.

What I haven't checked is whether indeed all functions that should fail if the entity they are invoked isn't enabled yet correctly return that error. I'll do that in a second round, I suppose.

Regarding the tests, I cannot help but wonder whether it would not be better to collect all "enable" tests in a single file and structure them a bit differently because a large overlap in the set of checks for the different entity kinds, and a bit of generic code along with some small functions to construct entities would likely result in significantly less code covering exactly same the ground.

Finally, I think that especially for the reader/writer tests it would make sense to also check that the built-in topics come out as they should (generated at the correct time, correct instance handles, stuff like that) and that remote nodes indeed observe the discovery events as a consequence of enabling the entity. (The latter is pretty obvious looking at the code, but the former not so much.)

* The dds_enable() operation will return DDS_RETCODE_OK if an entity and all its
* autoenabled children have been enabled successfully. In case one or more of
* its autoenabled children could not be enabled, an error code is returned.
* It is the responsibility of the application to find out which of the children
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of this model — I'd rather see no change at all when the operation fails — but considering that it is a relatively rarely used operation and that I imagine failure will usually be treated as a grounds for program termination, I think this is fine for a first iteration.

@@ -202,6 +202,23 @@ static struct ddsi_tkmap_instance *dds__builtin_get_tkmap_entry (const struct dd
return tk;
}

static struct ddsi_tkmap_instance *dds__builtin_get_tkmap_entry_with_iid (const struct ddsi_guid *guid, const uint64_t iid, void *vdomain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function serves a rather unique purpose: new local entities now need to have an iid before being enabled, and in the way it is implemented in this PR before the GUID is assigned. In all other cases the iid gets assigned on mapping the key to an instance id, and for entities that GUID is the key.

Technically, the possibility exists that a GUID gets reused, there are after all but 2^24 entity ids (in Cyclone, arguably different entity kinds could use the same entity id). So a process that creates 2^24 entities (but for some rounding errors), deletes one of them (so there is again an entity id available), and then creates another entity will necessarily reuse the entity id of the entity that was just deleted. However the GUID-to-iid mapping may outlive the entity that was deleted.

Consequently, in the design in this PR, come the call dds__builtin_get_tkmap_entry_with_iid on the creation of the second entity with GUID guid, if the old GUID-to-iid mapping for that GUID still exists, you end up in a state where the mapping retains the old iid, whereas the entity object has the new iid. That'll lead to an observable difference (somewhere, I can't be bothered to figure out all the details). I am sure you'll agree this is an edge case, but edge cases are usually more interesting than regular ones 🙂.

Methinks the solution lies in not introducing this new path, but rather moving the GUID allocation (e.g., in new_reader) up in the stack (e.g., to dds_create_reader, or perhaps even to dds_entity_init) so that the GUID-to-iid mapping can always follow the same path and the potential discrepancy avoided.

(This has various effects throughout the PR, I'm not going to mark them, that's not worth the bother.)

e->m_flags |= DDS_ENTITY_ENABLED;
DDS_CERROR (&e->m_domain->gv.logconfig, "Delayed entity enabling is not supported\n");
}
rc = dds_entity_deriver_enable(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does make some sense (enabling is definitely specific to the entity kind), but it doesn't actually dispatch to an implementation for all entity kinds (e.g., what about topics? — and if those don't need any special handling, dds_entity_deriver_dummy_enable simply returns "illegal operation", so is a rather incomplete operation).

/* enable all disabled children with the flag DDS_ENTITY_ENABLE_ON_PARENT */
for (c = ddsrt_avl_iter_first (&dds_entity_children_td, &e->m_children, &it); c != NULL; c = ddsrt_avl_iter_next (&it)) {
if ((!(c->m_flags & DDS_ENTITY_ENABLED)) && (c->m_flags & DDS_ENTITY_ENABLE_ON_PARENT)) {
ret = dds_enable((dds_entity_t)c->m_hdllink.hdl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invoked by the entity kind-specific function on e while e's entity lock is held, but that means it locks the children's entity locks while holding the parent's entity lock, which breaks the rules. Perhaps it would have been better to allow locking a child entity while holding the parent's lock, rather than the other way around, but today you can only easily walk "upward" in the hierarchy.

So here is a risk of deadlock, however small (there are rather few operations that would actually try to lock a disabled entity, but certainly calling dds_enable on a publisher and on its writers in parallel is one way.)

The trick to do such a walk over entities typically involves unlocking & relocking the parent entity. See, for example, pushdown_pubsub_qos.

e->m_flags |= DDS_ENTITY_ENABLE_ON_PARENT;
/* do not enable if parent is not enabled or no autoenable */
if ((!(parent->m_flags & DDS_ENTITY_ENABLED)) || ((e->m_flags & DDS_ENTITY_ENABLE_ON_PARENT) == 0))
e->m_flags &= ~DDS_ENTITY_ENABLED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rewriting the logic to set ENABLED once if and only if it really is enabled is much clearer than setting it and clearing it if turns out that it shouldn't be set.

thread_state_asleep (lookup_thread_state ());
#endif
(void) dds_delete (reader);
goto del_implicit_sub;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against goto for error handling, but this moves just a little too far in the direction of "unstructured" programming to my taste!

(Also for the writer.)

* returns PRECONDITION_NOT_MET */
if ((pub->m_entity.m_parent != NULL) && !dds_entity_is_enabled(pub->m_entity.m_parent))
return DDS_RETCODE_PRECONDITION_NOT_MET;
pub->m_entity.m_flags |= DDS_ENTITY_ENABLED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests, setting the ENABLED flag and propagating to children, are a fundamental part of enabling entities. So in my view, these particular parts of enabling an entity should be part of the generic dds_enable, and not be part of the entity kind-specific functions.

/* FIXME: not freeing WHC here because it is owned by the DDSI entity */

/* not freeing WHC here because it is owned by the DDSI entity */
#if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any value in leaving this #if 0/#endif block in.

@@ -501,6 +501,9 @@ void rd_wr_init_fail(
goto fail;
dds_set_status_mask (*rd, DDS_SUBSCRIPTION_MATCHED_STATUS);
}
dds_delete_qos (qos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three lines don't add anything of value, it means the same code is executed under all circumstances. I prefer the simpler variant.

@@ -448,7 +448,7 @@ CU_TheoryDataPoints(ddssec_secure_communication, multiple_readers) = {
CU_DataPoints(size_t, 1, 3, 1, 3), /* number of participants per domain */
CU_DataPoints(size_t, 3, 1, 3, 3), /* number of readers per participant */
};
CU_Theory((size_t n_dom, size_t n_pp, size_t n_rd), ddssec_secure_communication, multiple_readers, .timeout = 90, .disabled = false)
CU_Theory((size_t n_dom, size_t n_pp, size_t n_rd), ddssec_secure_communication, multiple_readers, .timeout = 120, .disabled = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why implementing dds_enable would affect the duration of this test. I'd prefer not to increase the timeout if there is no need, but I'm open to the possibility that 90s is just a bit short.

@TheFixer
Copy link
Contributor Author

TheFixer commented Feb 19, 2021

@eboasson Thanks for the review. I'll rework the PR and let you know when ready.

@poetinger poetinger added the needs-triage Contributer attention required label Jun 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage Contributer attention required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants