Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLD document on add/remove ports dynamically feature #900

Merged
merged 10 commits into from
Feb 24, 2022

Conversation

tomer-israel
Copy link
Contributor

@tomer-israel tomer-israel commented Nov 11, 2021

@tomer-israel
Copy link
Contributor Author

praveen-li comments are located here:
tomer-israel#1

@tomer-israel
Copy link
Contributor Author

The comments from tomer-israel#1:

doc/port-add-del-dynamically/dynamic_port_add_del_hld.md
3. Portsyncd is adding the new port info to App DB
4. On portsorch (orchagent) - Port set event is received from App DB.
5. Portsorch is creating the port on SAI.
6. SDK is creating the port and the host interfaces.

@praveen-li 20 days ago
We can add complete steps [not necessary though]...I mean b/w SDK and Netlink.

Owner
Author
@tomer-israel tomer-israel 9 days ago
i will add the complete steps

  • Del port: Receive del port operation from port config table, remove this port from APP DB.

Sflowmgr:

Add port: Event from config db - Update the speed to sflow internal db.

@praveen-li praveen-li 20 days ago
Did not get this part , we need to remove speed or polling rate?

Owner
Author
@tomer-israel tomer-israel 9 days ago
its a description of what sflowmgr is doing as a result of a change on config db ports table - it's not a change that we are planning to add.
I will try to change it in a way that it will be more clear.

sonic-net/sonic-buildimage#8422

sonic-net/sonic-platform-daemons#212

snampagent:

@praveen-li praveen-li 20 days ago
Say is SNMP data is fetcched via a remote client for this port and later on port is deleted, what will be MIB entries....STALE or REMOVED

Owner
Author
@tomer-israel tomer-israel 9 days ago
after removing a port the MIB will remove also. the fetched snmp data should be without this port.

We have also possible way for race condition:

possible buffermgr delete port race condition

@praveen-li praveen-li 20 days ago
I feel, there will be multiple such problems, can we change portsorch code to not remove "it entry in map" for buffer till port is added. Even after that we may have problems, If del port is followed by add Port immediately. We may need to validate the config in that case.

Owner
Author
@tomer-israel tomer-israel 9 days ago
"..can we change portsorch code to not remove "it entry in map" for buffer till port is added.." - what is it mean?

" If del port is followed by add Port immediately. We may need to validate the config in that case." - why del port and add port immediately is problematic? can you explain why do you think we need to validate from config?

Xcvrd, buffermgrd, natmgr, natsync – waiting for PortInitDone

## Init types:
The Dynamic port add/remove configuration will be supported for both types of init types:<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both -> all ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be changed


## Init with zero ports:
Starting with zero ports requires new SKU for zero ports with these changes:<br />
**Port_confg.ini** – without entries<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the port_config.ini deprecated and not needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be removed

**Port_confg.ini** – without entries<br />
**Hwsku.json** – without interfaces<br />
**Platform.json** – without interfaces<br />
**Sai xml** file – needs to be without port entries. <br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make it more generic name like sai profile? some platforms would have other file than xml file.
Also, do we even have intention to have sai profile without ports? It makes sense to have superset of the ports in profile and up to SONiC to decide what to expose like host interfaces, admin up on ports etc..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it to sai profile
on our sai profile we have all the ports configured, but probably in the future, I will consult with the team and it will be changed.

# Initialization stage


![Init stage](images/init_stage_diagram.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diagram is very hard to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Enlarged the image


#### Add port:

![Add port](images/add_port_diagram.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diagram is very hard to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Enlarged the image

9. Portsyncd will remove the port entry from state db


## Modules that “listen” to changes on config port table & App port table
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also stateDB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also state db, I will add it

**No need to change the code**

#### Teammgrd:
Listen to events from state db, when entry is added -> add the port to lag<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to complete the picture, mention the case when port is removed from config db?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be added

• If the portsyncd is “quicker” than the buffermgr the orchagent will try to remove the port from SAI before the buffer configuration was removed.<br />
• Need to test this scenario in order to check if this race condition is reproducing or it’s rare scenario<br />
• Solution for this: <br />
Need to add to orchagent the ability to add the buffer configuration of a port and increase a reference counter for each port, in the same way ACL cfg on port is working. We already have infrastructure for this just need to add the buffer cfg to use it. If a port has with buffer cfg on – this port will not be removed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have retry logic implemented before in case port was in-use, see: https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L3312 , so I think this condition should be handled unless SAI is not implemented right with correct return code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware of this retry implementation, but after trying the current implementation and the proposed one, i prefer the proposed one from these reasone:

in the current implementation we will receive endless orchagent/SAI error messages:

...
Nov 30 11:05:38.892675 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.895147 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.895147 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
Nov 30 11:05:38.895195 r-bulldog-04 WARNING swss#orchagent: :- doPortTask: Failed to remove port 1000000000170, as the object is in use
Nov 30 11:05:38.895230 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.897932 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.897932 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
Nov 30 11:05:38.897932 r-bulldog-04 WARNING swss#orchagent: :- doPortTask: Failed to remove port 1000000000170, as the object is in use
Nov 30 11:05:38.897932 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.900494 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.900494 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
...

In the suggested fix we don't have these errors, and the ref count warning message is printed only once

Also, I noticed that the SAI permits to remove the port even when there are few buffer_pg configurations, for example:
removing "BUFFER_PG|Ethernet8|0" will cause the port to be removed even if "BUFFER_PG|Ethernet8|3-4" wasn't removed yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit surprised that the SAI error messages were continuous with current implementation. By design it should retry periodically (e,g, every 1 second). and typical timing issue should be resolved after a few tries. The retry logic should be the same with the ref counter change.

If SAI does not track some of the dependencies, it sounds like a sai redis api bug to me. we should open a issue.

It might be case by case issue, but in general, we should let SAI redis layer to handle the dependencies as that is the final gate anyways. it would be easier this way to onboard any new features with DPB.

@lguohan Any comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand.
I wanted the buffer configuration to be the same as ACL/VLAN/INTERFACE configuration, which uses the ref counter for the dependencies, and before removing a port we check this ref counter

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this caveat into the doc itself, so we know this was the choice we made and why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

• When port is added to the config db – the speed and the admin state is saved on internal db<br />
• After port was added the user can add buffer configuration to this port (dynamic or static configuration) and only then the buffermgr will set the buffer configuration on App table<br />

• We have rare situation of race condition in the add port flow:<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when applying buffer cfg into port, we should always check if port is ready first, so this race condition should be handled. If this check is not done, we should do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no check if port exist, we can check it before trying to add buffer cfg, if it is not exist need to retry.
I will add it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any description added yet to the doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to the document




Suggested change:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify what problem you are trying to address here.

Copy link
Contributor Author

@tomer-israel tomer-israel Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a description for it

@liat-grozovik
Copy link
Collaborator

@zhenggen-xu could you please check recent update following your comments?

@zhenggen-xu
Copy link
Collaborator

Please resolve/mark the things you have addressed.

@@ -0,0 +1,301 @@
# Delete or remove ports dynamically
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest we change the HLD title to "dynamic port add and del enhancements" or "Enhancements to add or del ports dynamically", so it is clear that we are fixing a few cases that need to be addressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to "Enhancements to add or del ports dynamically"


• We have rare situation of race condition in the add port flow:<br />

![possible buffermgr race condition](images/buffermgr_possible_race.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enlarged the image


**Note:** This is a new type of init that was never tested and will be supported.<br />
The zero-port system is a special case of this feature. <br />
Few PRs were already added in order to support zero ports init.<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR links?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be added to document

**No need to change the code**

#### PMON - Xcvrd:
Listen to events on cfg port table and update transeiver information <br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have design doc for this? how other platforms can leverage this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the PR for this:
sonic-net/sonic-buildimage#8422

I will add it to the document

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking for the pmon - xcvrd design document, I still haven't seen it in the latest doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I'm not sure if we have a design document for it
@Junchao-Mellanox , do we have a design document for the xcvrd changes of the dynamic port configuration ?

• When port is added to the config db – the speed and the admin state is saved on internal db<br />
• After port was added the user can add buffer configuration to this port (dynamic or static configuration) and only then the buffermgr will set the buffer configuration on App table<br />

• We have rare situation of race condition in the add port flow:<br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any description added yet to the doc.

• If the portsyncd is “quicker” than the buffermgr the orchagent will try to remove the port from SAI before the buffer configuration was removed.<br />
• Need to test this scenario in order to check if this race condition is reproducing or it’s rare scenario<br />
• Solution for this: <br />
Need to add to orchagent the ability to add the buffer configuration of a port and increase a reference counter for each port, in the same way ACL cfg on port is working. We already have infrastructure for this just need to add the buffer cfg to use it. If a port has with buffer cfg on – this port will not be removed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit surprised that the SAI error messages were continuous with current implementation. By design it should retry periodically (e,g, every 1 second). and typical timing issue should be resolved after a few tries. The retry logic should be the same with the ref counter change.

If SAI does not track some of the dependencies, it sounds like a sai redis api bug to me. we should open a issue.

It might be case by case issue, but in general, we should let SAI redis layer to handle the dependencies as that is the final gate anyways. it would be easier this way to onboard any new features with DPB.

@lguohan Any comments?

add the reason why we chose the ref counter mechanism on Buffer configuration
@liat-grozovik
Copy link
Collaborator

liat-grozovik commented Jan 15, 2022

@zhenggen-xu kindly reminder. Any further comments or should we go ahead and merge?

@liat-grozovik
Copy link
Collaborator

@zhenggen-xu kindly reminder. Any further comments or should we go ahead and merge?
@lguohan how can we proceed with the HLD and the PRs? there are available for quite some time and we don't have major progress on the review/approval flow in the last 2-3weeks

@liat-grozovik
Copy link
Collaborator

@zhenggen-xu I assume if no further comments, the HLD PR can be merged. Let me know if you have any additional feedback and if not the HLD will be assumed as approved. Thanks.

@liat-grozovik
Copy link
Collaborator

as no additional feedback and seems like all the comments were addressed, the PR is merged.

@liat-grozovik liat-grozovik merged commit 1ffb3d4 into sonic-net:master Feb 24, 2022
liat-grozovik pushed a commit to sonic-net/sonic-swss that referenced this pull request Apr 7, 2022
… added or removed (#2019)

- What I did
Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are:

1. pg watermark counters
2. pg drop counters
3. queue stat counters
4. queue watermark counters
5. debug counters
6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed

Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900

- Why I did it
In order to support dynamically add or removed ports on sonic

- How I verified it
tested manually that the flex counters were added or removed correctly whenever we add or remove ports
added new test cases to the following vs tests:

test_flex_counters.py
test_drop_counters.py
test_pg_drop_counter.py

Co-authored-by: dprital <drorp@nvidia.com>
dprital added a commit to dprital/sonic-swss that referenced this pull request May 8, 2022
… added or removed (sonic-net#2019)

- What I did
Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are:

1. pg watermark counters
2. pg drop counters
3. queue stat counters
4. queue watermark counters
5. debug counters
6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed

Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900

- Why I did it
In order to support dynamically add or removed ports on sonic

- How I verified it
tested manually that the flex counters were added or removed correctly whenever we add or remove ports
added new test cases to the following vs tests:

test_flex_counters.py
test_drop_counters.py
test_pg_drop_counter.py

Co-authored-by: dprital <drorp@nvidia.com>
@zhangyanzhao
Copy link
Collaborator

sonic-net/sonic-swss#2194 need be added into the code PR list to track. @tomer-israel

preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request Aug 6, 2022
… added or removed (sonic-net#2019)

- What I did
Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are:

1. pg watermark counters
2. pg drop counters
3. queue stat counters
4. queue watermark counters
5. debug counters
6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed

Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900

- Why I did it
In order to support dynamically add or removed ports on sonic

- How I verified it
tested manually that the flex counters were added or removed correctly whenever we add or remove ports
added new test cases to the following vs tests:

test_flex_counters.py
test_drop_counters.py
test_pg_drop_counter.py

Co-authored-by: dprital <drorp@nvidia.com>
liat-grozovik pushed a commit to sonic-net/sonic-swss that referenced this pull request Aug 29, 2022
…ter (#2194)

- What I did
This PR replace PR #2022

Added increasing/decreasing to the port ref counter each time a port buffer configuration is added or removed
Implemented according to the - sonic-net/SONiC#900

- Why I did it
In order to avoid cases where a port is removed before the buffer configuration on this port were removed also

- How I verified it
VS Test was added in order to test it.
we remove a port with buffer configuration and the port is not removed. only after all buffer configurations on this port were
removed - this port will be removed.
yxieca pushed a commit to sonic-net/sonic-swss that referenced this pull request Sep 1, 2022
…ter (#2194)

- What I did
This PR replace PR #2022

Added increasing/decreasing to the port ref counter each time a port buffer configuration is added or removed
Implemented according to the - sonic-net/SONiC#900

- Why I did it
In order to avoid cases where a port is removed before the buffer configuration on this port were removed also

- How I verified it
VS Test was added in order to test it.
we remove a port with buffer configuration and the port is not removed. only after all buffer configurations on this port were
removed - this port will be removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants