Warm reboot: restore the database docker with content saved #2216
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
if [[ "$REBOOT_TYPE" == "warm" && -d /host/warmboot ]]; then
    WARM_DIR=/host/warmboot
    function redisLoadAndDelete()
    {
This function needs to also take database ID as a parameter #Resolved
function redisLoadAndDelete()
{
    FILENAME="$1"
    test -e $FILENAME && redis-load -s /var/run/redis/redis.sock -e EMPTY $FILENAME && rm $FILENAME
A few issues from testing:
- rm always fails in this function. You need to issue "sudo rm" to get it to work.
- "-s /var/run/redis/redis.sock" always causes the import to fail. Removing this option works better.
- The import fails randomly. I am still looking for a way to make it work reliably. This service is crucial, so it has to be reliable.
- I think you shouldn't use the '&&' notation. We want to remove these files regardless of whether the import succeeded, right? I don't think we should retry warm boot if any failure was encountered. #Resolved
Just a thought:
Maybe we should catch these db restore failures and in case of failure, clear the database and continue with a regular boot up? #Resolved
- rm fixed.
- redis-load fixed. If there are any more failure cases, let me know.
- I cannot agree to making it retry blindly. I made it exit immediately; if an error occurs in the normal case, we should fix it.
> I made it exit immediately; if an error occurs in the normal case, we should fix it.
My concern is that if the database service fails in production, the device will be in a failed state while the ASIC is still forwarding. I am not sure this is better than coming up with a cold start and suffering a short I/O disruption.
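The fallback raised in this thread (clear the database and continue with a regular boot when a restore fails) could look roughly like the sketch below. It is not the behaviour of this PR, which exits on failure. `restoreOrFallBack` and the loader and flush commands passed to it are hypothetical names; in a real script they would wrap the actual import and flush commands.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the reviewer's fallback idea; NOT the code in this PR.
# $1: command that imports one dump file (assumed to return non-zero on failure)
# $2: command that clears the database (assumption)
# remaining arguments: dump files to restore, in order
restoreOrFallBack() {
    local load_cmd="$1" flush_cmd="$2"
    shift 2
    local file
    for file in "$@"; do
        if ! "$load_cmd" "$file"; then
            # A partial restore is worse than none: wipe and report a cold start.
            "$flush_cmd"
            rm -f -- "$@"
            echo "cold"
            return 0
        fi
    done
    # Every dump was imported; remove them so they are not replayed later.
    rm -f -- "$@"
    echo "warm"
}
```

The function prints the boot type the caller should continue with, so the rest of the start-up script can branch on `warm` or `cold` instead of failing the service.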
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
84b2815 to f6c7a64
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
@qiluo-msft, can you provide a description for your commit? #Resolved
    echo $1 | python -c "import sys, json, os; mnts = [x for x in json.load(sys.stdin)[0]['Mounts'] if x['Destination'] == '/usr/share/sonic/hwsku']; print '' if len(mnts) == 0 else os.path.basename(mnts[0]['Source'])" 2>/dev/null
}

function getRebootType()
getBootType #Resolved
}

function postStartAction()
{
    REBOOT_TYPE=`getRebootType`
BOOT_TYPE #Resolved
    $SUDO rm $FILENAME || exit 12
}
# Load applDB from /host/warm-reboot/appl_db.json
redisLoadAndDelete $WARM_DIR/appl_db.json
where is the DB argument? #Resolved
Fixed as commented.
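The database-ID argument requested in this thread can be sketched as follows. This is an illustration, not the code from the PR: `LOADER` is a hypothetical stand-in for the real import command, the function uses `return` rather than `exit` so it can be exercised on its own, and status codes 10 and 11 are assumptions (only the `12` for a failed `rm` appears in the diff).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; LOADER stands in for the real import command and is
# called as: LOADER <db_id> <file>. The default "echo" only prints its arguments.
LOADER="${LOADER:-echo}"

redisLoadAndDelete() {
    local db_id="$1" filename="$2"
    # 10: dump file is missing (assumed code)
    [[ -e "$filename" ]] || return 10
    # 11: import into the given database failed (assumed code)
    "$LOADER" "$db_id" "$filename" || return 11
    # 12: dump file could not be removed (code taken from the diff)
    rm -f -- "$filename" || return 12
}
```

With this signature the call sites read as `redisLoadAndDelete 6 $WARM_DIR/state_db.json`, matching the later revision of the diff.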
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
# Load stateDB from /host/warm-reboot/state_db.json
redisLoadAndDelete 6 $WARM_DIR/state_db.json
# Load asicDB from /host/warm-reboot/asic_db.json
redisLoadAndDelete 1 $WARM_DIR/asic_db.json
Another thing came to mind: I think we should check that all files exist before proceeding with the restoration. If any file is missing, something is wrong. We should restore all or nothing. Do you agree?
The current implementation treats this case as a service start failure. Later we can refine it with a more robust recovery.
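The all-or-nothing check proposed in this thread could be done up front, before any dump is imported. The sketch below is one possible shape, not part of the PR; `allDumpsPresent` is a hypothetical helper name, and the file names are the ones that appear in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every named dump exists in the directory.
# $1: directory holding the dumps; remaining arguments: expected file names.
allDumpsPresent() {
    local dir="$1"
    shift
    local name
    for name in "$@"; do
        # Stop at the first missing file so nothing is restored partially.
        [[ -f "$dir/$name" ]] || return 1
    done
    return 0
}
```

A caller could guard the restore with `allDumpsPresent "$WARM_DIR" appl_db.json state_db.json asic_db.json` and treat a non-zero status the same way as any other start failure.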
This submodule update brings in the following changes:

```
50d5be2 Make changes to support compiling on Bullseye with GCC 10 (sonic-net#2216)
0870cf5 [mirrororch]: Implement HW resources availability validation for SPAN/ERSPAN (sonic-net#2187)
f4ec565 [vlanmgrd] fix use-after-free memory issue (sonic-net#2211)
c2de7fc [QosOrch] The notifications cannot be drained in QosOrch in case the first one needs to retry (sonic-net#2206)
5575935 [neighsyncd] increase neighsyncd timeout (sonic-net#2209)
0f06910 [PBH] Implement Edit Flows (sonic-net#2169)
6241bbf Remove redundant and problematic code to skip "pool" field in buffer profile handling (sonic-net#2197)
a55343c [azp]: Set diff coverage threshhold to 80% (sonic-net#2188)
390cae1 [portsorch]: Prevent LAG member configuration when port has active ACL binding (sonic-net#2165)
c1d47e6 [VNET]Fixing nexthop group delete during route change (sonic-net#2198)
8941cc0 [BFD]Registering BFD state change callback during session creation (sonic-net#2202)
680c539 [vxlan] Remove tunnel map objects on VNET tunnel removal (sonic-net#2150)
20dde0c Fix for handling broadcom DNX ASIC to have ipv4 and ipv6 ACL rules in separate tables. (sonic-net#2178)
5b7c949 [FdbOrch] SAI_FDB_EVENT_MOVE generates update with empty update.entry.port_name (sonic-net#2200)
7350d49 [Vxlanmgr] vnet netdev cleanup during config reload fix (sonic-net#2191)
2bef62b Validate LAG has members before mirror session create (sonic-net#2130)
1e4d4ce [VS test] Increase VS test time, skip dpb flaky test (sonic-net#2195)
6eda965 [vstest]Migrating vs tests from using click commands to direct DB access (sonic-net#2179)
```

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
50d5be2 (HEAD, origin/master, origin/HEAD) Make changes to support compiling on Bullseye with GCC 10 (sonic-net#2216)
0870cf5 [mirrororch]: Implement HW resources availability validation for SPAN/ERSPAN (sonic-net#2187)
f4ec565 [vlanmgrd] fix use-after-free memory issue (sonic-net#2211)
c2de7fc [QosOrch] The notifications cannot be drained in QosOrch in case the first one needs to retry (sonic-net#2206)
5575935 [neighsyncd] increase neighsyncd timeout (sonic-net#2209)
0f06910 (master) [PBH] Implement Edit Flows (sonic-net#2169)
6241bbf Remove redundant and problematic code to skip "pool" field in buffer profile handling (sonic-net#2197)
a55343c [azp]: Set diff coverage threshhold to 80% (sonic-net#2188)
390cae1 [portsorch]: Prevent LAG member configuration when port has active ACL binding (sonic-net#2165)
c1d47e6 [VNET]Fixing nexthop group delete during route change (sonic-net#2198)
8941cc0 [BFD]Registering BFD state change callback during session creation (sonic-net#2202)
680c539 [vxlan] Remove tunnel map objects on VNET tunnel removal (sonic-net#2150)
20dde0c Fix for handling broadcom DNX ASIC to have ipv4 and ipv6 ACL rules in separate tables. (sonic-net#2178)
5b7c949 [FdbOrch] SAI_FDB_EVENT_MOVE generates update with empty update.entry.port_name (sonic-net#2200)
7350d49 [Vxlanmgr] vnet netdev cleanup during config reload fix (sonic-net#2191)
2bef62b Validate LAG has members before mirror session create (sonic-net#2130)
1e4d4ce [VS test] Increase VS test time, skip dpb flaky test (sonic-net#2195)
6eda965 [vstest]Migrating vs tests from using click commands to direct DB access (sonic-net#2179)

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
In order to include the following commit:
0f06910 [PBH] Implement Edit Flows (sonic-net/sonic-swss#2169)

sonic-swss
50d5be2 Make changes to support compiling on Bullseye with GCC 10 (#2216)
0870cf5 [mirrororch]: Implement HW resources availability validation for SPAN/ERSPAN (#2187)
f4ec565 [vlanmgrd] fix use-after-free memory issue (#2211)
c2de7fc [QosOrch] The notifications cannot be drained in QosOrch in case the first one needs to retry (#2206)
5575935 [neighsyncd] increase neighsyncd timeout (#2209)
0f06910 [PBH] Implement Edit Flows (#2169)
6241bbf Remove redundant and problematic code to skip "pool" field in buffer profile handling (#2197)
a55343c [azp]: Set diff coverage threshhold to 80% (#2188)
390cae1 [portsorch]: Prevent LAG member configuration when port has active ACL binding (#2165)
c1d47e6 [VNET]Fixing nexthop group delete during route change (#2198)
8941cc0 [BFD]Registering BFD state change callback during session creation (#2202)
680c539 [vxlan] Remove tunnel map objects on VNET tunnel removal (#2150)
20dde0c Fix for handling broadcom DNX ASIC to have ipv4 and ipv6 ACL rules in separate tables. (#2178)
5b7c949 [FdbOrch] SAI_FDB_EVENT_MOVE generates update with empty update.entry.port_name (#2200)
7350d49 [Vxlanmgr] vnet netdev cleanup during config reload fix (#2191)
2bef62b Validate LAG has members before mirror session create (#2130)
1e4d4ce [VS test] Increase VS test time, skip dpb flaky test (#2195)
6eda965 [vstest]Migrating vs tests from using click commands to direct DB access (#2179)

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
…2216)

Types of changes done:
* Add missing includes in header files and .cpp files
* Don't use parentheses when doing list initialization in constructors
* Make sure variables are initialized before first use

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Related work items: #49, #58, #107, sonic-net#247, sonic-net#249, sonic-net#277, sonic-net#593, sonic-net#597, sonic-net#1035, sonic-net#2130, sonic-net#2150, sonic-net#2165, sonic-net#2169, sonic-net#2178, sonic-net#2179, sonic-net#2187, sonic-net#2188, sonic-net#2191, sonic-net#2195, sonic-net#2197, sonic-net#2198, sonic-net#2200, sonic-net#2202, sonic-net#2206, sonic-net#2209, sonic-net#2211, sonic-net#2216, sonic-net#7909, sonic-net#8927, sonic-net#9681, sonic-net#9733, sonic-net#9746, sonic-net#9850, sonic-net#9967, sonic-net#10104, sonic-net#10152, sonic-net#10168, sonic-net#10228, sonic-net#10266, sonic-net#10288, sonic-net#10294, sonic-net#10313, sonic-net#10394, sonic-net#10403, sonic-net#10404, sonic-net#10421, sonic-net#10431, sonic-net#10437, sonic-net#10445, sonic-net#10457, sonic-net#10458, sonic-net#10465, sonic-net#10467, sonic-net#10469, sonic-net#10470, sonic-net#10474, sonic-net#10477, sonic-net#10478, sonic-net#10482, sonic-net#10485, sonic-net#10488, sonic-net#10489, sonic-net#10492, sonic-net#10494, sonic-net#10498, sonic-net#10501, sonic-net#10509, sonic-net#10512, sonic-net#10514, sonic-net#10516, sonic-net#10517, sonic-net#10523, sonic-net#10525, sonic-net#10531, sonic-net#10532, sonic-net#10538, sonic-net#10555, sonic-net#10557, sonic-net#10559, sonic-net#10561, sonic-net#10565, sonic-net#10572, sonic-net#10574, sonic-net#10576, sonic-net#10578, sonic-net#10581, sonic-net#10585, sonic-net#10587, sonic-net#10599, sonic-net#10607, sonic-net#10611, sonic-net#10616, sonic-net#10618, sonic-net#10619, sonic-net#10623, sonic-net#10624, sonic-net#10633, sonic-net#10646, sonic-net#10655, sonic-net#10660, sonic-net#10664, sonic-net#10680, sonic-net#10683
Restore the database docker with the content saved during the 'warm-reboot' command. If anything fails, the database service fails immediately.