Race condition during stopping of server connection. #1175

Closed
aleksostapenko opened this issue Feb 1, 2019 · 2 comments
@aleksostapenko
Contributor

The crash occurred during long tests on the 9011 server (dump 201812220406):

[133672.417707] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417719] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417723] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417727] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417728] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417731] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417745] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417746] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133672.417749] [tempesta] Warning: request dropped: processing error, status 502: 192.168.0.2
[133676.171920] ------------[ cut here ]------------
[133676.176810] kernel BUG at /root/builds/tempesta/source/tempesta_fw/connection.h:449!
[133676.184846] invalid opcode: 0000 [#1] SMP PTI
[133676.189501] Modules linked in: tempesta_fw(O) tempesta_db(O) tempesta_tls(O) tempesta_lib(O) iptable_mangle xt_mark tcp_diag inet_diag fuse 8021q garp mrp stp mlx5_core llc mlx4_ib ib_core mlx4_en intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast irqbypass ttm crct10dif_pclmul crc32_pclmul drm_kms_helper iTCO_wdt ghash_clmulni_intel iTCO_vendor_support mei_me intel_cstate intel_uncore igb ipmi_si drm dca pcspkr joydev intel_rapl_perf i2c_algo_bit mei i2c_i801 intel_pch_thermal ipmi_devintf sg ie31200_edac ipmi_msghandler video button mlx4_core devlink ptp pps_core ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sd_mod evdev hid_generic usbhid hid crc32c_intel mpt3sas xhci_pci raid_class xhci_hcd scsi_transport_sas aesni_intel aes_x86_64 crypto_simd
[133676.262538]  scsi_mod cryptd usbcore glue_helper usb_common shpchp fan thermal [last unloaded: tempesta_lib]
[133676.272771] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G           O    4.14.0-tempesta-amd64 #1 Debian 4.14.32-tfw1-1~bpo9+1
[133676.284392] Hardware name: Supermicro Super Server/X11SSL-CF, BIOS 1.0a 01/29/2016
[133676.292384] task: ffff8ad2b3f78f80 task.stack: ffff9eecc3218000
[133676.292393] RIP: 0010:tfw_sock_srv_del_conns+0x117/0x130 [tempesta_fw]
[133676.292394] RSP: 0018:ffff9eecc321bd80 EFLAGS: 00010203
[133676.292395] RAX: 0000000000000001 RBX: ffff8ad1b72d0118 RCX: dead000000000200
[133676.292395] RDX: ffff8ad1b72d17d8 RSI: 00000000fffffe01 RDI: ffffffffc0bd2847
[133676.292396] RBP: ffff8ad1b72d00e0 R08: ffff8ad1b72d0b60 R09: 000000018024000e
[133676.292396] R10: ffff9eecc321bca0 R11: 0000000000000000 R12: ffff8ad1ade480c8
[133676.292397] R13: ffff8ad1b72d17a0 R14: ffff8ad1ade480b8 R15: ffffffffc102e858
[133676.292398] FS:  0000000000000000(0000) GS:ffff8ad2f7840000(0000) knlGS:0000000000000000
[133676.292398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[133676.292399] CR2: 000000000206bbe4 CR3: 00000001a880a003 CR4: 00000000003606e0
[133676.292400] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[133676.292400] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[133676.292400] Call Trace:
[133676.292408]  tfw_server_destroy+0x1a/0x70 [tempesta_fw]
[133676.292412]  ss_tx_action+0x438/0x550 [tempesta_fw]
[133676.292415]  ? mlx4_en_poll_rx_cq+0xc1/0x100 [mlx4_en]
[133676.292417]  ? net_rx_action+0x25d/0x3f0
[133676.292418]  net_tx_action+0x9e/0x1d0
[133676.292419]  __do_softirq+0x10f/0x2a8
[133676.292422]  run_ksoftirqd+0x1c/0x40
[133676.292424]  smpboot_thread_fn+0x10e/0x160
[133676.292425]  kthread+0xff/0x130
[133676.292426]  ? sort_range+0x20/0x20
[133676.292427]  ? kthread_create_on_node+0x70/0x70
[133676.292429]  ret_from_fork+0x35/0x40
[133676.292430] Code: 21 49 8b 85 88 00 00 00 48 85 c0 74 13 49 8b 4d 38 4c 89 ed 49 89 d5 48 39 cb 0f 85 34 ff ff ff 0f 0b 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 90 66 2e 0f 1f 84 00
[133676.292448] RIP: tfw_sock_srv_del_conns+0x117/0x130 [tempesta_fw] RSP: ffff9eecc321bd80
@aleksostapenko aleksostapenko added this to the 0.6 Tempesta TLS milestone Feb 1, 2019
@aleksostapenko aleksostapenko self-assigned this Feb 1, 2019
@aleksostapenko
Contributor Author

aleksostapenko commented Feb 1, 2019

The cause of the crash is the following scenario:
Thr.1: during tfw_sock_srv_disconnect() processing, executes del_timer_sync(&conn->timer);
Thr.2: in failover state; in tfw_sock_srv_connect_failover(), executes tfw_connection_put_to_death() and tfw_connection_put() (dropping the last reference), so conn->refcnt becomes equal to TFW_CONN_DEATHCNT;
Thr.2: reaches the connection's destructor (tfw_srv_conn_release()) and executes tfw_sock_srv_connect_try_later(), since the TFW_CONN_B_DEL bit is not set yet;
Thr.2: starts the timer with the connection callback;
Thr.1: sets the TFW_CONN_B_DEL bit for the connection and reaches tfw_srv_conn_release(), since conn->refcnt equals TFW_CONN_DEATHCNT at this moment;
Thr.2: reaches tfw_sock_srv_connect_try() and executes tfw_srv_conn_init_as_dead() and ss_connect(), establishing a new server connection and continuing to process responses/requests;
Thr.1: executes the connection's destructor.

As a result of the described sequence, tfw_server_put() -> tfw_server_destroy() -> tfw_sock_srv_del_conns() -> tfw_connection_validate_cleanup() can be called for a server connection whose conn->refcnt is nonzero and not equal to TFW_CONN_DEATHCNT.

Also, the above sequence (the last step) may be the cause of #1080.

@aleksostapenko
Contributor Author

There is another problem, also a race condition in tfw_sock_srv_disconnect():
Thr.1: during tfw_sock_srv_disconnect() processing, executes del_timer_sync(&conn->timer) and sets the TFW_CONN_B_DEL bit for the connection;
Thr.2: in failover state; in tfw_sock_srv_connect_failover(), executes tfw_connection_put_to_death() and tfw_connection_put() (dropping the last reference), so conn->refcnt becomes equal to TFW_CONN_DEATHCNT;
Thr.2: reaches the connection's destructor (tfw_srv_conn_release()) and executes tfw_server_put((TfwServer *)srv_conn->peer), since the TFW_CONN_B_DEL bit is already set;
Thr.1: reaches tfw_srv_conn_release(), since conn->refcnt equals TFW_CONN_DEATHCNT at this moment;
Thr.1: executes tfw_server_put((TfwServer *)srv_conn->peer) a second time for the same connection and the same server.

So the server's refcnt is decreased twice instead of once, and the server's destructor is called earlier than expected: in tfw_sock_srv_disconnect_srv() with peer->conn_lock held in tfw_peer_for_each_conn(), despite the additional tfw_server_get(srv) taken in tfw_sock_srv_grace_shutdown_srv(). As a result we can get a deadlock on the nested acquisition of peer->conn_lock: tfw_sock_srv_disconnect() -> tfw_srv_conn_release() -> tfw_server_put() -> tfw_server_destroy() -> tfw_sock_srv_del_conns() -> tfw_connection_unlink_from_peer() -> tfw_peer_del_conn() -> spin_lock_bh(&p->conn_lock).

aleksostapenko added a commit that referenced this issue Feb 10, 2019
aleksostapenko added a commit that referenced this issue Feb 12, 2019
1. Resolve implicit declaration;
2. Remove 'tfw_connection_close()' call from 'tfw_sock_srv_connect_try()';
this call looked obsolete: the socket's state is not established during this call,
so no useful actions are performed.
aleksostapenko added a commit that referenced this issue Feb 14, 2019
Fix #1175: Add synchronization of disconnect procedure for server con…