Replication fails #3726

Closed
evgenykuzyakov opened this issue Sep 17, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@evgenykuzyakov

Describe the bug
We run a master node and a few replica nodes at version df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c.
Replica sync doesn't finish; the master reports the following error: Replication error: Operation canceled: Context cancelled.

Tonight the sync stopped on all replicas at the same time. The replicas hold 221823184 keys in total, and uptime was about 8 weeks. The master node continued to work and serve read/write requests.

Master logs:

I20240917 18:07:17.605152    15 server_family.cc:1048] Load finished, num keys read: 222160007
I20240917 18:08:29.967986    26 dflycmd.cc:667] Registered replica *IP:PORT*
I20240917 18:08:30.166843    26 dflycmd.cc:349] Started sync with replica *IP:PORT*
I20240917 18:10:09.608582    25 dflycmd.cc:110] Disconnecting from replica *IP:PORT*
I20240917 18:10:09.608635    25 dflycmd.cc:657] Replication error: Operation canceled: Context cancelled
...

Replica(s) logs:

I20240917 18:08:29.952622    14 protocol_client.cc:182] Resolved endpoint HOST:PORT to IP:PORT
I20240917 18:08:30.171046    14 replica.cc:566] Started full sync with HOST:PORT
W20240917 18:10:10.876755    14 replica.cc:243] Error syncing with HOST:PORT generic:104 Connection reset by peer
...

Master info:

# Server
redis_version:6.2.11
dragonfly_version:df-v1.22.2
redis_mode:standalone
arch_bits:64
os:Linux 5.15.0-100-generic x86_64
multiplexing_api:epoll
tcp_port:3007
thread_count:16
uptime_in_seconds:1435
uptime_in_days:0

# Clients
connected_clients:122
max_clients:64000
client_read_buffer_bytes:31232
blocked_clients:1
pipeline_queue_length:0

# Memory
used_memory:52858339344
used_memory_human:49.23GiB
used_memory_peak:52858339344
used_memory_peak_human:49.23GiB
fibers_stack_vms:2783920
fibers_count:85
used_memory_rss:54798372864
used_memory_rss_human:51.03GiB
used_memory_peak_rss:70501720064
maxmemory:118111600640
maxmemory_human:110.00GiB
used_memory_lua:0
object_used_memory:36343856408
type_used_memory_string:9192048392
type_used_memory_zset:12129904
type_used_memory_hash:27139678112
table_used_memory:16272298496
num_buckets:30229920
num_entries:222160007
inline_keys:14090521
listpack_blobs:0
listpack_bytes:0
small_string_bytes:9192048392
pipeline_cache_bytes:0
dispatch_queue_bytes:0
dispatch_queue_subscriber_bytes:0
dispatch_queue_peak_bytes:0
client_read_buffer_peak_bytes:35328
tls_bytes:5664
snapshot_serialization_bytes:0
cache_mode:store
maxmemory_policy:noeviction

# Stats
total_connections_received:215
total_commands_processed:360
instantaneous_ops_per_sec:0
total_pipelined_commands:62
total_pipelined_squashed_commands:0
pipelined_latency_usec:47598
total_net_input_bytes:15065
connection_migrations:0
total_net_output_bytes:40409
rdb_save_usec:11154447278
rdb_save_count:18680815
instantaneous_input_kbps:-1
instantaneous_output_kbps:-1
rejected_connections:-1
expired_keys:0
evicted_keys:0
hard_evictions:0
garbage_checked:0
garbage_collected:0
bump_ups:0
stash_unloaded:0
oom_rejections:0
traverse_ttl_sec:0
delete_ttl_sec:0
keyspace_hits:0
keyspace_misses:0
keyspace_mutations:222160008
total_reads_processed:250
total_writes_processed:3179
defrag_attempt_total:0
defrag_realloc_total:0
defrag_task_invocation_total:0
reply_count:3179
reply_latency_usec:0
blocked_on_interpreter:0
lua_interpreter_cnt:0
lua_blocked:0

# Replication
role:master
connected_slaves:0
master_replid:5acda8918168aeb4780d45d08fb97825d9ac88ac

# Modules
module:name=ReJSON,ver=20000,api=1,filters=0,usedby=[search],using=[],options=[handle-io-errors]
module:name=search,ver=20000,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]

# Keyspace
db0:keys=222160007,expires=0,avg_ttl=-1

# Cpu
used_cpu_sys:421.802633
used_cpu_user:1571.433115
used_cpu_sys_children:0.870
used_cpu_user_children:0.1163
used_cpu_sys_main_thread:26.262023
used_cpu_user_main_thread:78.715462

# Cluster
cluster_enabled:0

To Reproduce
Steps to reproduce the behavior:

  1. Start the master node
  2. Start the Dragonfly container on a replica node via docker compose and point it at the master (a compose sketch follows these steps)
  3. See the error
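
For reference, a minimal sketch of the replica side of this setup (the image tag, hostnames, ports, and flags are placeholders for illustration, not my exact compose file):

# docker-compose.yml on a replica host (illustrative sketch)
services:
  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.22.2
    ports:
      - "3007:3007"
    command:
      - "--port=3007"
      # start as a replica of the master; issuing REPLICAOF <host> <port>
      # via redis-cli after startup works as well
      - "--replicaof=MASTER_HOST:3007"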

Expected behavior
Sync completes

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Kernel: Linux dragon-1 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: docker compose
  • Dragonfly Version: df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c (previously df-v1.21)

Additional context
The master node is hosted behind nginx with the stream module enabled.
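
For completeness, the nginx stream proxy in front of the master looks roughly like this (hostnames, ports, certificate paths, and TLS termination at nginx are assumptions for illustration, not my exact config):

# nginx.conf: stream proxy in front of the Dragonfly master (illustrative)
stream {
    upstream dragonfly_master {
        server 127.0.0.1:3007;
    }

    server {
        listen 3007 ssl;
        ssl_certificate     /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;
        # replication connections are long-lived; the default proxy_timeout is 10m
        proxy_timeout 1h;
        proxy_pass dragonfly_master;
    }
}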

The data in the DB is not private, so I may be able to provide a snapshot, but it is around 50 GB.

@evgenykuzyakov added the bug (Something isn't working) label on Sep 17, 2024
@chakaz
Collaborator

chakaz commented Sep 18, 2024

Hi @evgenykuzyakov
Is the error reproducible, or did it happen just once?
Could there have been some network issue (or nginx issue) at the time of the connection errors?

@evgenykuzyakov
Author

Is the error reproducible, or did it happen just once?

It's reproducible every time from every replica.

I'll try replicating without nginx to see whether the error still happens.

@evgenykuzyakov
Author

I tried replicating directly (without nginx) over TLS and got the following logs on the replica:

dragonfly-1  | I20240919 16:18:03.284178     1 init.cc:78] dragonfly running in opt mode.
dragonfly-1  | I20240919 16:18:03.284281     1 dfly_main.cc:640] Starting dragonfly df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c
dragonfly-1  | * Logs will be written to the first available of the following paths:
dragonfly-1  | /tmp/dragonfly.*
dragonfly-1  | ./dragonfly.*
dragonfly-1  | * For the available flags type dragonfly [--help | --helpfull]
dragonfly-1  | * Documentation can be found at: https://www.dragonflydb.io/docs
dragonfly-1  | I20240919 16:18:03.284484     1 dfly_main.cc:703] Max memory limit is: 110.00GiB
dragonfly-1  | W20240919 16:18:03.284507     1 dfly_main.cc:367] Weird error 1 switching to epoll
dragonfly-1  | I20240919 16:18:03.362095     1 proactor_pool.cc:147] Running 12 io threads
dragonfly-1  | I20240919 16:18:03.365094     1 server_family.cc:783] Host OS: Linux 5.15.0-117-generic x86_64 with 12 threads
dragonfly-1  | I20240919 16:18:03.366552    14 server_family.cc:2598] Replicating *IP:HOST*
dragonfly-1  | I20240919 16:18:03.372794    12 listener_interface.cc:101] sock[27] AcceptServer - listening on port 3007
dragonfly-1  | I20240919 16:18:03.887799    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | W20240919 16:18:25.319118    14 replica.cc:243] Error syncing with *IP:HOST* system:103 Software caused connection abort
dragonfly-1  | E20240919 16:18:28.325552    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:28.325608    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | E20240919 16:18:28.325634    11 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:7WqEQjeUTEDarpAEGt3tP8xZMmZ1rnaHMfAzuVJTxiXX' in DB 0
dragonfly-1  | E20240919 16:18:28.327425    15 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:5varyXrCnPxQdGe6SoErjMpnS64SgE8PsG41MpCwLBrd' in DB 0
dragonfly-1  | E20240919 16:18:28.327435    11 rdb_load.cc:2485] Could not load value for key 'ft:jp6isgc0wqve.users.kaiching' in DB 0
dragonfly-1  | E20240919 16:18:28.327427    19 rdb_load.cc:2485] Could not load value for key 'ft:7407068177.tg' in DB 0
dragonfly-1  | E20240919 16:18:28.328070    14 protocol_client.cc:291] Socket error generic:103
dragonfly-1  | W20240919 16:18:28.328837    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:40.986608    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:41.130453    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:41.130522    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | W20240919 16:18:41.135857    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:51.785612    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:52.069803    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:52.069862    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | W20240919 16:18:52.071615    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:57.775456    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:57.917601    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:57.917649    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | E20240919 16:18:57.918133    19 rdb_load.cc:2485] Could not load value for key 'ft:n38nmj8mfpa6.users.kaiching' in DB 0
dragonfly-1  | W20240919 16:18:57.921504    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:19:03.427546    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:19:03.555168    11 rdb_load.cc:716] Ziplist integrity check failed.
...

The master logs the same error as before: Replication error: Operation canceled: Context cancelled

For context: ft_updates is populated and consumed using RPUSH and BLMOVE. The pk:* and ft:* keys are hashes.
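
Roughly how those keys are written and read (command shapes only; the field names, values, and the destination list below are made up for illustration):

# ft_updates is a list used as a queue
redis-cli -p 3007 RPUSH ft_updates '{"update":"..."}'
redis-cli -p 3007 BLMOVE ft_updates ft_updates_processing LEFT RIGHT 0

# pk:* and ft:* keys are hashes
redis-cli -p 3007 HSET pk:ed25519:EXAMPLEKEY account_id example.near
redis-cli -p 3007 HGETALL ft:EXAMPLE.tg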

@romange
Collaborator

romange commented Sep 19, 2024

@evgenykuzyakov Please ping me directly on Discord.

@romange
Collaborator

romange commented Dec 9, 2024

This was due to a short --replication_timeout value. The workaround is to increase the timeout to 200000 or higher.
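
For example (an illustrative command line; with docker compose the flag goes under the service's command: list):

dragonfly --port=3007 --replication_timeout=200000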

@romange closed this as completed on Dec 9, 2024