Skip to content
Sergey Bronnikov edited this page Oct 27, 2023 · 11 revisions

Debugging Tarantool

Status: Work-in-progress

General Recommendations

Try to reproduce a problem on the latest version. Probably it is already fixed.

Debugging environment

Sometimes Tarantool has a problem that triggered in CI runner and cannot be reproduced locally. There is a helpful Github Action action-tmate (doc) that helps to debug your GitHub Actions via SSH by using tmate.

By default a failed step will cause all following steps to be skipped. You can specify that the tmate session only starts if a previous step failed.

  - name: Setup tmate session
    if: ${{ failure() }}
    uses: mxschmitt/action-tmate@v3

Reproduced locally? If so, nice. If not, whether it reproduces on a given OS or distro? If so, good. strace and gdb requires ptrace capacity of a Docker container: use docker run --privileged -it blah:blah (or --cap-add SYS_PTRACE). If you setup everything already in a container, but forgot --privileged, stop it, add ptrace capacity into json, start again.

grep -o CapAdd /var/lib/docker/containers/6b842b1290f5ca4e7b6a46bc6b985a93ab52f70d68804c75cc5c180e113126f8/hostconfig.json`
CapAdd:["SYS_PTRACE"]

Using Git bisection

If you suspect that the problem is a regression, let's check and, if so, just bisect it. It is semi-automatic action that may give much more information about the problem without any real investigation.

There are good examples, where git bisect allows to better understand the problem in a short time:

$ git clone --recurse-submodules git@github.com:tarantool/tarantool.git
$ cd tarantool

$ git bisect start
$ git clean -xffd; git submodule foreach --recursive 'git clean -xffd'
$ cmake . -DCMAKE_BUILD_TYPE=Debug -DENABLE_BACKTRACE=ON -DENABLE_DIST=ON -DENABLE_FEEDBACK_DAEMON=OFF -DENABLE_BUNDLED_LIBCURL=OFF && make -j
$ ./src/tarantool ../repro.lua # !! BUG !!
$ git bisect bad

$ git clean -xffd; git submodule foreach --recursive 'git clean -xffd'
$ git checkout <..commit hash..>
$ git submodule update --recursive --init
$ cmake . -DCMAKE_BUILD_TYPE=Debug -DENABLE_BACKTRACE=ON -DENABLE_DIST=ON -DENABLE_FEEDBACK_DAEMON=OFF -DENABLE_BUNDLED_LIBCURL=OFF && make -j
$ ./src/tarantool ../repro.lua # !! OK !!
$ git bisect good
w
<..repeat git clean, cmake && make, repro.lua..>

Using verbose mode

Use say_debug in C, io.stderr:write() or log-methods in Lua.

See https://www.tarantool.io/en/doc/latest/reference/reference_lua/log/

Enabling verbose mode in third party components:

Using debug build

A weird someone-breaks-a-memory problem may be caught early on some assertion in the Debug build. Our packages are built with -DCMAKE_BUILD_TYPE=RelWithDebInfo: it has debugging info, but has no assertions and built with -O2. Debug mode could be enabled with -DCMAKE_BUILD_TYPE=Debug. Use debug build for a module if the problem may be there.

Using Address Sanitizer/Valgrind

Build Tarantool with enabled Address Sanitizer (ENABLE_ASAN) or Valgrind (ENABLE_VALGRIND).

ASAN vs Valgrind: https://developers.redhat.com/blog/2021/05/05/memory-error-checking-in-c-and-c-comparing-sanitizers-and-valgrind and https://github.com/google/sanitizers/wiki/AddressSanitizerComparisonOfMemoryTools

Note, Valgrind cannot run executables that were built by Clang compiler. It was fixed in Valgrind 3.19, see https://bugs.kde.org/show_bug.cgi?id=452758.

Tarantool has a suppression list with known memory leaks, this list should be specified in environment variable on Tarantool start:

LSAN_OPTIONS=suppressions=${PWD}/asan/lsan.supp tarantool

Otherwise Addres Sanitizer will report about memory leaks to stdout.

==532476== Memcheck, a memory error detector
==532476== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==532476== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==532476== Command: /home/sergeyb/sources/MRG/tarantool/build/src/tarantool /home/sergeyb/sources/MRG/tarantool-ee/tarantool/test-run/lib/luatest/bin/luatest -c test/integration/general_test.lua
==532476== 
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x23
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x1b
==532476== Valgrind: debuginfo reader: ensure_valid failed:
==532476== Valgrind:   during call to ML_(img_get)
==532476== Valgrind:   request for range [299270124, +4) exceeds
==532476== Valgrind:   valid image size of 23591872 for image:
==532476== Valgrind:   "/home/sergeyb/sources/MRG/tarantool/build/src/tarantool"
==532476== 
==532476== Valgrind: debuginfo reader: Possibly corrupted debuginfo file.
==532476== Valgrind: I can't recover.  Giving up.  Sorry.
==532476==

Valgrind limitation: https://github.com/tarantool/tarantool/issues/3071

Fun fact: there is a built in Lua variable tarantoool.build.asan that controls max slice check, see commit 232c28f333ac75be26abb54dc308e2e5813b7acd:

commit 232c28f333ac75be26abb54dc308e2e5813b7acd
Author: Nikolay Shirokovskiy <nshirokovskiy@tarantool.org>
Date:   Tue Jul 11 17:36:37 2023 +0300

    fiber: turn off max slice check with ASAN

tarabrt.sh + gdb.sh

FIXME: tarabrt.sh / gdb.sh можно поменять на tt coredump уже, see https://www.tarantool.io/en/doc/latest/reference/tooling/tt_cli/coredump/

Give the following link to a customer: https://raw.githubusercontent.com/tarantool/tarantool/master/tools/tarabrt.sh Basically, usage is the following:

$ /path/to/tarabrt.sh -c /path/to/core -d /target/dir

The customer sends us a tarball with coredump file, executable, libraries.

Now use tools/gdb.sh from tarantool repository to run gdb with collected executable and libraries. Basically, usage is the following:

$ tar -xvf /path/to/archive.tar.gz
$ /path/to/gdb.sh

I have a 'screencast' of usage of a preliminary version of the script: see https://lists.tarantool.org/tarantool-patches/20210315163046.lobxpj7pgy7tvgiu@tkn_work_nb/. But most of the highlighted problems are already fixed.

Using Lua debugger

Tarantool написан на языке С, но пользователь может работать с ним с помощью Lua. Ранее отлаживать исходники на Lua с помощью стандартных средств консоли Tarantool было очень сложно. В этом релизе 2.11 добавили оболочку консольного отладчика, которая значительно упрощает отладку модулей Lua.

Отладчик Lua поддерживает все стандартные отладочные операции:

  • пошаговое выполнение;
  • проверку значений переменных в локальном или глобальном контексте;
  • обход кадров стека;
  • установку точек остановки и многое другое.
local dbg = require("luadebug")

dbg() -- Вызов оболочки отладчика

local a = 1

local function b()
    print("hello")

    local m = 1
    local b = 2
    local a = 1
    local c = m + b
    if c == 2 then
        b = 22
    end
end

if a == 1 then
    b()
    a = 2
    b()
end
$ tarantool dbg-example.lua 
luadebug: Loaded for 2.11.0-entrypoint-1031-gdc451913f
break via dbg() => debug-dbg.lua:3 in chunk at debug-dbg.lua:0
   1    local debugger = require 'luadebug'
   2    debugger()
   3 => local date = require 'datetime'
   4
   5    local T = date.new{hour = 3, tzoffset = '+0300'}
   6    print(T)
luadebug>

В дополнение к использованию отладчика через вызов require('luadebug')(), о котором говорилось выше, можно использовать менее инвазивный подход — новый параметр -d. Таким образом, вы можете отлаживать Lua-скрипты напрямую:

$ tarantool -d debug.lua
Tarantool debugger 2.11.0-entrypoint-1031-gdc451913f
type 'help' for interactive help
luadebug: Loaded for 2.11.0-entrypoint-1031-gdc451913f
break via debug.lua => debug.lua:1 in chunk at debug.lua:0
   1 => local date = require 'datetime'
   2
   3    local T = date.new{hour = 3, tzoffset = '+0300'}
   4    print(T)

Screencast: https://asciinema.org/a/560039

Note, Lua debugger has a limitations - there is no fiber support yet.

Using GDB/LLDB with extensions

Note, SMALL allocators poisons deallocated memory regions by certain symbols. It could be helpful on debugging session.

There are two debugger extensions: LuaJIT-specific luajit-gdb.py and Tarantool-specific tarantool-gdb.py. NOTE: В ближайшее время расширения luajit-gdb.py и luajit_lldb.py будут объединены.

$ cat ~/.gdbinit

# https://stackoverflow.com/a/3176802/1598057

set history save on
set history size unlimited
set history remove-duplicates 100
set history filename ~/.gdb_history

LuaJIT-specific

Load extension: source ~/tools/tarantool/luajit/luajit-gdb.py.

Tarantool-specific

Load extension: source tools/tarantool-gdb.py.

(gdb) source ~/sources/MRG/tarantool/tools/tarantool-gdb.py 
(gdb) info tt-fibers 
  Id     Target   Name                               Stack              Function
  114    TtFiber  "etcd-client read stream"          0x7f3202a80918     0x555e49eec670 <lua_fiber_run_f>
  113    TtFiber  "etcd-client write stream"         0x7f3202980a58     0x555e49eec670 <lua_fiber_run_f>
  111    TtFiber  "lua"                              0x7f3203380a78     0x555e49eec670 <lua_fiber_run_f>
  108    TtFiber  "lua"                              0x7f3203080a78     0x555e49eec670 <lua_fiber_run_f>
  105    TtFiber  "lua"                              0x7f3203980a78     0x555e49eec670 <lua_fiber_run_f>
  103    TtFiber  "luatest"                          0x7f3209380978     0x555e49ee8060 <run_script_f>
  101    TtFiber  "lua"                              0x7f3209180c68     0x555e49eec670 <lua_fiber_run_f>
* 1      TtFiber  "sched"                          * 0x7fffd0d4a118     0x0
(gdb) 
(gdb) tt-fiber 114
#1  __pthread_kill_internal (signo=6, threadid=139852930820864) at ./nptl/pthread_kill.c:78
78      in ./nptl/pthread_kill.c
(gdb) bt
#3  0x0000555e49f2012b in fiber_yield_impl (will_switch_back=true) at /home/sergeyb/sources/MRG/tarantool/src/lib/core/fiber.c:753
#4  0x0000555e49f1f5ce in fiber_yield () at /home/sergeyb/sources/MRG/tarantool/src/lib/core/fiber.c:760
#5  0x0000555e49f202af in fiber_yield_timeout (delay=15768000000) at /home/sergeyb/sources/MRG/tarantool/src/lib/core/fiber.c:804
#6  0x0000555e49f2f2ed in fiber_cond_wait_timeout (c=0x7f3209040250, timeout=15768000000)
<snipped>

Ищем файбер по fid (fiber id):

(gdb) p ((struct fiber *)((char *)cord_ptr->alive->next - 144))->fid
$98 = 125
(gdb) p ((struct fiber *)((char *)cord_ptr->alive->next->next - 144))->fid
$99 = 119

Начинать с одного ->next. Добавлять ->next до попадания.

Посмотреть гостевой стек:

(gdb) lj-stack ((struct fiber *)((char *)cord_ptr->alive->next->next - 144))->storage.lua.stack

Отсюда 144:

(gdb) p &cord_ptr->fiber->link
$94 = (rlist *) 0x55a29fb92488 <main_cord+1160>
(gdb) p cord_ptr->fiber
$95 = (fiber *) 0x55a29fb923f8 <main_cord+1016>
(gdb) p 0x55a29fb923f8-0x55a29fb92488
$96 = -144

Using network sniffing

Tarantool speaks with clients using IPROTO protocol based on MsgPack. See IPROTO description in Tarantool Documentation. Inspecting binary traffic is a boring task. There are two network sniffers that supports IPROTO decoding: Wireshark and Tcpdump. Wireshark supports custom dissectors written in Lua, tcpdump has a work-in-progress IPROTO support.

Wireshark

image

To enable the plugin, follow the steps below:

Clone the tarantool-dissector repository:

git clone https://github.com/tarantool/tarantool-dissector.git

Copy or symlink the plugin files into the Wireshark plugin directory:

mkdir -p ~/.local/lib/wireshark/plugins
cd ~/.local/lib/wireshark/plugins
ln -s /path/to/tarantool-dissector/MessagePack.lua ./
ln -s /path/to/tarantool-dissector/tarantool.dissector.lua ./

(For the location of the plugin directory on macOS and Windows, please refer to the Plugin folders chapter in the Wireshark documentation.)

Run the Wireshark GUI and ensure that the plugins are loaded:

Open Help > About Wireshark > Plugins.
Find MessagePack.lua and tarantool.dissector.lua in the list.

Now you can inspect incoming and outgoing Tarantool packets with user-friendly annotations.

Visit the project page for details: https://github.com/tarantool/tarantool-dissector.

Tcpdump

To use tcpdump with IPROTO support one need to build tcpdump with applied patch. See steps to build in https://github.com/the-tcpdump-group/tcpdump/blob/master/INSTALL.md.

Using profiling

  1. Попыриться в htop, увидеть, где bottleneck по CPU (если он есть).
  2. Еще есть 'l' вместо 'f' — дает file:lino вместо названий функций.
  3. Сам не пробовал, но есть рецепт от Сергоса:

Poor man's profiler

see https://www.tarantool.io/en/doc/latest/book/admin/server_introspection/#poor-mans-profilers

jit.p

LuaJIT has a builtin profiler. It generates simple textual summaries or source code annotations. It can be accessed with the -jp command line option or from Lua code by loading the underlying jit.p module.

Note: Sampling works for both interpreted and JIT-compiled code. The results for JIT-compiled code may sometimes be surprising. LuaJIT heavily optimizes and inlines Lua code — there's no simple one-to-one correspondence between source code lines and the sampled machine code.

  • require('jit.p').start('f', 'profile.txt')
  • require('jit.p').stop()

It is possible to gnerate flamegraph using bundled profiler:

  • require('jit.p').start('FG','output')
  • flamegraph.pl output > output.svg

flamegraph.pl is a script that can be obtained in repository https://github.com/brendangregg/FlameGraph.

See notes about jit.p in Tarantool Documentation.

Sysprof

Если в Lua не видно явно горячей функции, то тут возможны варианты: bottleneck вообще не там (см. пункт 6) или нагрузка размазана (см. пункт 7).

See https://habr.com/ru/companies/vk/articles/758128/

Memprof

See https://www.tarantool.io/en/doc/latest/reference/tooling/luajit_memprof/

perf

perf record -a -g and perf report. Тут можно увидеть, например, упираемся ли мы в декодинг tls на nginx, принимающем входящие соединения. Хотя процессными мерами (см. пункт 7) это тоже можно выяснить.

https://www.tarantool.io/en/doc/latest/book/admin/server_introspection/#perf

  1. Это общая рекомендация: отрезать крупные куски, затыкать их заглушками и смотреть, что изменилось (по RPS, по профилю — но не одновременно). Можно делать заглушки с сопоставимыми по размеру данными, можно намеренно с маленькими.

Тут клево оторвать вообще все, кроме сетевых походов, и посмотреть на недостижимый идеал. Потом отрезаниями кода понять, где сколько до идеала скрывается.

gperftools

see https://www.tarantool.io/en/doc/latest/book/admin/server_introspection/#gperftools

TODO

  • require('fiber').top_enable()
  • Большие readahead.
  • Много таплов в runtime arena.

Ещё можно в ванильном луа включить трейс. Не нашел этого в доке, обошелся этим https://www.lua.org/pil/23.2.html

LuaJIT

Is LuaJIT-specific?

  • попробовать выполнить скрипт в PUC Rio Lua

Is LuaJIT the Problem?

Is it reproduced on the latest version?

Component localisation

Попробовать выключить разные компоненты и выяснить их влияние:

Debug build

Reducing test case

амальгамация Lua скрипта

минимизация тестового примера

  • избавиться от использования внешних модулей в Lua скрипте
  • минимизировать руками или автоматически

Create a script interestingness.sh:

#!/bin/sh

<fullpath>/src/luajit crash.lua 2>&1 | tee output.txt
grep "LuaJIT ASSERT lj_asm.c:147: checkmclim: red zone overflow:" output.txt

Execute creduce:

$ creduce interestingness.sh crash.lua

creduce is good enough in reducing Lua snippets. However, one could replace creduce with another one reducer. Consider these - alternatives: cvise, treereduce and jubnzv/dd. Note, testcases produced by fuzzing tests could be reduced automatically by test itself, pass -minimize_crash=1 option and wait.

For other alternatives, see https://github.com/ligurio/sqa-wiki/wiki/testcase-minimization.

formatting

  • применить форматирование к crash.lua

TODO

References

Clone this wiki locally