Add last_error_time/last_error_message/last_error_stacktrace/remote columns for system.errors #21529

Merged
KochetovNicolai merged 12 commits into ClickHouse:master from system.errors-improvements
Mar 17, 2021

Conversation

azat
Collaborator

@azat azat commented Mar 8, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add last_error_time/last_error_message/last_error_stacktrace/remote columns for system.errors

Follow-up for: #16438

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Mar 8, 2021
@azat azat changed the title from "system.errors improvements" to "Add last_error_time/last_error_message/last_error_stacktrace/remote columns for system.errors" Mar 8, 2021
@KochetovNicolai KochetovNicolai self-assigned this Mar 9, 2021
@azat
Collaborator Author

azat commented Mar 10, 2021

Rebased to fix conflicts in tests/queries/skip_list.json (previous HEAD was c0a02c9ce9833362485638c24007318923eca09c)

@azat azat force-pushed the system.errors-improvements branch from c0a02c9 to 441efef Compare March 10, 2021 05:56
@azat
Collaborator Author

azat commented Mar 11, 2021

Testflows check — failed: 751, passed: 2612, other: 26

Broken in upstream too

Functional stateless tests (memory) — Timeout :(
Functional stateless tests (thread) — Timeout :(

Any clue? Links are under NDA.

Integration tests (asan) — fail: 3, passed: 1162, error: 0
Integration tests (thread) — fail: 20, passed: 445, error: 5

Some timeout issues.

@KochetovNicolai
Member

Any clue? Links are under NDA.

Those are timeout issues as well. I have increased the timeout for MSan.

Member

@KochetovNicolai KochetovNicolai left a comment

.

@azat azat force-pushed the system.errors-improvements branch from 9e3f865 to 9dee842 Compare March 16, 2021 19:31
@azat
Collaborator Author

azat commented Mar 17, 2021

Stress test (thread) — Fatal message in clickhouse-server.log

2021.03.16 23:51:22.493829 [ 295 ] {} Application: Child process was terminated by signal 9 (KILL). If it is not done by 'forcestop' command or manually, the possible cause is OOM Killer (see 'dmesg' and look at the '/var/log/kern.log' for the details).

Integration tests (thread) — fail: 26, passed: 488, error: 0

This does not look related, and it is not the first time it has failed.

@KochetovNicolai KochetovNicolai merged commit 4f1f344 into ClickHouse:master Mar 17, 2021
@azat azat deleted the system.errors-improvements branch March 17, 2021 19:19
@tavplubix
Member

Integration tests (thread) — fail: 26, passed: 488, error: 0

This does not look related, and it is not the first time it has failed.

It does look related: https://gist.github.com/tavplubix/012244641439238ef9e4bf21f9f57bb0
The test test_distributed_respect_user_timeouts used to be flaky before, but it seems to have become broken after #21529.

@KochetovNicolai
Member

test_distributed_respect_user_timeouts:

2021.03.17 04:20:40.405586 [ 30 ] {} <Error> bool DB::(anonymous namespace)::checkPermissionsImpl(): Code: 412, e.displayText() = DB::Exception: Can't receive Netlink response: error -2, Stack trace (when copying this message, always include the lines below):

0. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/exception:0: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x162ecb9b in /usr/bin/clickhouse
1. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:56: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8d3bf24 in /usr/bin/clickhouse
2. ./obj-x86_64-linux-gnu/../src/Common/TaskStatsInfoGetter.cpp:0: DB::(anonymous namespace)::query(int, unsigned short, unsigned int, char8_t, unsigned short, void const*, int) @ 0x8d8b37e in /usr/bin/clickhouse
3. ./obj-x86_64-linux-gnu/../src/Common/TaskStatsInfoGetter.cpp:181: DB::(anonymous namespace)::getFamilyIdImpl(int) @ 0x8d8b4f8 in /usr/bin/clickhouse
4. ./obj-x86_64-linux-gnu/../src/Common/TaskStatsInfoGetter.cpp:0: DB::TaskStatsInfoGetter::TaskStatsInfoGetter() @ 0x8d8ad5f in /usr/bin/clickhouse
5. ./obj-x86_64-linux-gnu/../src/Common/TaskStatsInfoGetter.cpp:293: DB::(anonymous namespace)::checkPermissionsImpl() @ 0x8d8ab1b in /usr/bin/clickhouse
6. ./obj-x86_64-linux-gnu/../src/Common/TaskStatsInfoGetter.cpp:0: DB::TaskStatsInfoGetter::checkPermissions() @ 0x8d8aa89 in /usr/bin/clickhouse
7. ./obj-x86_64-linux-gnu/../src/Common/ThreadProfileEvents.cpp:0: DB::TasksStatsCounters::create(unsigned long) @ 0x8d821da in /usr/bin/clickhouse
8. ./obj-x86_64-linux-gnu/../src/Interpreters/ThreadStatusExt.cpp:0: DB::ThreadStatus::initPerformanceCounters() @ 0x12c90ebe in /usr/bin/clickhouse
9. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/atomic:993: DB::ThreadStatus::setupState(std::__1::shared_ptr<DB::ThreadGroupStatus> const&) @ 0x12c90c27 in /usr/bin/clickhouse
10. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/memory:3211: DB::CurrentThread::initializeQuery() @ 0x12c91e75 in /usr/bin/clickhouse
11. ./obj-x86_64-linux-gnu/../src/Core/BackgroundSchedulePool.cpp:239: DB::BackgroundSchedulePool::attachToThreadGroup() @ 0x1250cc9b in /usr/bin/clickhouse
12. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/atomic:1006: DB::BackgroundSchedulePool::threadFunction() @ 0x1250cd8e in /usr/bin/clickhouse
13. ./obj-x86_64-linux-gnu/../src/Core/BackgroundSchedulePool.cpp:0: void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPool::ThreadFromGlobalPool<DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, unsigned long, char const*)::$_1>(DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, unsigned long, char const*)::$_1&&)::'lambda'(), void ()> >(std::__1::__function::__policy_storage const*) @ 0x1250d591 in /usr/bin/clickhouse
14. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2210: ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8d79866 in /usr/bin/clickhouse
15. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/memory:1655: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()> >(void*) @ 0x8d7d509 in /usr/bin/clickhouse
16. __tsan_thread_start_func @ 0x8c4f5ed in /usr/bin/clickhouse
17. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
18. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.4.1.6272)
2021.03.17 04:21:23.501826 [ 52 ] {e3035907-18f7-425d-81c9-4b9c5f69c6af} <Warning> HedgedConnectionsFactory: Connection failed at try №1, reason: Code: 209, e.displayText() = DB::NetException: Timeout exceeded while reading from socket (172.20.0.3:9440) (version 21.4.1.6272)
2021.03.17 04:21:24.803583 [ 52 ] {e3035907-18f7-425d-81c9-4b9c5f69c6af} <Warning> HedgedConnectionsFactory: Connection failed at try №2, reason: Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 172.20.0.3:9440 (node1:9440) (version 21.4.1.6272)
2021.03.17 04:21:25.807149 [ 52 ] {e3035907-18f7-425d-81c9-4b9c5f69c6af} <Warning> HedgedConnectionsFactory: Connection failed at try №3, reason: Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 172.20.0.3:9440 (node1:9440) (version 21.4.1.6272)
2021.03.17 04:21:26.312424 [ 11 ] {e3035907-18f7-425d-81c9-4b9c5f69c6af} <Error> executeQuery: Code: 279, e.displayText() = DB::NetException: All connection tries failed. Log: 

@azat
Collaborator Author

azat commented Mar 23, 2021

It does look related: https://gist.github.com/tavplubix/012244641439238ef9e4bf21f9f57bb0

At least it should not be. I will take a look (FYI, I've added some information to your gist).

@KochetovNicolai
Member

@azat, we assume it may be because getStackTraceString() is slow, so the tests do not fit in the timeout.
Maybe we should store the stack trace as an array instead of a string, and convert it to a string only when it is read from system.errors?

@azat
Collaborator Author

azat commented Mar 23, 2021

we assume it may be because getStackTraceString() is slow, so the tests do not fit in the timeout.

That's a good idea.

Maybe we should store the stack trace as an array instead of a string, and convert it to a string only when it is read from system.errors?

I thought about this, but decided to keep it simple and just store a string.
But it seems that it is worth it, so let's try: #22058
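
For readers following the thread, the idea discussed above (keep only the raw frame addresses when an error is recorded, and build the string when system.errors is read) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from #22058: `FramePointers`, `ErrorRecord`, `recordError` and `formatTrace` are hypothetical names, and the formatter prints hex addresses where a real implementation would resolve symbols.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a captured stack trace: raw frame addresses only.
using FramePointers = std::vector<std::uintptr_t>;

// Hypothetical per-error-code record, loosely modelled on one system.errors row.
struct ErrorRecord
{
    std::uint64_t count = 0;
    std::string last_error_message;
    FramePointers last_error_trace; // stored raw, not formatted yet
};

// Cheap path, taken every time an error is raised:
// copy the addresses and the message, do no formatting.
inline void recordError(ErrorRecord & record, const std::string & message, const FramePointers & frames)
{
    ++record.count;
    record.last_error_message = message;
    record.last_error_trace = frames;
}

// Expensive path, taken only when the table is read.
// Assumption: a real implementation would symbolize each address here;
// this sketch only prints the frame index and the address in hex.
inline std::string formatTrace(const FramePointers & frames)
{
    std::ostringstream out;
    for (std::size_t i = 0; i < frames.size(); ++i)
        out << i << ". 0x" << std::hex << frames[i] << std::dec << '\n';
    return out.str();
}
```

With this split, the cost of producing a readable trace would be paid per read of the table and not per exception, which is the property the suggestion above is after.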

4 participants