
Implement pre-threaded server #28

Closed
wants to merge 13 commits into from

Conversation

@ericonr (Contributor) commented Jul 16, 2021

Depends on #27, so only the last two commits need to be reviewed.

@ericonr force-pushed the cache-thread branch 2 times, most recently from 71a9f1a to 54df020 on July 31, 2021
@ericonr force-pushed the cache-thread branch 2 times, most recently from d64356c to 552a29e on August 16, 2021

These functions will just return notfound for now, not doing anything at
all with the values they receive. They will allow us to plug them into
socket_handle.c; then we can implement the cache for each struct
individually.
In order to implement cache cleanly, it's better to have return_result
aware of it than simply add it as an extra module. This way, we can
avoid allocations and deal with additions to the cache in a simple way.
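
A minimal sketch of what such stubs could look like (the names,
signatures, and status values below are illustrative assumptions, not
the project's actual API):

```c
#include <pwd.h>
#include <sys/types.h>

/* Status values assumed for illustration; the project may use its own. */
enum cache_status { CACHE_SUCCESS, CACHE_NOTFOUND, CACHE_UNAVAIL };

/* Stubs: ignore their arguments and always report a miss, so they can
 * be wired into socket_handle.c before any real cache exists. */
enum cache_status cache_getpwnam(const char *name, struct passwd *pw)
{
	(void)name; (void)pw;
	return CACHE_NOTFOUND;
}

enum cache_status cache_getpwuid(uid_t uid, struct passwd *pw)
{
	(void)uid; (void)pw;
	return CACHE_NOTFOUND;
}
```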

To implement cache querying, we have to change the module loop
structure, allowing for one initial run that uses the cache before the
subsequent runs that use modules listed in passwd_mods or group_mods.
Since we are changing the loop structure already, we take the
opportunity to move as many variables as possible into loop scope. We
also add some booleans to handle logic that was using some shortcuts.
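
Roughly, the reworked loop shape might be as follows (a sketch with
assumed names; the point is the extra initial iteration for the cache):

```c
#include <pwd.h>
#include <stdbool.h>
#include <stddef.h>

enum cache_status { CACHE_SUCCESS, CACHE_NOTFOUND, CACHE_UNAVAIL };

struct mod { enum cache_status (*getpwnam)(const char *, struct passwd *); };

extern enum cache_status cache_getpwnam(const char *, struct passwd *);
extern struct mod *passwd_mods[];
extern size_t n_passwd_mods;

static bool lookup_pwnam(const char *name, struct passwd *pw)
{
	/* i == -1 is the initial cache pass; i >= 0 walks passwd_mods. */
	for (long i = -1; i < (long)n_passwd_mods; i++) {
		bool from_cache = (i < 0);
		enum cache_status s = from_cache
		    ? cache_getpwnam(name, pw)
		    : passwd_mods[i]->getpwnam(name, pw);
		if (s == CACHE_SUCCESS)
			return true;
	}
	return false;
}
```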

In this commit, we also change the memory allocation strategy, both in
avoiding sysconf calls (they are now called once in
init_socket_handling) and avoiding allocating buf for the initgroups
case, which doesn't need buf and allocates its own array in nss_getkey.

Since we now support adding entries to caches, we need to delay some
free() calls, so the FREE_ALLOC macro is added to not have to duplicate
the logic of choosing which buffer to free.
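
One plausible shape for the macro (its parameters here are guesses):

```c
#include <stdlib.h>

/* Free whichever buffer was heap-allocated for this request, unless
 * ownership already moved to the cache. Parameter names are
 * illustrative. */
#define FREE_ALLOC(is_initgroups, buf, grouplist, owned_by_cache)	\
	do {								\
		if (!(owned_by_cache))					\
			free((is_initgroups) ? (void *)(grouplist)	\
					     : (void *)(buf));		\
	} while (0)
```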

We also change the error handling when writing values back to clients:
instead of immediately returning if writing fails, we still try to
record that value in the cache, since a client failure doesn't mean the
entry is now invalid. This also fixes a memory leak if the value was
invalid (write_* function returns -2) and the action for unavailable
status wasn't to return.

This commit also fixes a bug where a RETURN action would still run the
fallback block at the end of the return_result function, which would
write to a file descriptor that was already closed on the client side -
this could be observed in strace output, where a write syscall failed
with EPIPE. The program masks SIGPIPE, so this didn't lead to program
death, but it was an unnecessary syscall anyway.

Now that the loop logic has changed and we have added more eager
returns, we don't need a conditional gating this anymore and can simply
call write_* for the appropriate request.
We use pw_name and pw_uid fields as keys to query the cache. Since
accessing the cache for getpwnam and getpwuid is very similar, and will
also be very similar to getgrnam and getgrgid, we create a cache_query.h
header with all the logic. Since addition to the cache will be similar
to adding to struct group caching, we also create a header for that.

Using an inline struct passwd in cache entries instead of dynamically
allocating one can be more wasteful when the array isn't full, but it
also avoids memory fragmentation due to too many allocations and
simplifies memory handling in general.
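
An illustrative entry layout (field names assumed):

```c
#include <pwd.h>
#include <stddef.h>
#include <time.h>

/* The struct passwd lives inline in the entry rather than behind its
 * own allocation; buf holds the strings its pointers reference. */
struct pw_cache_entry {
	struct passwd pw;	/* inline, possibly larger than needed */
	char *buf;		/* backing storage for pw's string fields */
	size_t buf_len;
	struct timespec added;	/* decides whether the entry is valid */
};

struct pw_cache {
	struct pw_cache_entry *entries;
	size_t len, cap;	/* current and allocated entry counts */
};
```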

This commit also adds policy for the cache: how long an entry is valid
for, the maximum number of entries allowed, and the initial number of
entries.

The invalidation logic works as follows: the add function goes through
the cache to see whether another thread already added the entry, or
whether an outdated version of it exists. If there is an entry with our
key, we check whether it's valid: if it is, we simply return; if not,
we overwrite that entry. If there is no pre-existing entry, we either
overwrite the most outdated invalid entry (this avoids growing the
cache unnecessarily, and works almost like a least-recently-used cache)
or, if no such entry exists, increase the cache size.
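
As a sketch, reusing the illustrative pw_cache types above
(entry_valid, overwrite_entry, and grow_cache are assumed helpers):

```c
#include <pwd.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* struct pw_cache and struct pw_cache_entry as sketched earlier. */
extern bool entry_valid(const struct pw_cache_entry *e);
extern void overwrite_entry(struct pw_cache_entry *e, const struct passwd *pw,
			    char *buf, size_t len);
extern void grow_cache(struct pw_cache *c, const struct passwd *pw,
		       char *buf, size_t len);

/* Caller holds the write lock; takes ownership of buf. */
void cache_add_pwnam(struct pw_cache *c, const struct passwd *pw,
		     char *buf, size_t len)
{
	struct pw_cache_entry *stale = NULL;

	for (size_t i = 0; i < c->len; i++) {
		struct pw_cache_entry *e = &c->entries[i];
		if (strcmp(e->pw.pw_name, pw->pw_name) == 0) {
			if (entry_valid(e)) {
				free(buf);	/* another thread beat us */
				return;
			}
			overwrite_entry(e, pw, buf, len);   /* refresh stale */
			return;
		}
		/* Track the most outdated invalid entry as a reusable slot. */
		if (!entry_valid(e) &&
		    (!stale || e->added.tv_sec < stale->added.tv_sec))
			stale = e;
	}
	if (stale)
		overwrite_entry(stale, pw, buf, len);	/* LRU-like reuse */
	else
		grow_cache(c, pw, buf, len);		/* up to the max */
}
```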

We also add a static variable to control whether caching is enabled, and
macros to use it automatically. If it's disabled, query functions can
immediately return UNAVAILABLE status, and add functions can immediately
free the buffer passed to them (the contract of the add functions is
that they take ownership of the buffer passed to them).
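
For instance (macro and variable names are made up):

```c
#include <stdbool.h>
#include <stdlib.h>

static bool cache_enabled;	/* set once at startup */

/* At the top of every query function: report "no answer here" when the
 * cache is off. UNAVAILABLE stands in for the project's status value. */
#define CACHE_QUERY_CHECK()					\
	do { if (!cache_enabled) return UNAVAILABLE; } while (0)

/* At the top of every add function: add functions own the buffer, so
 * they must free it even when caching is disabled. */
#define CACHE_ADD_CHECK(buf)					\
	do { if (!cache_enabled) { free(buf); return; } } while (0)
```
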
Since the code is basically identical to struct passwd handling, this
commit just makes use of the cache_add.h and cache_query.h headers, and
ends up rather short.

The logic is similar to struct passwd caching, but re-using the headers
isn't possible, so this commit has to reimplement most of it. The query
function could have avoided dynamic allocations by passing its own
buffer, but that would have required a lot of additional logic in
return_result.

We re-use struct initgroups_res from socket_handle.c, so it's moved to
modules.h as well.
The initial design was based on the idea that handing out access to the
cache's underlying storage would avoid allocations and, generally
speaking, lead to a cleaner design. Unfortunately, it also opened the
program up to use-after-free issues.

The sequence of events below shows a possible scenario (Tx are
threads):

1. T1 enters the cache, finds a valid entry, and returns it (returning
   the buffer from the cache)
2. T2 enters the cache, finds that the entry is now invalid, doesn't
   return it; gets a new entry from the NSS module
3. T2 tries to add the new entry to the cache, releasing the underlying
   memory (free(buffer))
4. T1 finally gets to writing out, delayed for _whatever reason_, and
   now buffer is UAF

There are two possible ways to deal with this:

- add a mutex to the struct and lock it while the entry is in use; this
  means the wrlock could be held for a potentially very *long* time,
  holding up the rdlock for consumers. Therefore, not an option
- actually allocate memory when reading from the cache; not ideal, but
  it avoids all of the lifetime issues

We went with the latter, which required simplifying return_result by
splitting out the cache query.
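
A sketch of the allocate-on-read approach (names assumed; only
pw_name's pointer fixup is shown, the other string fields follow the
same pattern):

```c
#include <pthread.h>
#include <pwd.h>
#include <stdlib.h>
#include <string.h>

/* struct pw_cache and struct pw_cache_entry as sketched earlier. */
extern struct pw_cache_entry *find_valid_pwnam(struct pw_cache *c,
					       const char *name);

/* Returns a heap copy the caller owns, so no cache memory outlives the
 * read lock; a concurrent overwrite in the add path can't invalidate it. */
int cache_getpwnam(struct pw_cache *c, pthread_rwlock_t *lock,
		   const char *name, struct passwd *out, char **out_buf)
{
	int ret = -1;

	pthread_rwlock_rdlock(lock);
	struct pw_cache_entry *e = find_valid_pwnam(c, name);
	if (e) {
		char *copy = malloc(e->buf_len);
		if (copy) {
			memcpy(copy, e->buf, e->buf_len);
			*out = e->pw;
			/* Redirect string fields into the copy. */
			out->pw_name = copy + (e->pw.pw_name - e->buf);
			/* ... same for pw_passwd, pw_gecos, pw_dir, pw_shell */
			*out_buf = copy;
			ret = 0;
		}
	}
	pthread_rwlock_unlock(lock);
	return ret;
}
```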

Especially now that we had to change how cache results are dealt with,
it's better to duplicate some code (specifically, a tiny bit of logic
from nss_getkey into nss_getkey_cache and the write_* callers) and keep
the logic clearer. It also avoids ugly type and pointer punning.

The resulting nss_getkey_cache is much more direct, however; the only
logic that's really duplicated is how arguments are interpreted before
being passed to the cache query functions.

Obtain `now` once before entering the loop, and save on comparisons by
storing the return value of compare_timestamps where appropriate.

This can speed up cache access considerably when accessing elements
near the end of it. From local benchmarks, scanning a cache with 10k
entries using only integer comparisons is almost free compared to the
communication overhead, but repeated strcmp calls can slow things down
somewhat. Using a hash lets us greatly decrease the number of strcmp
calls.

Hash function recommendation taken from [1].

[1] http://www.orthogonal.com.au/computers/hashstrings/
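
Illustratively (the function below is a generic multiplicative string
hash standing in for the one recommended in [1]):

```c
#include <stdint.h>
#include <string.h>

/* Computed once when an entry is added and once per query. */
static uint32_t hash_key(const char *s)
{
	uint32_t h = 0;
	while (*s)
		h = h * 31 + (unsigned char)*s++;
	return h;
}

/* In the scan loop, the string compare only runs on a hash match. */
static int key_matches(uint32_t entry_hash, const char *entry_key,
		       uint32_t query_hash, const char *query_key)
{
	return entry_hash == query_hash &&
	       strcmp(entry_key, query_key) == 0;
}
```
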
The -C option now requires an argument specifying the time until a cache
entry becomes invalid. This way, administrators are forced to determine
and specify their own.

The new -n option specifies the maximum amount of cache entries that can
be used.

Since 0 is an invalid value for either option, we can use atol() to
parse optarg and always error out on a 0 return.
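
A sketch of that parsing rule (variable names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static long cache_ttl;		/* -C: seconds until an entry is invalid */
static long cache_max_entries;	/* -n: maximum number of entries */

static int parse_cache_opts(int argc, char *argv[])
{
	int c;
	while ((c = getopt(argc, argv, "C:n:")) != -1) {
		switch (c) {
		case 'C':
			cache_ttl = atol(optarg);
			if (cache_ttl == 0)
				goto bad; /* covers "0" and non-numbers */
			break;
		case 'n':
			cache_max_entries = atol(optarg);
			if (cache_max_entries == 0)
				goto bad;
			break;
		default:
			return -1;
		}
	}
	return 0;
bad:
	fprintf(stderr, "invalid argument: %s\n", optarg);
	return -1;
}
```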

We also lower the initial number of entries, to minimize memory
consumption on systems with few users, and lower the maximum number of
entries to 1k.

Instead of launching a thread to listen for the next request, have a
given number (for now, 5) of server threads pre-launched and a listener
thread that dispatches queries to them. This way, we improve latency
considerably when more than one client is making queries at the same
time.

serv_thread() is the logic after the accept() call in socket_handle(),
plus the semaphore handling. socket_handle() only calls accept and sends
requests to the server threads. Since socket_handle() is only called
once now, we can remove most of its parameters; initialization of
locale_t l is moved to init_socket_handling.
This way we can also decrease the number of default threads to 2 (one
listener and one answering queries).
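
The handoff could be structured along these lines (a sketch; the real
serv_thread/socket_handle bodies and queue shape are not shown in this
PR excerpt):

```c
#include <pthread.h>
#include <semaphore.h>
#include <sys/socket.h>
#include <unistd.h>

#define NTHREADS 5	/* pre-launched server threads */
#define QLEN 64		/* pending-fd ring; overflow not handled here */

static int queue[QLEN];
static int qhead, qtail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static sem_t pending;	/* counts queued client fds */

static void *serv_thread(void *arg)
{
	(void)arg;
	for (;;) {
		sem_wait(&pending);	/* sleep until the listener posts */
		pthread_mutex_lock(&qlock);
		int fd = queue[qhead];
		qhead = (qhead + 1) % QLEN;
		pthread_mutex_unlock(&qlock);
		/* ... read the request and answer it, as the old
		 * socket_handle() body did after accept() ... */
		close(fd);
	}
	return NULL;
}

/* The listener: accept() and hand off, nothing else. */
static void socket_handle(int srv)
{
	for (;;) {
		int fd = accept(srv, NULL, NULL);
		if (fd < 0)
			continue;
		pthread_mutex_lock(&qlock);
		queue[qtail] = fd;
		qtail = (qtail + 1) % QLEN;
		pthread_mutex_unlock(&qlock);
		sem_post(&pending);	/* wake exactly one server thread */
	}
}

/* init_socket_handling (not shown) would sem_init(&pending, 0, 0) and
 * pthread_create the NTHREADS workers before calling socket_handle. */
```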