Multiprocess Communication

This is HO NGOK CHAO's README file.

Project Description

This project has two parts.

Part 1 is basically an extension of Project 1: instead of transferring a file that is available locally, the gfserver is repurposed as a web proxy with a libcurl callback function registered, so the file is downloaded from the network upon a request from gfclient.

credit: https://github.gatech.edu/gios-spr-23/pr3

Part 2 is about IPC. Instead of using libcurl to download from the network, the requirement is to build a Cache process (simplecached) that reads local files and delivers their content to the Proxy process. The delivery mechanism involves two channels: a command channel implemented with a POSIX Message Queue and a data channel implemented with the POSIX shared memory API.


In part 2, no download from the external network happens, for simplicity; hence, when the proxy requests a URL that the Cache does not have locally (cached), we simply return GF_FILE_NOT_FOUND to the client.

Project Design

part 1

Two separate curl sessions (curl_easy_perform) are used.

One issues an HTTP HEAD request (CURLOPT_NOBODY) so that we obtain only the Content-Length from the response header via CURLINFO_CONTENT_LENGTH_DOWNLOAD_T; this way we do not need to read the entire body into memory just to compute the file size for the header sent to the client.

The other session then uses CURLOPT_WRITEFUNCTION and CURLOPT_WRITEDATA to send each chunk of data received from the external network (via HTTP) to the client.
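
A minimal sketch of this two-session flow (hypothetical helper names, error handling trimmed; the real proxy would forward each chunk to the client from inside the write callback):

#include <curl/curl.h>

/* Assumed callback: forwards each received chunk onward; the real proxy
   sends it to the client here instead of discarding it. */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
    (void)ptr; (void)userdata;
    return size * nmemb; /* report the whole chunk as consumed */
}

long fetch_and_forward(const char *url) {
    curl_off_t len = -1;

    /* Session 1: HEAD-style request (no body) just to learn Content-Length. */
    CURL *head = curl_easy_init();
    curl_easy_setopt(head, CURLOPT_URL, url);
    curl_easy_setopt(head, CURLOPT_NOBODY, 1L);
    if (curl_easy_perform(head) == CURLE_OK)
        curl_easy_getinfo(head, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &len);
    curl_easy_cleanup(head);

    /* Session 2: full GET, streaming each chunk through write_cb. */
    CURL *get = curl_easy_init();
    curl_easy_setopt(get, CURLOPT_URL, url);
    curl_easy_setopt(get, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(get, CURLOPT_WRITEDATA, NULL);
    curl_easy_perform(get);
    curl_easy_cleanup(get);

    return (long)len;
}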

Trade-offs explained:

  • Content-Length may not be accurate if the content is generated dynamically. But sending to the client whenever we receive a chunk is faster than reading everything before sending, and is also more memory-efficient.

  • Two curl sessions are slower than one session that writes header and body data separately to two different memory locations via CURLOPT_WRITEDATA and CURLOPT_HEADERDATA, as in this example: https://curl.se/libcurl/c/CURLOPT_WRITEFUNCTION.html. But that would require efficiently parsing the header for Content-Length, so for simplicity I did not choose this method.

part 2

To fulfill the requirement of communication between the Proxy and the Cache, where the Proxy sends a request to the Cache and the Cache returns the requested data, we need two channels: a command channel for sending requests and a data channel for content sharing.

The following diagram explains the general flow of information between one proxy worker thread (serving one client request) and one cache worker thread (serving one proxy request):


Proxy->Cache: Please write /paraglider.jpg to segment /gios-spr-23-pr3-25496
Note left of Proxy: Unlock writing semaphore \n and block on reading semaphore
Note right of Cache: Read local file,\n write to segment
Cache-->Proxy: unblock reading semaphore
Note right of Cache: Block on writing semaphore
Note left of Proxy: Read content from segment
Proxy->Cache: another request

While it would be possible to use shared memory for everything between these two unrelated processes (one is not the parent of the other), it is easier and less error-prone to use an OS-provided facility to send requests carrying small amounts of data.

But for content sharing, where the data size is considerably larger, shared memory is the better choice for performance reasons.

Trade-offs for POSIX Message Queue:

  • slower than shared memory, but the kernel takes care of synchronization
  • the command channel is one-way (proxy to cache), so a FIFO could work as well, but we would need to open N FIFOs (for N pairs of proxy and cache threads) if we do not want to be limited by PIPE_BUF while reading exactly one predefined-size message at a time. With a single FIFO, if a message is larger than PIPE_BUF, the bytes sent by one proxy thread can interleave with another's, and there is no built-in message boundary beyond the developer knowing the predefined message size.
  • In my opinion a POSIX Message Queue is more flexible and cleaner: the message size can be set via msgsize_max (which itself has a platform-dependent hard limit, HARD_MSGSIZEMAX), there is an explicit invalid-argument error for the message queue, and the OS takes care of everything for you, including popping exactly one message at a time and synchronization.
  • Although, for performance reasons (each message is copied from user space into the kernel and back out to user space), minimal information is put on the command channel; in this implementation only the segment name and segment size are sent (details in the Implementation section), so PIPE_BUF would actually have been enough. A sketch of the queue setup follows this list.
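
As an illustration, a minimal sketch of creating such a queue with an explicit per-message size (the queue name here is hypothetical; message_info is the struct shown later in the Implementation section):

#include <mqueue.h>
#include <fcntl.h>

typedef struct message_info { /* as defined in the Implementation section */
  char name[30];
  int segsize;
} message_info;

mqd_t open_request_queue(void) {
    struct mq_attr attr = {0};
    attr.mq_maxmsg  = 10;                   /* queue depth, capped by the system */
    attr.mq_msgsize = sizeof(message_info); /* explicit, fixed message size */
    /* O_CREAT: create the queue if it does not exist yet */
    return mq_open("/gios-request-queue", O_RDWR | O_CREAT, 0666, &attr);
}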

Trade-offs for POSIX shared memory

  • POSIX shared memory is less widely available (especially on older systems) than System V shared memory, but the interface is easier to use (a path is used directly as the key, with no extra key-generation call) and a file descriptor is returned.
  • Faster, since there is no back and forth between user space and kernel space, but we have to take care of synchronization ourselves. In my implementation I used one reading semaphore and one writing semaphore, employed in the manner of the following simplified diagram (a more detailed version appears in the control graphs, where data is read/sent chunk by chunk); a sketch of the semaphore setup follows the diagram.
stateDiagram
    sem_post(wsem) --> sem_wait(wsem) : unblock
    sem_post(rsem) --> sem_wait(rsem) : unblock
    mq_send(msg) --> mq_receive(msg): unblock
    
    Note left of mq_receive(msg): blocking
    Note left of sem_wait(wsem): blocking
    Note left of sem_wait(rsem): blocking
    state Cache {
        [*] --> mq_receive(msg)
        mq_receive(msg) --> sem_wait(wsem)
        sem_wait(wsem) --> copy_data_to_shm
        copy_data_to_shm --> sem_post(rsem)
        sem_post(rsem) --> sem_wait(wsem)
    }
    
    state Proxy {
        [*] --> mq_send(msg)
        mq_send(msg) -->sem_post(wsem)
        sem_post(wsem) --> sem_wait(rsem)
        sem_wait(rsem)--> gfs_send(shm,client)
        gfs_send(shm,client) --> sem_post(wsem)
    }
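
A minimal sketch of how that semaphore pair is set up, assuming (as in the Implementation section) that both semaphores live inside the shared segment, so pshared = 1 makes them usable from both processes:

#include <semaphore.h>

/* Both semaphores are members of the shared seg_info struct, so they are
   initialized in place; pshared = 1 shares them across processes. */
static void init_segment_semaphores(sem_t *rsem, sem_t *wsem) {
    sem_init(rsem, 1, 0); /* proxy blocks on rsem until the cache posts it */
    sem_init(wsem, 1, 0); /* cache blocks on wsem until the proxy posts it */
}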
    

Flow of Control Graphs

part 1

The web proxy's control graph is essentially the same as in Project 1, except that the callback function is changed to use curl; it can be found here: https://hackmd.io/KsB88XFNSd2idCGiZM7Qrw?view

part 2

The Proxy has to establish the Message Queue (for the shared memory path), the Shared Memory segments (for the content plus auxiliary info), and a Linked List for the boss thread to enqueue shared memory segments. When a worker thread picks up a client request, it pops one available segment from the Linked List and shares its name with the Cache via the Message Queue; data sharing then continues as can be seen in the graph below. When the worker thread is done serving the client request, it restores the segment and enqueues it back onto the Linked List. The Linked List is protected by a Mutex and a Condition Variable, roughly as sketched below.
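
A minimal sketch of that protected pop/enqueue, assuming the course-provided steque API (steque_isempty, steque_pop, steque_enqueue); the names here are illustrative:

#include <pthread.h>
#include "steque.h" /* course-provided queue */

extern steque_t *gSegQueue;
static pthread_mutex_t gMutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gCond  = PTHREAD_COND_INITIALIZER;

/* Worker thread: block until a segment is available, then take it. */
void *acquire_segment(void) {
    pthread_mutex_lock(&gMutex);
    while (steque_isempty(gSegQueue))      /* loop guards against spurious wakeups */
        pthread_cond_wait(&gCond, &gMutex);
    void *seg = steque_pop(gSegQueue);
    pthread_mutex_unlock(&gMutex);
    return seg;
}

/* After serving a request: restore the segment and wake one waiter. */
void release_segment(void *seg) {
    pthread_mutex_lock(&gMutex);
    steque_enqueue(gSegQueue, seg);
    pthread_cond_signal(&gCond);
    pthread_mutex_unlock(&gMutex);
}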

To allow a unique path name for each of multiple Proxy servers, /gios-spr-23-pr3-{port * segmentid} is used as the pathname, since each Proxy server uses a different port.
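
A minimal sketch of the per-segment setup under that naming scheme, assuming the seg_info struct from the Implementation section is in scope:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

seg_info *create_segment(int port, int segid, int segsize) {
    char name[64];
    /* unique per proxy instance: each server uses a different port */
    snprintf(name, sizeof(name), "/gios-spr-23-pr3-%d", port * segid);

    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0) return NULL;

    /* size the object for the struct header plus its flexible array member */
    if (ftruncate(fd, sizeof(seg_info) + segsize) < 0) return NULL;

    seg_info *seg = mmap(NULL, sizeof(seg_info) + segsize,
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the fd is closed */
    return seg == MAP_FAILED ? NULL : seg;
}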

webproxy.c

graph TD

A[set URL base,nsegments,port,number of worker threads,segsize] --> C{open shared memory segment <br> with path port number * segmentid}
C -->|Yes| D{map shared memory to virtual address space}
C -->|No| E[Exit]
D -->|Yes| F[steque_init gSegQueue,gAllSegQueue]
D -->|No| E
F --> H[Create read and write semaphores for each segment and call steque_enqueue]
H --> I[Create message queue if does not exist]
I --> J[register function pointer handle_with_cache]
J --> K[start server looping worker threads each serving one client request]
K --> K

Details of the send and receive while loop communicating between the Cache and the Client:

graph TD
    C{Segment status is not DONE?}
    C -->|Yes| D[sem_post wsem]
    C -->|no| E[restore segment and return bytes_transferred]
    D -->F[sem_wait blocking until cache worker thread call]
    F -->G{segptr->fileSize < 0}
    G -->|No| H[send header GF_OK]
    G -->|Yes| I[send header GF_FILE_NOT_FOUND]
    I --> E
    H -->J{segment status is not ERROR}
    J -->|Yes| K[gfs_send]
    J -->|No| L[restore segment and return SERVER_FAILURE]
    K --> C

sig_handler for webproxy.c

sig=>operation: received SIGTERM or SIGINT
loop=>operation: - pop all segments of gAllSegQueue
- release every segment (including its semaphores)
dqueue=>operation: steque_destroy(gAllSegQueue)
steque_destroy(gSegQueue)
dmsg=>operation: mq_close and mq_unlink
end=>operation: exit

sig->loop->dqueue->dmsg->end

The Cache has to establish the Message Queue for the shared memory path if it was not already created by the Proxy. The main thread then spawns worker pthreads, each serving one proxy request at a time until its status is DONE. Each worker thread is trapped in a while loop, blocked on mq_receive() until the proxy calls mq_send() with the path of a Shared Memory segment; the worker thread then opens that segment, maps it into its virtual address space, writes the content it reads for the request, and calls sem_post(rsem) to unblock the Proxy worker thread blocked on rsem (the reading semaphore).
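
A condensed sketch of one worker iteration, assuming the message_info and seg_info structs from the Implementation section (the chunked pread/sem_post/sem_wait loop is elided; see the graphs below):

#include <mqueue.h>
#include <sys/mman.h>
#include <semaphore.h>
#include <fcntl.h>
#include <unistd.h>

void serve_one_request(mqd_t mqd) {
    message_info msg;
    /* block until the proxy announces a segment name */
    mq_receive(mqd, (char *)&msg, sizeof(msg), NULL);

    int fd = shm_open(msg.name, O_RDWR, 0666);
    seg_info *seg = mmap(NULL, sizeof(seg_info) + msg.segsize,
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* ... pread() the requested file chunk by chunk into seg->data,
       alternating sem_post(&seg->rsem) / sem_wait(&seg->wsem) ... */

    sem_post(&seg->rsem); /* unblock the proxy worker one last time */
    munmap(seg, sizeof(seg_info) + msg.segsize);
}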

In simplecached.c

graph TD
M[mq_open<br>create if not exist] --> S[pthread_create registering callback function sendjob]

S --> E[each thread trapped in a while loop serving proxy request]
E --> E

sig_handler for simplecached.c

graph TD
A[received SIGTERM or SIGINT]--> B[mq_close and mq_unlink]
B --> C[simplecache_destroy]
C --> D[END]

While loop

graph TD
C[blocking until msg arrives]-->D[open shared memory with path from msg]
D --> E[map shared memory segment into VA]
E --> F{get fd}
F -->|Success| J
F -->|Fail| G[set status ERROR and fileSize -1]
H --> L{bytes_copied < read_len <br> if read_len == 0 <br>set status DONE}
J{bytes_transferred <=file_len}
J -->|exit from break with status DONE| K
J -->|Yes| H[pread with offset bytes_transferred]
L -->|Yes| M[Write at most segment size bytes of data to segment <br>bytes_transferred += transfer size<br>bytes_copied += write_len<br>sem_post reading semaphore to unblock Proxy<br>sem_wait to block on writing semaphore]
L -->|No| J
M --> L

G -->K[sem_post reading semaphore to unblock Proxy and munmap]
K -->C

Implementation

The Project Design and Flow of Control Graphs sections have already explained in detail why these IPC mechanisms were chosen and the control flow for using them.

This section illustrates the structs and enum that control what is shared and how.

For each request there are three possible statuses:

  • START for a proxy request not yet served
  • ERROR for a proxy request that cannot be served because of a read error or a file-not-found error (the latter distinguished from other errors by fileSize == -1)
  • DONE for a completed proxy request, with all content delivered through the shared memory segment
typedef enum transfer_status {START,ERROR,DONE} transfer_status;

Details are written in the comments of seg_info, the node type of the Linked List. Note that the data member of this struct is a flexible array member, supported since C99; later we allocate sizeof(seg_info) + segsize bytes for the whole struct object, so that segsize bytes of shared memory back this data member.

typedef struct seg_info{
  char name[100]; //segment name on shm_open
  char filePath[1024]; // request file path
  int segsize; //segment size specified by user
  int tranSize; // number of bytes written to data segment
  int fileSize; // -1 if file not found otherwise file size in bytes
  int header;
  sem_t rsem; // inter process read semaphore for the cache to tell proxy ready to read
  sem_t wsem; // inter process write semaphore for the proxy to tell cache ready to write
  transfer_status status;
  char data[];// flexible array member

} seg_info;

Each request from the Proxy to the Cache starts with mq_send, which sends a message of type message_info; it carries minimal information for faster message delivery and leaves the rest to seg_info above, since shared memory is more efficient for large data transfers (see the sketch after the struct).

typedef struct message_info{
  char name[30]; // shared memory path
  int segsize;
} message_info;
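
On the Proxy side the send then looks roughly like this (a sketch; q is the queue descriptor and seg is the segment popped from gSegQueue):

#include <mqueue.h>
#include <string.h>

void announce_segment(mqd_t q, const seg_info *seg) {
    message_info msg = {0};
    strncpy(msg.name, seg->name, sizeof(msg.name) - 1); /* shm path only */
    msg.segsize = seg->segsize;
    /* the kernel copies this small struct; the bulk data stays in shm */
    mq_send(q, (const char *)&msg, sizeof(msg), 0);
}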

Some global variables in the BSS segment are shared among the boss and worker threads.

  • gSegQueue: used by Proxy worker threads to obtain and release segments (one at a time, protected by a mutex and a condition variable).

  • gAllSegQueue: a queue holding all shared memory segments, so that the signal handler can free every segment immediately at any time.

  • gRequestQueue: the message queue shared by all Proxy threads.

steque_t *gSegQueue, *gAllSegQueue;
mqd_t gRequestQueue;

Testing

Tested using the base, stress, and soak tests in ipcstress.py from https://github.gatech.edu/cparaz3/6200-tools

python3 ./ipcstress.py /path/to/project/cache test-name

Current tests:
base - the baseline test, get this to work first
stress - stress with an increasing number of threads but fixed parameters
soak - run for a long time with a fixed number of threads

In each test, subprocesses are opened for the Proxy and the Cache via popen(), and utime and stime are read from columns 13 and 14 (counting from 0) of /proc/{pid}/stat.

def read_cpu_times(pid: int) -> Tuple[int, int]:
    """ Read utime (user time) and stime (system/kernel time) for a PID, in ticks. """
    with open(f'/proc/{pid}/stat', 'r') as file:
        entries = file.readline().rstrip().split(' ')
        return int(entries[13]), int(entries[14])

The script produces two displays:

parameters input:

cache_thread_count=100, proxy_thread_count=100, proxy_segment_count=50, proxy_segment_size=1048576, download_thread_count=100, request_count=1000000

performance output:

8000/1000000 in 1.00s, 998.76 rps, 3070636950 bps, cache: 0.25s 24.97% user, 0.47s 46.94% kernel, 0.72s 71.91% total, proxy: 0.06s 5.99% user, 0.47s 46.94% kernel, 0.53s 52.93% total Subprocess download client still running:

9000/1000000 in 1.00s, 999.09 rps, 3071660223 bps, cache: 0.27s 26.98% user, 0.47s 46.96% kernel, 0.74s 73.93% total proxy: 0.09s 8.99% user, 0.44s 43.96% kernel, 0.53s 52.95% total

which means roughly 1000 requests served in one second (999.09 requests per second) and 3071660223 bytes per second transferred;

for the cache subprocess: 0.27 seconds in user mode, 0.47 seconds in kernel mode, hence 73.93% usage of the CPU core it runs on;

for the proxy subprocess: 0.09 seconds in user mode, 0.44 seconds in kernel mode, hence 52.95% CPU usage.

On the proxy, the heavy kernel time appears to be spent in copy_user_enhanced_fast_string and clear_page_erms, so I suspect the cost comes from clearing socket memory and copying from shared memory into it. If we could open a file descriptor for the requested file and send it directly to the socket fd via sendfile(), without copying through user space, our process could be faster.

from https://man7.org/linux/man-pages/man2/sendfile.2.html

sendfile() copies data between one file descriptor and another.

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
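
A hypothetical sketch of that idea (not what the project does): open the file and hand it straight to the socket, letting the kernel do the copy:

#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t send_whole_file(int client_fd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    off_t offset = 0;
    off_t len = lseek(fd, 0, SEEK_END); /* file size; sendfile reads via *offset */
    ssize_t sent = sendfile(client_fd, fd, &offset, (size_t)len);
    close(fd);
    return sent;
}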

Using sudo perf top -p

Samples: 5M of event 'cycles', 4000 Hz, Event count (approx.): 39 lost: 0/0 drop: 0/0
Overhead  Shared O  Symbol
  53.85%  [kernel]  [k] clear_page_erms
  43.59%  [kernel]  [k] copy_user_enhanced_fast_string
   2.56%  webproxy  [.] __sanitizer::mem_is_zero

On the cache, kernel time is spent in copy_user_enhanced_fast_string only. I think this is because I use pread, which copies from kernel-space memory to user-space memory, followed by a memcpy into the shared memory data member; I could instead use splice(), which should be faster since it avoids the copy from kernel to user space.

from https://manned.org/splice.2

splice() moves data between two file descriptors without copying between kernel address space and user address space.

Samples: 45K of event 'cycles', 4000 Hz, Event count (approx.): 12288907981 lost: 0/0 drop: 0/0
Overhead  Shared Object       Symbol
  35.56%  [kernel]            [k] copy_user_enhanced_fast_string
  19.54%  libc-2.31.so        [.] __opendirat
  10.04%  simplecached        [.] __sanitizer::mem_is_zero
   4.52%  [kernel]            [k] filemap_get_read_batch
   2.47%  [kernel]            [k] filemap_read
   1.63%  [kernel]            [k] copy_page_to_iter
   1.54%  [kernel]            [k] psi_group_change
   0.69%  [kernel]            [k] gup_pgd_range
   0.63%  [kernel]            [k] mark_page_accessed
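
For illustration, a hedged sketch of the splice() idea: one side of splice() must be a pipe, so data would move file -> pipe -> destination fd entirely inside the kernel (a single pass; a real loop would repeat until everything is moved):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

ssize_t splice_through(int in_fd, int out_fd, size_t count) {
    int p[2];
    if (pipe(p) < 0) return -1;
    /* file -> pipe, then pipe -> destination; no user-space buffer involved */
    ssize_t n = splice(in_fd, NULL, p[1], NULL, count, SPLICE_F_MOVE);
    if (n > 0)
        n = splice(p[0], NULL, out_fd, NULL, (size_t)n, SPLICE_F_MOVE);
    close(p[0]); close(p[1]);
    return n;
}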

References:

flexible array member: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

POSIX semaphore: https://linux.die.net/man/3/sem_init

mmap: https://man7.org/linux/man-pages/man2/mmap.2.html

POSIX message queues: https://man7.org/linux/man-pages/man3/mq_open.3.html https://w3.cs.jmu.edu/kirkpams/OpenCSF/Books/csf/html/MQueues.html#cl3-7 https://man7.org/linux/man-pages/man3/mq_receive.3.html


Read with offset: https://man7.org/linux/man-pages/man2/pread.2.html

Page 74 for read/write semaphores design: https://man7.org/conf/lca2013/IPC_Overview-LCA-2013-printable.pdf

libcurl for part 1: https://curl.se/libcurl/c/CURLINFO_CONTENT_LENGTH_DOWNLOAD_T.html https://curl.se/libcurl/c/CURLOPT_NOBODY.html https://curl.se/libcurl/c/CURLOPT_WRITEFUNCTION.html

Test tools: https://github.gatech.edu/cparaz3/6200-tools

Perf: https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37 https://hackmd.io/@sysprog/gnu-linux-dev/https%3A%2F%2Fhackmd.io%2Fs%2FB11109rdg

splice: https://manned.org/splice.2 https://stackoverflow.com/questions/61514215/where-and-why-do-read2-and-write2-system-calls-copy-to-and-from-

sendfile: https://man7.org/linux/man-pages/man2/sendfile.2.html
