Hello. In this post I will describe a memory allocator that I have written. the main purpose for the memory allocator was to learn how it works, and also to have a way to allocate and to deallocate memory for assembly programs that I write for CCP/M-86 in assembly language. The goal was to write it in assembly, but in order to be sure that I can have something that works, I decided to make a test version in C in order to be sure that I understand how it works.
Now I will describe why I need a memory allocator. It is needed to be able to get some part of memory in order to store a dynamic data structure, like a string for example. If the user has the possibility to input a string, we cannot know in advance its size, so we get a large buffer. But once we know it, it would be better to store only what is needed. So, if, for example we have many strings of different lengths, it would be better to allocate the memory depending on the length of the string.
There are, of course many other examples why a memory allocator might be needed. Many data structures, like trees and lists are often implemented by using dynamic memory allocation.
Other use is to be able to store temporary, intermediate data. For example, we read a file and generate a list of strings from it. Then we can filter or concatenate them and finally insert the results into a hash table or write them into a file. During all these operations, we might need to generate some intermediate data structures. So, in order to be able to reuse the memory that had been used by these structures again, it is useful to free memory if it’s not used anymore.
This is exactly what memory allocators do: you can allocate an amount of memory space you need, and once you don’t need it anymore, you can return it back to the allocator, so it can be used for other allocations.
After reading about several different types of memory allocators, I decided that the allocator which uses the generalized Fibonacci sequence would be alright for my purpose: it does not waste a lot of memory (in average approximately 82.2% of the memory is used) and does not look too complicated to code (just some additions and subtractions of numbers and memory addresses).
So what is a generalized Fibonacci sequence? The usual Fibonacci sequence, which is very common in books about programming languages is the sequence described by the formula an = an-1 + an-2. So, a generalized Fibonacci sequence is when the last item has a larger number than 2. In my allocator I use the sequence generated by an = an-1 + an-4. The main advantage is that the difference between the number is not so big, which makes it use the memory space more efficiently. Another common memory allocator uses the formula an = 2 * an-1, which in some way is also a generalized Fibonacci sequence.
My allocator specifies the size of the items it allocates in blocks of 8 bytes. The main reason is to be able to store three additional flag bits in the size field in the item. It is important for space efficiency: otherwise I would lose some bytes for each allocation and I also wanted to have the returned area aligned for the needs of the OS, we cannot know: if the user decides to use it in order to store a structure or a multi-byte variable there, it would better be aligned. So if the OS returns aligned memory, my allocator will also return aligned memory. Otherwise it does not matter.
The memory allocator uses an array of free lists. A free list is a linked list of items of the same size, which is a number from the generalized Fibonacci sequence multiplied by the size of the block. It is not the area usable by the user: it also includes the header, which contains the total size of the item.
In the array the items are stored in the increasing order. The structures contained in the array are called cells in my implementation. They contain a field which specifies the size of the corresponding free list, and a field which is a pointer to the head of the free list.
This is a buddy memory allocator, which is handy with a generalized Fibonacci sequence: we can create a larger item by simply combining it with another item of a smaller size.
So this is how the buddy system works: we first allocate some memory from the Operating System and put it into the array in an appropriate free list. When the user wants some memory, we can split the memory available into buddies, where each buddy is of the size a number of the sequence multiplied bu the number of bytes in the block. After splitting, we can insert the buddy which is not used anymore into its free list, since it also has the appropriate size.
When freeing, it’s the opposite that happens: we want to return the buddy to the free list, but if it’s buddy is not in use, we can first merge it with its buddy and insert the merged block into the corresponding free list. The generalized Fibonacci sequence guarantees us that whenever we split or merge items, we have something that has a corresponding free list. In order to know how to merge and how to find the buddy, we use flags, which will be explained later.
The array that contains the free lists has to be able to grow if more items need to be allocated. That’s why I use a dynamic array. Its capacity is increased when more items need to be added. In this situation another area big enough to hold the array is allocated and a new array is created in this area, which is a copy of the old array, but with greater capacity. The old array is freed. The good thing about the array is that it only use the allocator: it does not need special Operating System calls.
In this section I will describe in more detail how everything works.
The item is a structure which is the node of the free list, a doubly linked list. The item has two modes of existence: free and used. When it’s free, it is in the free list and contains pointers to the previous and the next items. When it’s used it contains the area instead of the pointers.
But whatever the mode is, there is always a header. It contains one
single field of the size of the pointer. This field contains the size
of the item in blocks and 3 flags: lr_bit
, inh_bit
, and in_use
bit. The in_use
bit tells us whether the item is used or not. The
ls_bit
and the inh_bit
are needed in order to know how to merge
buddies: the buddy can be the left buddy or the right buddy, so we
might need to go to the right or to the left in order to find the
budd.
The item is not a C structure. It’s an area of type *void
, for
which accessor functions are used. So here are some examples of
accessor functions:
uintptr_t
item_get_size(void *item)
{
uintptr_t size_field = ((uintptr_t*)item)[0];
return size_field >> 3;
}
void
item_set_size(void *item, uintptr_t size)
{
uintptr_t size_field = ((uintptr_t*)item)[0];
uint8_t flags = size_field & 7;
uintptr_t new_size_field = flags | (size << 3);
((uintptr_t*)item)[0] = new_size_field;
}
Here are functions for getting and setting the size of the item. As
you can see, it’s located at the very beginning of the item, at index
0, and it occupies the sizeof(uintptr_t) - 3
high-order bits of the
field.
Here’s another example:
boolean
item_get_inh_bit(void *item)
{
return (((uintptr_t*)item)[0] & 1) != 0;
}
void
item_set_inh_bit(void *item, boolean in_use)
{
uintptr_t size_field = ((uintptr_t*)item)[0] & (~(uintptr_t)1);
uint8_t tmp = in_use ? 1 : 0;
uintptr_t new_size_field = size_field | tmp;
((uintptr_t*)item)[0] = new_size_field;
}
Now we need to set one single bit. It’s the smallest bit in the field, so we use 1 here.
And another important thing, is the area. The area is the part of the item which is returned to the user. This is how we do it:
void*
item_get_area(void *item)
{
return &((void**)item)[1];
}
When the user returns the area to us, using the mem_free
function,
we need to get the address of the item:
void*
item_from_area(void *area)
{
return &((void**)area)[-1];
}
We assume that the user didn’t cheat and didn’t write beyond the area, which means that we trust him, and our data is still the way how we left it. This is the notion of cooperation, which comes very often in the field of programming.
Now let’s talk about the next structure, which contains items, namely, the cell. Here is its definition:
struct cell {
uintptr_t size;
void *items;
};
It contains the size and the pointer to the first item of the free list. As you can see, there is a size field in the cell structure, as well as a size in the header of the items. It must match whenever the items are in the free list. This duplication is needed because the list might be empty, which means, we need a way to know the size. Also, when items are not in the free list, when they are in use, we need to know where to return them.
struct array {
struct cell *data;
unsigned int size;
unsigned int capacity;
};
Actually, the array is also a structure, and the array is one of its
field: data
. Here also nothing is complicated. As I already
explained, there is a capacity, which tells how many elements can be
stored in the array, and the size
, which tells its current size.
When we get some chunks of memory from the Operating System, we organize them into a linked list in order to be able to free them when needed. So, every time we use the first size-of-pointer bytes of the memory area we receive to store the address of the next chunk. This way, if the Operating System requires us to free all the allocated memory before exiting the application, we free it by using this linked list.
Here are the lines of code that do it:
while (mem_list != NULL)
{
void *tmp = mem_list;
mem_list = *((void**)mem_list);
free(tmp);
}
We need a source for the memory in order to give it to the user. For this we use the memory allocator of the Operating System. But we don’t know how good it is, which means, we should not rely on it too much. We should only use it when we need some memory, and only for large chunks of memory.
For this reason I decided to impose the following rules:
- There is a minimum amount that we should ask from the OS, which is 64 times the size of the pointer.
- When we ask, we ask for a larger amount than the previous time.
- We do not return memory to the Operating System until the very end, when we free everything at the end of the application.
This way we do not assume that the memory allocator of the Operating System is good or efficient. It’s our job to make the memory allocation work. The memory allocator of the Operating system can be extremely simple. It can even not be an allocator at all, but just a pointer in a large amount of memory, since we don’t need any complex functionality from it. It can very well be the Transient Program Area (TPA) of CP/M-80.
The first four items in the array cannot be split because there is nothing smaller. This means, in case we need to allocate a lot of small items, it’s better to have these unsplittable items as small as possible in order to no waste space.
So let’s calculate the minimal size, which will be the size of items of the first free list in the array.
For 64-bit we have blocks of 8 bytes and pointers also of 8 bytes. An item has to contain a header, a next pointer and a previous pointer. Together it makes 24 bytes, which can be stored in 3 blocks. Thus the first size will be 3.
For 32 bits blocks are of 8 bytes (it doesn’t change) and pointers are of 4 bytes. Three pointer-sized fields need 12 bytes, which we can put in 2 blocks (16 bytes). That is, the smallest size on a 32-bit Operating System will be 2.
For a 16-bit Operating System it’s similar. Pointers are 2 bytes and blocks are 8 bytes. I use 2 bytes for pointers because I don’t want to make things complicated by using segments. So we can put 6 bytes (3 fields) into a single block. Which makes the smallest size for items 1.
When we allocate memory, we get a chunk of memory that me might have to split into buddies, and only one of the buddies will be returned to the user.
When we free memory, we insert an item into the array and it’s possible that we might have to merge it recursively with buddies if they are not in use.
So the main problem is to find the buddy of an item. For this we use
two flags: the lr_bit
, and the inh_bit
. The lr_bit
tells us if
the buddy is a left buddy or a right buddy. The inh_bit
is used to
restore the lr_bit
and the inh_bit
of the parent buddy, so that if
we merge, we know if it’s a left buddy or the right buddy.
When splitting an item, we set the inh_bit
of the left child to the
lr_bit
of the parent and the inh_bit
of the right child to the
inh_bit
of the parent. This allows us not to lose the information
when splitting: when we merge we just get this information back from
where we stored it.
Here is an example:
\begin{tikzpicture} \tikzset { treenode/.style = {circle, draw=black, align=center} } \node [treenode] {} child { node [treenode] {L \ 95} child { node [treenode] {L \ 26} } child { node [treenode] {L \ 69} child { node [treenode] {R \ 19} child { node [treenode] {L \ 5} } child { node [treenode] {R \ 14} } } child { node [treenode] {L \ 50} } } } child { node [treenode] {/ \ /} }; \end{tikzpicture}
In this picture we see some examples that illustrate how the
inheritance and the lr_bit
is restored from children. The
inheritance bit is shown by the letter L or R.
Let’s look at the node 69. It’s a right child with left inheritance
bit. When it’s split, the right property goes to the left child as
the inheritance bit and the left property goes to the right child as
inheritance bit. When the children are merged back, both the
lr_bit
and the inh_bit
can be restored.
The same thing happens when we split the node with size 19. It’s a left child and this property is kept by the left child as its left-right bit. The node 14 is the right child, and it’s keeping the inheritance bit.
The root node on the picture actually does not exist. It’s there in
order to show the connection to the fake right node, which does not
contain any area. It’s in_use
bit is set to true in order to stop
the recursion when the children of its left buddy are merging.
Now this is how allocation works. The first step is to determine the number of blocks needed. We are given bytes and we need the number of blocks. Also we shouldn’t forget about the header, the size of which is also included in the size of the item.
Once we have the minimal number of blocks, we look for a suitable item in the array. Perhaps we find it, perhaps we don’t. If we don’t find it, we need to allocate more memory from the Operating System.
Then, after we have our item, which can come either from the array or from the OS, we need to split it as much as possible in order to take as little memory as possible for our needs.
And the last thing is to set the in_use
flag and to calculate the
address of the area (which is actually the address of the item plus
the size of the header). It’s important that the user does not access
the header! So we return the area.
Freeing is more or less the opposite of the allocation: we calculate
the address of the item, set the in_use
bit to false, insert the
item into the array and merge the item with free buddies as much as
possible. It’s important to guarantee that whenever we give an item
to the user, we have a place in the array where to put it when it’s
freed.
The array is a dynamic array: when it reaches its maximal capacity, it is extended by allocating a bigger array and copying the old data into the new array, after which the area occupied by the old array is freed.
One of the important things about the array is that our allocator is used to extend the array. For this reason we have to be sure that the array contains a free list which is able to “accept” our array when we need to allocate a new one. After extending the array, we also initialize the cells for the new array in order to be able to insert it into its free list, but it’s not really a problem, since if the old array could contain enough bytes for the old array, a twice bigger array is more likely to have a cell with size big enough because the size of the array grows linearly, whereas the sizes of free list have a Fibonacci-like growth, which is exponential.
Now I will show how I calculated the defines for the initialization of the array. It’s the same for 64 bits, 32 bits and 16 bits, so I’ll only show the 64 bits.
index | flsz | capacity | array-bytes | area-bytes | store-itself |
---|---|---|---|---|---|
0 | 3 | 1 | 16 | 16 | true |
1 | 4 | 2 | 32 | 24 | false |
2 | 5 | 4 | 64 | 32 | false |
3 | 7 | 4 | 64 | 48 | false |
4 | 10 | 8 | 128 | 72 | false |
5 | 14 | 8 | 128 | 104 | false |
6 | 19 | 8 | 128 | 144 | true |
7 | 26 | 8 | 128 | 200 | true |
8 | 36 | 16 | 256 | 280 | true |
9 | 50 | 16 | 256 | 392 | true |
10 | 69 | 16 | 256 | 544 | true |
As I have already described, the first size in the array will be 3. The column name flsz stands for free list item size in blocks.
The capacity column says how many cells the array contains. And the array-bytes column is the same thing in bytes.
The area-bytes column is how many bytes a free list at the given index can contain. And when this value is greater or equal to the area-bytes value, the column store-itself indicates true. Otherwise it’s false.
So, after the index 6, the array consistently can store itself. And we can be sure of that because the growth of the size of the free lists is greater than the growth of the capacity.
But I have decided to avoid allocating small amounts of memory space.
For 64 bits it’s 512, which corresponds to the row with index 10.
As we see from this table, allocating 544 bytes for the initial array
is completely possible and that’s what I do. Here is the code form
the file mem.c
:
/* 64-bit OS */
#if defined(__x86_64__)
#define MIN_SIZE 3
#define SIZE_1 4
#define SIZE_2 5
#define SIZE_3 7
#define DATA_INIT_BLOCKS 69
#define ARRAY_INIT_SIZE 11
#define ARRAY_INIT_CAPACITY 16
Just for info, this is what I do for 32-bit and 16-bit systems:
/* 32-bit OS */
#elif defined(__386__) || defined(__i386__) || defined(__DJGPP__)
#define MIN_SIZE 2
#define SIZE_1 3
#define SIZE_2 4
#define SIZE_3 5
#define DATA_INIT_BLOCKS 36
#define ARRAY_INIT_SIZE 10
#define ARRAY_INIT_CAPACITY 16
/* 16-bit OS */
#elif defined(__I86__) || defined(__86__)
#define MIN_SIZE 1
#define SIZE_1 2
#define SIZE_2 3
#define SIZE_3 4
#define DATA_INIT_BLOCKS 19
#define ARRAY_INIT_SIZE 9
#define ARRAY_INIT_CAPACITY 16
#else
#error Unsupported Operating System, sorry.
#endif
In order not to be completely dependent on only one Operating System,
and in order to know that the fact that I the defines in mem.c
actually serve their purpose, and because I intend to rewrite this
program in assembly for a 16-bit OS, I decided to compile and to test
this allocator on different Operating Systems.
Luckily Linux can compile both 64-bit and 32-bit binaries and run them through Valgrind. This is very important because it allows me to be more or less sure that at least 64-bit and 32-bit versions work correctly.
I have also added some generated content to the allocated areas with a checksum in order to be sure that it does not get corrupted. This way, even if I can not use Valgrind on a 16-bit OS, the fact that it does not report an error is already a good sign.
So these are the Operating Systems on which I have tested my code:
- 64-bit Linux with GCC
- 32-bit Linux with GCC
- 32-bit Linux with OpenWatcom
- 32-bit Hurd with GCC
- 32-bit ArcaOS with OpenWatcom
- 32-bit DOS with OpenWatcom
- 16-bit DOS with OpenWatcom
I have used two compilers in order to compile this project: GCC and OpenWatcom. The OpenWatcom compiler uses an archaic version of C, where variable declarations have to be first in the block, before any code. So in order to compile, I had to modify a lot of things in a lot of places.
In the end I have decided to have two makefiles, for both compilers, because they are very different and in order to preserve some consistency in each file. The first file is GNUmakefile. It’s for GCC, so that it does not read makefile instead. Unfortunately, OpenWatcom does not have a makefile name for itself, so I had to use “makefile”.
So, this was my little project where I tried to learn a little bit about memory allocation and to implement a generalized Fibonacci memory allocator. Unfortunately it’s not enough to be considered a real memory allocator. A memory allocator has to be able to work with multiple threads, which was not the main purpose of this project. For this reason I could not compare it to real memory allocator nor find a test suite which would tell how bad or how good it is. But for my purposes, namely, to become more familiar with memory allocations, and to be ready to write a memory allocator for a 16-bit single-threaded environment it’s exactly what is needed.
See you next time, bye.