The jemalloc allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.
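As a minimal sketch of the idea (not jemalloc's actual internals; arena_t, NARENAS, and choose_arena() are hypothetical names, and the round-robin assignment is only illustrative), each thread can be pinned to one of several arenas so that concurrent allocations usually take different locks:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    /* per-arena free runs, chunks, statistics, ... */
} arena_t;

#define NARENAS 8
static arena_t arenas[NARENAS];
static _Atomic unsigned next_arena;
static _Thread_local arena_t *my_arena;

/*
 * Assign each thread an arena the first time it allocates; threads
 * running on different processors then mostly contend for different
 * arena locks.
 */
static arena_t *
choose_arena(void)
{
    if (my_arena == NULL)
        my_arena = &arenas[atomic_fetch_add(&next_arena, 1) % NARENAS];
    return (my_arena);
}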
Memory is conceptually broken into equal-sized chunks, where the chunk size is a power of two that is greater than the page size. Chunks are always aligned to multiples of the chunk size. This alignment makes it possible to find metadata for user objects very quickly.
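As an illustration of how that alignment is exploited, the chunk holding any object can be found with a single mask of its address. This is a sketch only; the 1 MB chunk size below is an example value, not a default.

#include <stdint.h>

/* Example chunk size: a power of two larger than the page size. */
#define CHUNK_SIZE ((uintptr_t)1 << 20)

/*
 * Because every chunk starts on a CHUNK_SIZE boundary, clearing the low
 * bits of any pointer yields the base of its chunk, which is where that
 * chunk's metadata lives.
 */
static inline void *
chunk_of(const void *ptr)
{
    return ((void *)((uintptr_t)ptr & ~(CHUNK_SIZE - 1)));
}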
User objects are broken into three categories according to size:
1.  Small objects are smaller than one page.
2.  Large objects are smaller than the chunk size.
3.  Huge objects are a multiple of the chunk size.
Small and large objects are managed by arenas; huge objects are managed separately in a single data structure that is shared by all threads. Huge objects are used by applications infrequently enough that this single data structure is not a scalability issue.
Each chunk that is managed by an arena tracks its contents in a page map as runs of contiguous pages (unused, backing a set of small objects, or backing one large object). The combination of chunk alignment and chunk page maps makes it possible to determine all metadata regarding small and large allocations in constant time.
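The sketch below shows that constant-time lookup, assuming a hypothetical chunk header whose page map stores one run pointer per page; the names and layout are illustrative, not jemalloc's actual structures.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   ((uintptr_t)4096)
#define CHUNK_SIZE  ((uintptr_t)1 << 20)

struct run;                             /* metadata for one page run */

typedef struct {
    /* One entry per page: the run backing that page, or NULL if unused. */
    struct run *map[CHUNK_SIZE / PAGE_SIZE];
} chunk_hdr_t;

/*
 * Constant-time metadata lookup: mask the address to find the chunk,
 * then index the chunk's page map by the page offset within the chunk.
 */
static inline struct run *
run_of(const void *ptr)
{
    chunk_hdr_t *chunk = (chunk_hdr_t *)((uintptr_t)ptr & ~(CHUNK_SIZE - 1));
    size_t pageind = ((uintptr_t)ptr & (CHUNK_SIZE - 1)) / PAGE_SIZE;

    return (chunk->map[pageind]);
}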
Small objects are managed in groups by page runs. Each run maintains a bitmap that tracks which regions are in use. Allocation requests can be grouped as follows (a rounding sketch appears after the list).
•   Allocation requests that are no more than half the quantum (see the Q option) are rounded up to the nearest power of two (typically 2, 4, or 8).
•   Allocation requests that are more than half the quantum, but no more than the maximum quantum-multiple size class (see the S option), are rounded up to the nearest multiple of the quantum.
•   Allocation requests that are larger than the maximum quantum-multiple size class, but no larger than one half of a page, are rounded up to the nearest power of two.
•   Allocation requests that are larger than half of a page, but small enough to fit in an arena-managed chunk (see the K option), are rounded up to the nearest run size.
•   Allocation requests that are too large to fit in an arena-managed chunk are rounded up to the nearest multiple of the chunk size.
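Putting those rules together, a rough sketch of the rounding follows. The quantum, maximum quantum-multiple class, page, and chunk sizes are example values only, the minimum tiny size and chunk-header overhead are ignored, and the run-size rounding is approximated by whole pages.

#include <stddef.h>

#define QUANTUM     ((size_t)16)        /* example quantum ("Q") */
#define SMALL_MAX   ((size_t)512)       /* example max quantum-multiple class ("S") */
#define PAGE_SIZE   ((size_t)4096)
#define CHUNK_SIZE  ((size_t)1 << 20)   /* example chunk size ("K") */

/* Round x up to the nearest multiple of the power-of-two y. */
#define ROUND_UP(x, y)  (((x) + (y) - 1) & ~((y) - 1))

static size_t
pow2_ceil(size_t x)
{
    size_t p = 1;

    while (p < x)
        p <<= 1;
    return (p);
}

/* Map a request size to the size actually allocated, per the rules above. */
static size_t
size_class(size_t size)
{
    if (size <= QUANTUM / 2)
        return (pow2_ceil(size));            /* tiny: power of two */
    if (size <= SMALL_MAX)
        return (ROUND_UP(size, QUANTUM));    /* small: quantum multiple */
    if (size <= PAGE_SIZE / 2)
        return (pow2_ceil(size));            /* sub-page: power of two */
    if (size <= CHUNK_SIZE)                  /* large: fits in a chunk */
        return (ROUND_UP(size, PAGE_SIZE));  /* run size, approximated by pages */
    return (ROUND_UP(size, CHUNK_SIZE));     /* huge: chunk multiple */
}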
Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cache line sharing, round your allocation requests up to the nearest multiple of the cache line size.
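For example, assuming a 64-byte cache line (the actual line size depends on the CPU), a request can be padded before being passed to malloc():

#include <stdlib.h>

#define CACHE_LINE  ((size_t)64)    /* assumed cache line size */

/*
 * Pad the request up to a multiple of the cache line size so that a
 * tightly packed neighbor cannot share the object's last cache line.
 */
void *
cacheline_padded_alloc(size_t size)
{
    size_t padded = (size + CACHE_LINE - 1) & ~(CACHE_LINE - 1);

    return (malloc(padded));
}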