Automatic garbage collection |
In computing, garbage collection is a system of automatic
memory management which seeks to reclaim memory used by objects
which will never be referenced in the future. It is commonly abbreviated as GC. The part of a system which performs garbage
collection is called a garbage collector.
When a system has a garbage collector it is usually part of the language run-time system and integrated into the language. The
language is said to be garbage collected. Garbage collection was invented by John McCarthy as part of the first Lisp system. When someone refers to a garbage collected language what is meant is that garbage
collection is a feature specified by the language. There are also some languages (mostly formal languages) which would require a
garbage collector in any practical implementation, but whose specifications do not require garbage collection, usually because
they are not intended to be used as practical programming
languages. The lambda calculus is a an example of such a
language.
The basic principle of how a garbage collector works is:
- Determine what data objects in a program cannot be referenced in the future
- Reclaim the storage used by those objects
Although in general it's impossible to know the moment an object has been used for the last time, garbage collectors use
conservative estimates that allow them to identify when an object could not possibly be referenced in the future. For example, if
there are no references to an object in the system, then it can never be
referenced again. This is the criterion used by most modern garbage-collection systems, but see Disadvantages of Tracing
Garbage Collectors below.
While garbage collection assists the management of memory, the feature is also almost always necessary in order to make a
programming language type safe, because it prevents several classes of runtime
errors. For example, it prevents dangling pointer errors, where a reference to a deallocated object is used.
Tracing garbage collectors
Tracing garbage collectors are the most common type of garbage collector. They focus on locating reachable objects,
which are objects that may be referenced in the future, and then discard all remaining objects.
Reachability of an object
More formally, objects can be reachable in only two ways:
- A distinguished set of objects are assumed to be reachable, these are known as the roots. In a typical system these objects
will be the machine registers, the stack, the instruction pointer, the global variables; in other words, everything that a
program can reach directly.
- Anything referenced from a reachable object is itself reachable. (transitivity)
Informally, a reachable object is one that the program could get to by starting at an object it can reach directly, and then
following a chain of pointer references.
Basic algorithm
Tracing garbage collectors perform garbage collection cycles. A cycle is started when the collector decides (or is notified)
that it needs to reclaim storage, which in particular happens when the system is low on memory. A cycle consists of the following
steps:
- Create initial white, grey, and black sets; these sets will be used to maintain progress during the cycle. Initially the
black set is empty, the grey set consists of specially denoted objects known as the "roots" and possibly some additional objects
chosen according to the particular algorithm used, and the white set is everything else. At any time in the algorithm a
particular object will be in exactly one of the three sets. The set of white objects can be thought of as the set of objects that
we are trying to reclaim the storage for; in the course of the cycle the algorithm will remove many objects from the white set,
leaving behind a set of objects that it can reclaim the storage for.
- (This step is repeated until there are no objects in the grey set.) Pick an object from the grey set. Move all the white
objects that are referenced (reachable in one step of following pointers) from the selected object into the grey set. Move the
selected object from the grey set to the black set.
- When there are no more objects in the grey set, then all the objects remaining in the white set are not reachable and the
storage occupied by them can be reclaimed.
The tricolour invariant is an important property of the objects and their colours. It is simply this:
- No black object points directly to a white object.
Notice that the algorithm above preserves the tricolour invariant. The initial partition has no black objects, so the
invariant trivially holds. Subsequently whenever an object is made black any white objects that it references are made grey,
ensuring that the invariant remains true. When the last step of the algorithm is reached, because the tricolour invariant holds,
none of the objects in the black set point to any of the objects in the white set (and there are no grey objects) so the white
objects must be unreachable from the roots, and the system calls their finalisers and frees their storage.
Some variations on the algorithm do not preserve the tricolour invariant but they use a modified form for which all the
important properties hold.
Variants of the basic algorithm
The basic algorithm has several variants.
- There is the issue of whether the garbage collector moves objects in memory (that is, changes their address) or not: moving
or non-moving GC.
- Some collectors can correctly identify all pointers (references) in an object; these are called "exact" collectors, the
opposite being a "conservative" or "partly conservative" collector. "Conservative" collectors have to assume that any bit pattern
in memory is a pointer if (when interpreted as a pointer) it would point into any allocated object. Thus, conservative collectors
may have some false negatives, where storage is not released because of accidental fake pointers. However, this is not a big
drawback in practice.
- There is the issue of whether the garbage collector can run interleaved or in parallel with any of the rest of the system:
the simplest garbage collectors stop the rest of the system while they perform a cycle; they are non-incremental, whereas
incremental collectors interleave their work with units of activity from the rest of the system. Some incremental collectors can
run fully parallel in a separate thread; these can theoretically run on a separate CPU, but
cache issues make this less practical than it may initially appear.
Mark and sweep
Tracing collectors can also be divided by considering how the three sets (of white, grey, and black objects) are implemented.
A mark-sweep GC maintains a bit (or two) with each object to record whether it is white or
black; the grey set is either maintained as a separate list or using another bit. A Copying GC is a
moving GC which, in its simplest form, moves all reachable objects to a single memory area, and then reuses all memory outside
this area. Copying GC has the important benefit of resisting fragmentation.
Generational GC
There is the issue of when to perform a cycle and what objects to place in the initial grey set. A simple collector will
always put only the roots in the initial grey set, and everything else will be initially white.
Statistically speaking, the objects most recently created in the runtime system are also those most likely to quickly become
unreachable. A Generational
GC divides objects into generations and will generally only place the objects of a subset of all the generations into the
initial white set (the grey set being everything else). This can result in faster cycles.
Disadvantages of tracing garbage collectors
The primary problems with tracing garbage collection arise from the nature of how it is invoked. A collection cycle can occur
at any time and can require an arbitrarily long amount of time to complete. While acceptable in many applications, in others for
which timing and quick response is critical such as real-time
applications, it may cause disaster. Incremental garbage collection helps deal with this problem.
The other main problem is that a large amount of garbage can build up very quickly, particularly in functional programming languages, before the garbage collector
has a chance to collect any of it, resulting in allocation inefficiencies, a temporary bloating in the image size, and a long
collection cycle once it occurs; we say that it is not prompt.
To see how tracing garbage collectors are not prompt, consider this example:
a := 1
repeat forever
a := a * 2
repeat a times
x := new object
x := 0
With a naive tracing garbage collector, each time through the loop more and more memory is allocated, until all the memory in
the system is consumed and the garbage collector is forcibly invoked. It's clear why this is unacceptable when we realize that
this code never needs more than the memory required for one object!
Finally, a fault shared by all existing garbage collectors is the conservative assumption that memory is still in use if it is
still reachable. In many systems, references are kept long after they cease to be used. A popular research area is to find less
conservative ways of collecting objects that will never be used again; that is, to safely collect objects to which references
still exist. In systems with standard garbage-collection, freeing of this memory can be accelerated by explicitly destroying
pointers, but this is akin to explicitly freeing memory, defeating the purpose of garbage collection.
As one simple example, consider the command-line arguments passed to the startup function. Typically, the program which uses
them will parse these arguments and then perform a command accordingly, never touching the arguments themselves again. But they
are still kept in memory, because references to them still exist at the bottom of the call stack, in the entry function. Although
these objects occupy minimal memory, similar situations often occur with very large or numerous objects.
All of these are often used as arguments against garbage collection and for explicit memory handling, particularly in
time-critical applications. However, other forms of garbage collection do not necessarily share all of these faults, although the
last is nearly universal, but they may have different faults of their own; all options should be considered when deciding whether
to use garbage collection.
Reference counting
Reference counting is a form of garbage collection where
each object stores a count of the number of references to it, and the object is reclaimed when the count reaches zero. For data
structures which do not involve cycles of reference, such as pointers to strings, it is preferable to other forms of garbage
collection. Even for data structures in which cycles of reference may be present, it's typically able to recover memory more
quickly and in a more timely fashion, despite requiring additional mechanisms to deal with the cycles. See reference counting for more information.
Languages whose standard implementations use automatic garbage collection
External links
|