2007-11-29

Saving space

On my Macintosh, there are two sources of disk bloat: international language support and universal binary support. When an application is distributed, it includes a resource folder that contains one or more files for each language the application supports. For example, iChat supports Dutch, English, French, German, Italian, Japanese, Spanish, Danish, Finnish, Korean, Norwegian, Portuguese, Swedish, and two variants of Chinese. Of the 18.5 MB that iChat uses for language support, over 90% is for languages other than English. The same is true of most Macintosh applications, although not all support as many languages as iChat. The language files in my /Applications folder alone currently total 2.5 GB, most of which, perhaps 90% or more, is for languages other than English.
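For anyone who wants to check figures like these on their own machine, here is a rough sketch in Python. It adds up the sizes of all the .lproj folders under a directory, grouped by language, and reports the fraction that is not English. The function names are my own invention, and I am assuming that English resources live in folders named English.lproj or en.lproj; adjust that to taste.

```python
import os
from collections import defaultdict

def lproj_sizes(root):
    """Return {language folder name: total bytes} for every *.lproj under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        # Find the outermost enclosing .lproj directory, if there is one.
        parts = os.path.relpath(dirpath, root).split(os.sep)
        lproj = next((p for p in parts if p.endswith(".lproj")), None)
        if lproj is None:
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                totals[lproj] += os.path.getsize(path)
    return dict(totals)

def non_english_share(totals, english=("English.lproj", "en.lproj")):
    """Fraction of the language-support bytes not in an English folder.

    The default English folder names are an assumption.
    """
    total = sum(totals.values())
    if total == 0:
        return 0.0
    other = sum(size for name, size in totals.items() if name not in english)
    return other / total
```

Calling non_english_share(lproj_sizes("/Applications")) would give the overall fraction for the applications folder. Note that this counts file lengths, not blocks actually allocated on disk, so it will differ somewhat from what du(1) reports.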

Another source of bloat is the universal binary. This is a method, used for executables and object libraries, of packing several different versions of a compiled program into a single file. For example, in the Macintosh world, there can be PowerPC, 32-bit Intel (i386), and 64-bit Intel (x86_64) versions of any compiled program or library. In my /Applications folder, 452 MB are used for executables, with another 82 MB in /usr/lib for dynamic libraries. That makes just over 0.5 GB for this purpose, with around half or more used for non-native CPU architectures.

Personally, this doesn't really bother me that much. I have 100 GB on my hard drive, and newer systems generally have much more than that out of the box. Expending 2.5 GB or so on international and universal CPU support isn't a problem. However, many users are annoyed by this state of affairs, and there have been many hacks proposed to delete foreign language support and to remove non-native CPU support. Unfortunately, these hacks can mess up the software maintenance process in various ways, and a software update can undo the effects of the hack.

In any case, I think that this issue deserves to be taken seriously at the system design level. In my view, a decent compromise would be to allow users to enable auto-compression of the less-frequently used components of their system. What follows is a proposal for a system-level change that could accomplish this fairly easily.

The first and most important thing would be to build expansion of individual compressed files and folders into the software libraries or frameworks, at a low enough level that the process would be transparent to most programs. In effect, there would be a bit right in the inode of a file or directory indicating that it is compressed. (There could optionally be some other bits indicating the compression mode.) The user would see no difference between a compressed and an uncompressed file system resource; the standard frameworks would invisibly expand compressed files. In addition, auto-compressing or uncompressing a file system object should not cause any of the time stamps on the object to change.
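To make the idea a little more concrete, here is a user-level sketch in Python. It is only an approximation of what I am proposing: a real implementation would live in the file system and the frameworks, and would use a flag in the inode. Since I can't set such a flag from a script, the sketch uses a made-up marker at the start of the file (MAGIC) as a stand-in, and all of the function names are hypothetical. The point it illustrates is that compressing or expanding puts the original time stamps back, and that a reader gets the same bytes either way.

```python
import os
import zlib

# Hypothetical marker standing in for the proposed "compressed" inode bit.
# A real file that happened to start with these bytes would confuse this sketch.
MAGIC = b"AUTOCMP1"

def _rewrite_keeping_times(path, data, st):
    """Replace the file's contents, then restore its access and modify times."""
    with open(path, "wb") as f:
        f.write(data)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

def compress_in_place(path):
    """Compress a file without disturbing its time stamps. Returns True if it changed."""
    st = os.stat(path)  # capture the times before touching the file
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(MAGIC):
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))  # undo our own read
        return False
    _rewrite_keeping_times(path, MAGIC + zlib.compress(data), st)
    return True

def expand_in_place(path):
    """Expand a compressed file without disturbing its time stamps."""
    st = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(MAGIC):
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
        return False
    _rewrite_keeping_times(path, zlib.decompress(data[len(MAGIC):]), st)
    return True

def read_transparently(path):
    """What an ordinary program would see: the expanded bytes in either state."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(MAGIC):
        return zlib.decompress(data[len(MAGIC):])
    return data
```

Notice that read_transparently does not restore the access time. That is deliberate: an ordinary read by a program is exactly the kind of use that should count as recent activity.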

For reasons of efficiency, a few programs would work with the compressed objects directly; for example, utilities such as find(1), and GUI file system browsers such as the Macintosh Finder, shouldn't require that files or folders be expanded. That is, there should be support built in for shallow access of auto-compressed file system objects in their compressed state.

However, most programs would trigger auto-expansion of the file system objects they touch. If you open a file in a text editor or word processor, it would auto-expand. If you compile a source file, it would expand. If you execute an executable, it would expand.

Each file system object keeps track of how long it has been since it was last read or written. A system daemon needs to scan the file system at low priority in the background, and compress file system objects that have not been read or written recently, where "recently" can be defined programmatically. For example, the idle time required for an object could be a function of individual files or folders, of file types, of file ownership, the amount of free space on the disk, and so on.
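The scanning half of such a daemon can be sketched in a few lines of Python. This is a sketch under my own assumptions, not a design: the function names are hypothetical, and the rule for shortening the idle time as the disk fills up is an arbitrary linear one that I picked purely for illustration. It relies on the access and modification times that the file system already records.

```python
import os
import time

def idle_seconds(path, now=None):
    """Seconds since the file was last read or written, whichever is more recent."""
    st = os.stat(path)
    now = time.time() if now is None else now
    return now - max(st.st_atime, st.st_mtime)

def idle_threshold(base_seconds, free_fraction):
    """Required idle time, shortened as the disk fills up.

    Arbitrary illustrative rule: the full base time when at least half the
    disk is free, scaling down linearly to a floor of 10% of the base time.
    """
    return base_seconds * min(1.0, max(0.1, free_fraction * 2))

def compression_candidates(root, min_idle_seconds, now=None):
    """List regular files under root that have been idle long enough to compress."""
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symbolic links alone
            if idle_seconds(path, now) >= min_idle_seconds:
                candidates.append(path)
    return sorted(candidates)
```

A daemon built along these lines would call compression_candidates periodically at low priority, with a threshold from something like idle_threshold, and compress whatever comes back. Other inputs, such as file type or ownership, could be folded into the threshold in the same way. One caveat: this only works if the file system actually maintains access times, which is not the case on volumes mounted with access-time updates turned off.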

Under this scheme, all applications and files would be installed in their compressed state, and even the system would have everything compressed initially. As the user began to use the system, things would expand. If a resource wasn't used for a while, then (if this functionality is enabled) it would be auto-compressed by the daemon.

A related functionality would be targeted toward the elements of universal binaries. This system would compress those elements that have not been accessed recently, and uncompress them as needed. 

In effect, there would be a trade-off between disk space and execution time. If the compression is set too aggressively, you'll save lots of disk space, but your system will be spending a lot of time compressing and expanding files, and will be slow. However, a correct balance will buy you disk space but cost very little in time. For example, I do not read any of the non-Latin character set languages, so I would rarely access the .lproj folders associated with them; they would probably all stay compressed at all times. On the other hand, the English files would be accessed frequently and would rarely qualify for auto-compression.

The situation would be even simpler for universal binaries, since in almost all cases, only the native architecture would be used on a given machine. The exception would be a file server with clients of different architectures; in that case, several architectures would be expanded.

This change could be done in a fairly straightforward manner, I think. However, I must admit that I would probably not enable it on any of the Macintoshes that I own or administer. As I stated in the introduction, the amount of space used to support internationalization and universal CPU architectures is small as a percentage of modern disk space. If a given system were actually running out of disk space such that this overhead became critical, the correct solution, in my opinion, would simply be to upgrade the hard drive.
