Compress

1.  About data compression

Data compression makes files smaller while keeping the content usable. We can distinguish cca 4 types of data compression:

  • Archive compression (ZIP, 7-ZIP, RAR, …), always lossless. Files are compressed to save space and later restored. The restored files are supposed to be identical (byte-identical) to the original ones.
  • Executable compression (UPX, APACK, ASPACK, …). The contents of the compressed file get uncompressed / restored when it is loaded; restoring the uncompressed file on disk is partially possible too. The relevant data is compressed and restored losslessly, OTOH it might be impossible to restore the complete file byte-identically to the original - header details and padding will usually differ.
  • Lossless multimedia compression (PNG, FLAC, WAVPACK, …). Multimedia content is stored in file types that include compression. Such files can be used (viewed, played) without decompressing them manually; they are decompressed by the involved app (graphical app, player). But they can also be converted into the original uncompressed file format (WAV, BMP), and the restored content is supposed to be identical (pixel- or sample-identical) to the original one. Disadvantage: much multimedia data is badly compressible or almost incompressible.
  • Lossy multimedia compression (JPG, MP3, Vorbis, MPG, Theora, …). Multimedia content is stored in file types that include compression. Such files can be used (viewed, played) without decompressing them manually; they are decompressed by the involved app (graphical app, player). They can also be converted into the original uncompressed file format (WAV, BMP), but the content will no longer be identical (pixel- or sample-identical) to the original one. The advantages are that badly compressible or incompressible content can still be compressed, much more than with lossless algorithms, and that the compression factor can be set: the higher the compression, the bigger the loss. Lossy compression of images and sound achieves factors of cca 10 with a loss considered acceptable; movies are mostly compressed even more.

See appropriate online resources for further general information about data compression.


2.  Potential pitfalls

Before trying to compress anything, you should be aware of the following things:

  • It is not possible to compress all data. In particular, data that is already compressed is (surprisingly? :-D ) incompressible. Exceptions to this rule do exist but are limited to special cases.
  • Multimedia compression is in many cases lossy. Many (badly designed :-( ) programs save data lossily by default, without warning (mostly as JPG).
  • Do not use encryption before you know how it works and what you intend to achieve. There are already too many “URGENT!!!! HELP needed - forgot my password !!!” requests around.

3.  Algorithms

3.1  Simple algorithms

RLE

RLE (“Run Length Encoding”) is the simplest compression algorithm, based on “holes” (large blocks of identical bytes like 0,0,0,0,0,… or 255,255,255,… but also 119,119,119,…) in the input file. It is defined as an optional feature of the BMP image format, but almost never used there (most BMPs are uncompressed). Most archivers do not use it, probably for practical reasons (difficult to include in one pass ?), however, performing RLE before the more sophisticated (LZ- and Huffman-based) algorithms can result in an improvement. On ordinary files the benefit is not spectacular, but there are special cases, like huge (>10 MiB) “empty” files (one huge “hole”, see above) or almost “empty” files, where Deflate and similar methods perform very badly: they produce a file that decompresses fine back to the original, but is itself well compressible again, breaking the “law” that data can (almost :-D ) never be compressed further by applying the same algorithm again. Preprocessing such files using RLE could improve the compression (smaller output file & also faster).
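
A minimal sketch of the idea in C (illustrative only - this is neither the BMP RLE8 format nor any archiver's actual code): every run of identical bytes is replaced by a (count, value) pair.

    /* Minimal RLE sketch: encode runs of identical bytes as (count, value)
       pairs. Illustrative only - not BMP RLE8, not any archiver's format. */
    #include <stdio.h>
    #include <stddef.h>

    /* returns the number of bytes written to out (2 bytes per run, run <= 255) */
    static size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out)
    {
        size_t i = 0, o = 0;
        while (i < len) {
            unsigned char value = in[i];
            size_t run = 1;
            while (i + run < len && in[i + run] == value && run < 255)
                run++;
            out[o++] = (unsigned char)run;   /* count */
            out[o++] = value;                /* value */
            i += run;
        }
        return o;
    }

    int main(void)
    {
        const unsigned char data[] = { 0,0,0,0,0,0,0,0, 119,119,119, 255 };
        unsigned char packed[2 * sizeof(data)];
        size_t n = rle_encode(data, sizeof(data), packed);
        printf("%u bytes -> %u bytes\n", (unsigned)sizeof(data), (unsigned)n);
        return 0;
    }

On an “empty” file (one huge run of zeros) this already shrinks the data dramatically; on ordinary data it does little, which is why RLE is at best a preprocessing step.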

LZ77

LZ77 (“Lempel-Ziv 77”) was developed in 1977. It searches for repeated occurrences of the same strings and replaces such strings with a code pointing back in the data stream / file to the location where the same string previously occurred, including the length of the matching string. It is mostly used together with Huffman coding, in the Deflate algorithm.
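
A minimal sketch in C of the matching step (illustrative only - a tiny window, a greedy search and no entropy coding; real LZ77/Deflate coders add hash-chain match finders and Huffman coding on top):

    /* Minimal LZ77 sketch: for each position, look for the longest earlier
       occurrence inside a small look-back window and emit either a literal
       or a (distance, length) back-reference. */
    #include <stdio.h>
    #include <string.h>

    #define WINDOW 32   /* tiny look-back window; Deflate uses 32 KiB */

    int main(void)
    {
        const char *s = "abcabcabcabcxyz";
        size_t n = strlen(s), pos = 0;

        while (pos < n) {
            size_t best_len = 0, best_dist = 0, start, j;
            start = (pos > WINDOW) ? pos - WINDOW : 0;

            for (j = start; j < pos; j++) {      /* try every window offset */
                size_t len = 0;
                while (pos + len < n && s[j + len] == s[pos + len])
                    len++;                       /* matches may overlap pos */
                if (len > best_len) { best_len = len; best_dist = pos - j; }
            }

            if (best_len >= 3) {                 /* long enough for a reference */
                printf("(back %u, len %u) ", (unsigned)best_dist, (unsigned)best_len);
                pos += best_len;
            } else {                             /* otherwise emit a literal */
                printf("'%c' ", s[pos]);
                pos++;
            }
        }
        putchar('\n');
        return 0;
    }

For the sample string this prints the literals a, b, c, one back-reference (back 3, len 9) covering the repetitions, and the literals x, y, z - exactly the redundancy that Deflate later encodes with Huffman codes.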

LZ78, LZW84

Developed in 1978 and 1984, based on LZ77 and trying to improve it. Both used to be patented, but now (since 2006) all possible patents have expired. LZ78 (“Lempel-Ziv 78”) was never popular, unlike LZW84 (“Lempel-Ziv-Welch 84”), which was used in the “.Z” compressor, GIF images, PKARC, old versions of PKZIP, and PDF documents. Adobe implemented LZW84 (among other patented algorithms) in its PDF document format, and did so by intention: they had a license to use the algorithm and wanted to keep their “exclusive rights” on PDF this way. Big problems came up when it was found out (in 1995 and 1999, after many years of using LZW84 and assuming it was “free and safe”) that it was patented. Now LZW84 is highly obsolete: the “.Z” compressor/format was replaced by GZIP (Deflate algorithm), later by BZIP2 or LZMA / 7-ZIP; LZW84 in ZIPs (0.xx and 1.xx versions) was also replaced by Deflate, later also by the 7-ZIP archiver; GIF images were replaced by PNG (also the Deflate algorithm); finally, in PDF documents Deflate was also introduced (in PDF 1.2) and is preferred, though LZW84 remains an option and part of the PDF file format.

LZX

A derivative of the LZ77 algorithm, invented in 1995 by Jonathan Forbes and Tomi Poutanen for an archiver of the same name for the Amiga computer. Originally shareware, it was abandoned in 1997 and turned into freeware, but the source code was never released. In 1996 Forbes went to work for Microsoft and “brought” the algorithm (with tiny modifications) to several of Microsoft’s file formats, including CAB (“Cabinet”) installation packages, CHM (“Compiled HTML”) documentation files, and WIM installation packages (used in Vista). The format of those files as well as the algorithm are sufficiently well known and publicly documented, and several independent tools (including 7-ZIP) can extract them, but nothing supports the original Amiga LZX archives.

Huffman

The Huffman algorithm is based on the fact that different values occur in significantly different amounts inside a file. In a text file, for example, the lowercase letters “a”, “e”, “s” occur much more frequently than “Q”, “X”, and even more so than most non-alphanumeric values. The algorithm assigns to every input byte value (8 bits in size) a variable-length (cca 2…15 bits) output code, while keeping the operation reversible, and - that being the whole point - making the output file smaller than the input one (many of the short (2…6 bits) codes, very few long (>8 bits) ones).
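
A minimal sketch in C (illustrative only - a simple O(n^2) tree construction, not the canonical-code machinery real archivers use) showing how frequent bytes end up with short codes:

    /* Minimal Huffman sketch: count byte frequencies, repeatedly merge the two
       least frequent live nodes, then read each symbol's code length as its
       depth in the resulting tree. */
    #include <stdio.h>

    #define SYMS  256
    #define NODES (2 * SYMS - 1)

    static unsigned long freq[NODES];
    static int parent[NODES];   /* 0 = no parent (real parents are always >= 256) */
    static int alive[NODES];

    int main(void)
    {
        const char *sample = "abracadabra abracadabra";
        int i, j, len, n = SYMS;

        for (i = 0; sample[i]; i++)             /* step 1: histogram */
            freq[(unsigned char)sample[i]]++;
        for (i = 0; i < SYMS; i++)
            alive[i] = freq[i] > 0;

        for (;;) {                              /* step 2: merge two rarest nodes */
            int a = -1, b = -1;
            for (i = 0; i < n; i++)
                if (alive[i] && (a < 0 || freq[i] < freq[a])) a = i;
            for (i = 0; i < n; i++)
                if (alive[i] && i != a && (b < 0 || freq[i] < freq[b])) b = i;
            if (b < 0) break;                   /* only the root is left */
            freq[n] = freq[a] + freq[b];
            parent[a] = parent[b] = n;
            alive[a] = alive[b] = 0;
            alive[n] = 1;
            n++;
        }

        for (i = 0; i < SYMS; i++) {            /* step 3: code length = depth */
            if (!freq[i]) continue;
            for (len = 0, j = i; parent[j]; j = parent[j]) len++;
            printf("'%c'  count %lu  ->  %d bit code\n", i, freq[i], len);
        }
        return 0;
    }

Running it on the sample text shows “a” getting the shortest code and the rare letters the longest ones; the result stays decodable because no code is a prefix of another.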

Arithmetic, Range

Two almost identical algorithms, but with one critical difference: “Arithmetic” is patented (by IBM ?), while the “Range” algorithm is considered unpatented. Both are improvements over Huffman: they compress better but slower than it, and both give almost the same results in speed and output file size.

3.2  Combined algorithms

These algorithms combine several of the previously named simple algorithms, partially including some additional processing (which provides no compression on its own).

Deflate

Invention & General

A very popular algorithm doing LZ77 and Huffman in one pass. Used in “standard” ZIP (PKZIP 2.xx) archives, GZIP compressed files, PNG images, some (older) Windows installers and PDF documents (since PDF 1.2). Good algorithm descriptions with sample sources are available (RFC 1951 from 1996-May). Quite “cheap”, usable on CPUs down to an Intel 8086 with a few MHz and 512 KiB RAM.
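
A minimal round-trip sketch in C (assuming the widely available zlib library, which implements Deflate; link with -lz):

    /* Deflate a small buffer with zlib's one-call helpers and restore it. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const unsigned char in[] = "Deflate = LZ77 + Huffman, see RFC 1951.";
        unsigned char packed[256], restored[256];
        uLongf packed_len = sizeof(packed);
        uLongf restored_len = sizeof(restored);

        if (compress(packed, &packed_len, in, sizeof(in)) != Z_OK)
            return 1;                      /* runs the whole Deflate pipeline */

        if (uncompress(restored, &restored_len, packed, packed_len) != Z_OK)
            return 1;                      /* restores the byte-identical data */

        printf("%lu bytes -> %lu bytes, identical: %s\n",
               (unsigned long)sizeof(in), (unsigned long)packed_len,
               memcmp(in, restored, sizeof(in)) == 0 ? "yes" : "no");
        return 0;
    }

Note that such a short, fairly random string may not shrink at all - Deflate needs repetitions and skewed byte frequencies to gain anything.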

Late life

Deflate is the most popular compression algorithm of all time. Although its compression is substantially inferior to newer algorithms like LZMA, it is still popular after more than 20 years of life. In some fields of usage it has been mostly replaced (Windows installers), in some fields partially (Linux source packages), while in some fields it is still dominant (lossless image compression - PNG, most trouble-free and compatible archives - ZIP). Several attitudes towards its inferiority can be observed:

  • ignore the inferiority - continue using old tools like PKZIP 2.xx or old code
  • improve it - PKWARE introduced (already in 1998) Deflate64, providing a minor improvement in compression. The algorithm is “internally” very similar to Deflate, it just uses a 64 KiO look-back distance instead of the previous 32 KiO (optimized for 8-bit and 16-bit CPU’s); nevertheless it breaks compatibility.
  • replace it - use for example LZMA instead
  • optimize compression while staying compatible - over the years many alternative implementations of Deflate appeared (KZIP&PNGOUT, 7-ZIP’s one, WinRK’s one, Zopfli, … see below), usually providing a bit better compression than PKZIP and Info-ZIP. The benefits are smaller and perfectly compatible files, the disadvantage is the much longer time needed to compress.

See below about implementations.

BWT/BZIP/BZIP2

BWT (“Burrows-Wheeler-Transform”) itself does not compress, it only “mixes” data in a “magic” way to make it more compressible with LZ- and Huffman-style algorithms. It was developed by Mike Burrows and David Wheeler. It performs best on huge source code packages; the drawback is that it is more vulnerable to “special” (highly repetitive) input data than other algorithms. BZIP and BZIP2 are compressors/algorithms written and maintained by Julian R. Seward. In 1995 BZIP was released, a compressor based on the BWT, LZ77 & Arithmetic compression algorithms. J.R. Seward soon found out that the Arithmetic algorithm is patented and had to drop it and go back to classical Huffman :-( , worsening the compression by cca 1%, and named the result BZIP2. It is released under a liberal open source license and is unpatented. BZIP2 is positioned somewhere between Deflate and LZMA in compression factors achieved, speed and memory requirements. A few MiO of RAM are required for compression (depends on settings), less for decompression. It could work down to an 80286 with XMS or an 8086 with EMS. There are also some experimental stronger compressors based on BWT, but BZIP2 is the only implementation with practical use.
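
A minimal sketch in C of the forward transform (illustrative only - real BZIP2 uses far more elaborate sorting plus MTF, RLE and Huffman stages): sort all rotations of the input and output the last character of each.

    /* Forward Burrows-Wheeler transform, naive version. To invert it, the
       index of the original rotation would also have to be stored. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *text;
    static size_t len;

    /* compare two rotations identified by their starting offsets */
    static int cmp_rot(const void *a, const void *b)
    {
        size_t i = *(const size_t *)a, j = *(const size_t *)b, k;
        for (k = 0; k < len; k++) {
            unsigned char ci = text[(i + k) % len], cj = text[(j + k) % len];
            if (ci != cj)
                return (int)ci - (int)cj;
        }
        return 0;
    }

    int main(void)
    {
        size_t idx[64], k;

        text = "bananabananabanana";       /* repetitive input groups nicely */
        len = strlen(text);

        for (k = 0; k < len; k++)
            idx[k] = k;                     /* one entry per rotation */
        qsort(idx, len, sizeof(idx[0]), cmp_rot);

        for (k = 0; k < len; k++)           /* last column of the sorted matrix */
            putchar(text[(idx[k] + len - 1) % len]);
        putchar('\n');
        return 0;
    }

The output contains long runs of identical characters, which is exactly what makes the following RLE/Huffman stages of BZIP2 effective.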

LZMA

Invention & Usage

LZMA (“Lempel-Ziv-Markov-chain-Algorithm”) is an improvement of Deflate, developed by Igor Pavlov, used in eir 7-ZIP product. Instead of the “old” Huffman, the “Range” algorithm is used, and instead of LZ77 with a 32 KiO sliding window, sophisticated match finders supporting dictionaries many MiO in size are used; a “Markov chain” algorithm is also involved. Later the algorithm “leaked” into other products, most notably the UPX executable packer, and the NSIS and INNOSETUP installers (both Win32 only, but at least extractable in DOS also).

Description, LZMA2

Unfortunately, no good algorithm description is available so far. For years the only sources of information were the source code of the 7-ZIP program, the LZMA “toy” compressor (intended to grow into a serious GZIP/BZIP2 replacement, but maybe this will never happen because of competing attempts named XZ and LZIP), and the LZMA SDK source, all written in “C++”, with all the speed optimizations and using multithreading (optional only ?); besides this there was a “simplified but compatible” ANSI-C source of the decompression only. Things changed a bit with 7-ZIP version 4.58 beta: Igor rewrote the LZMA compression and decompression code from C++ into “plain” C (reason: performance, the rest of the 7-ZIP application remains in C++), so now the plain C code is the “reference” implementation, but there is still no text in a “human” language. The LZMA algorithm is excellent for highly compressible data, OTOH it doesn’t perform well on incompressible or badly compressible data. In practice, LZMA will expand such data more frequently and by more than, for example, Deflate. The LZMA2 algorithm (available in 7-ZIP versions 9.xx, the earliest stable one being 9.20 from 2010-Nov) supports uncompressed blocks, addressing this problem of the existing LZMA; it is of course incompatible with it and not extractable with older versions of 7-ZIP. For years the license of the LZMA SDK was GNU LGPL with a few very minor exceptions; with version 4.61 beta of 7-ZIP and the LZMA SDK, it was changed to Public Domain.
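
For orientation, a one-call compression sketch against the LzmaLib wrapper of the LZMA SDK. Hedged: the prototype below is written from memory of LzmaLib.h and the parameter values are the defaults documented there - verify both against the SDK version you actually use.

    /* Sketch only: LzmaCompress() from the LZMA SDK's LzmaLib wrapper.
       Check the real LzmaLib.h - this call is reproduced from memory. */
    #include <stdio.h>
    #include <string.h>
    #include "LzmaLib.h"

    int main(void)
    {
        unsigned char src[4096], dest[8192], props[5];
        size_t destLen = sizeof(dest), propsLen = sizeof(props);
        int rc;

        memset(src, 'A', sizeof(src));        /* highly compressible input */

        rc = LzmaCompress(dest, &destLen, src, sizeof(src),
                          props, &propsLen,   /* 5 coder property bytes */
                          9,                  /* level 0..9 */
                          1 << 24,            /* dictionary size, 16 MiO */
                          3, 0, 2, 32,        /* lc, lp, pb, fb defaults */
                          1);                 /* single thread */
        if (rc != SZ_OK) {
            fprintf(stderr, "LzmaCompress failed: %d\n", rc);
            return 1;
        }
        printf("%u bytes -> %u bytes\n", (unsigned)sizeof(src), (unsigned)destLen);
        return 0;
    }

The props bytes plus the uncompressed size have to be stored alongside the compressed data, which is exactly what the .7z, .xz and .lzma container formats take care of.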

Cost (technical)

The algorithm is a bit of a memory hog - cca 64 MiO of RAM is the minimum for reasonable compression, and the CPU should be at least an 80486 with 50 MHz - not suitable for very old PCs. The decompression is much “cheaper” and can be performed with a few MiB of RAM (depends on the dictionary size set while compressing) on an 80386 or even below with XMS or EMS, if someone ported the code to such systems. Even “worse”, UPX reportedly can decompress LZMA even on an 8086, very slowly, when using a very small dictionary.

PPMD

Algorithm by Dmitry Shkarin . Project page: compression.ru/ds . Implemented in 7-ZIP, optionally can be used instead of LZMA.

Misc

Misc …

3.3  Multimedia algorithms

Multimedia algorithms …

See appropriate online resources for further information about multimedia compression algorithms (DCT / MDCT / IDCT, Wavelet WVT / IWVT).


4.  Encryption

Besides compression, many archivers also offer encryption of data. Both symmetric and asymmetric encryption exist; “classical” archiving with a password uses symmetric encryption. Many older archivers (ARC, PKZIP 2.xx) use poor algorithms and have critical weaknesses; newer products (7-ZIP) are theoretically very secure, moving the risk to other factors, like the “human” factor and usage of risky OSes (like Windows). If poor encryption is sufficient (hiding viruses from antivirus programs, for example :-D , or texts from text search), the PKZIP 2.xx algorithm is preferable. For secure encryption, 7-ZIP is the right product: use the 7-ZIP archive format, a good password, and do so on DOS.
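
As a command-line sketch (assuming the 7-ZIP console program, here called 7za; check its built-in help for the exact switches of your version):

    7za a -t7z -p -mhe=on secret.7z mydocs

“-p” asks for a password (AES-256 in the 7-ZIP format), and “-mhe=on” additionally encrypts the archive headers, so even the file names are not readable without the password.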

See appropriate online resources for further information about encryption.


5.  Processing steps

The “steps” are rather theoretical - in most archivers they are all performed automatically in one pass; on Linux, piping is partially used to “chain” them.

5.1  Non-Solid archiving

  • compress every single file
  • encrypt the files, every file separately [OPTIONAL]
  • compose them together
Examples: ARC, ZIP.

5.2  Solid archiving

  • compose the files together
  • compress the resulting big “file”
  • encrypt the file [OPTIONAL]
  • add redundancy for recovery [OPTIONAL]

Solid archiving can be:

  • 2-stage

The big file resulting from composing the files can be saved and accessed by the user. Example: TAR followed by GZIP or BZIP2. The TAR file is accessible.

or

  • 1-stage.

The big “file” is not accessible, all steps do occur in memory in one pass. Examples: RAR, ACE, 7-ZIP with solid option on.
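
To make the difference concrete, a command-line sketch (assuming the usual tar, gzip and 7-ZIP console tools; binary names may differ between ports):

    Non-solid (each file compressed separately inside the ZIP):
        7za a -tzip backup.zip mydir

    2-stage solid (the intermediate TAR file stays accessible):
        tar cf backup.tar mydir
        gzip -9 backup.tar

    1-stage solid (one pass in memory, no intermediate file):
        7za a -t7z -ms=on backup.7z mydir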


6.  Archivers

6.1  ARC

The original: ARC = ARChive. Released 1986, maintained by two (!) companies: PKWARE as “PKARC” and SEA as “SEA-ARC”. Provided as shareware exclusively for the only acceptable OS that everybody had at that time: MS-DOS. Copyright conflicts between PKWARE and SEA resulted in the death of the ARC format and the SEA company in 1990, while PKWARE introduced the ZIP file format and the PKZIP+PKUNZIP programs, and quickly became very popular with those.

Some more info on Wikipedia: en.wikipedia.org/wiki/Phil_Katz en.wikipedia.org/wiki/ARC_(file_format)

ARC related post by Rugxulo : mail-archive.com/freedos-user…10457.html

6.2  ZIP

Format creation by PKWARE, PKZIP for DOS

Introduced by PKWARE as a replacement for ARC. After some experiments with the ZIP file format and hacking on the compression algorithm in versions 0.xx and 1.xx, PKZIP 2.04 for MS-DOS was released in 1993, as shareware again. It supports file sizes up to 4 GiB, the Deflate compression algorithm, CRC32 checksums for integrity verification, and a sort of “encryption”, which however is rather poor, see also the “Encryption” section. It also tries to achieve maximum speed by using, if available, EMS, XMS, DPMI32, and 32-bit 80386 or 80486 instructions. On the other hand, it should work on an 8086 too.

Info-ZIP

A free and open source implementation of the Deflate algorithm and the ZIP file format, available under a BSD license. The (useless) “encryption” was originally not included (available only as a separate patch for the source code) because of US “cryptography export” restrictions; later the restrictions were relaxed, allowing it to be included.

WinZIP

PKWARE had the intention to maintain the ZIP standard exclusively; however, they made the file format and the algorithms open (the speed-optimized code was always closed source). This made ZIP a quasi-standard of archiving and allowed the development of ZIP-compatible archivers by other people and companies, but it also allowed some people to create the “WinZIP” product, which, having the magic word “Win” in the name, “hijacked” the standard, made “WinZIP” the most popular archiver and turned PKZIP into a rather marginal and historical thing. WinZIP started with 16-bit code on Win 3.xx and changed to 32-bit with Win95, but it required PKZIP & PKUNZIP for many years - it did not contain any compression code at all :-D - until finally, very late (version cca 7 ???), Deflate code (picked from the Info-ZIP project) was added, removing the PKZIP & PKUNZIP requirement. For some time other DOS packers and unpackers (ARC, LHA) were “supported” (allowing WinZIP to “support” those formats); finally they also got dropped, and in the meantime various compression algorithms were added (picked from open source libraries with sufficiently liberal licenses). Interesting: old WinZIP self-extractors are dual-mode executables, working on DOS and Windows (file structure: MZ … NE … PK !!!). PKWARE then also changed to “Windows” (native Win32 console and GUI binaries), but late - “too late” as many people say.

Competitors beating ZIP format, “extended” ZIP, zipx

Besides WinZIP, other “Win”-based archivers were created, especially WinRAR, offering new archive formats with better compression, solid archiving, stronger encryption and redundancy/recovery. With a big delay, the PKWARE & WinZIP maintainers tried to react to the competition and introduced, partially independently, “extensions” to the ZIP standard: Deflate64 for (marginally) stronger compression, later the BZIP2 and finally the PPMD and LZMA algorithms, ZIP64 for files > 4 GiB, additional encryption algorithms (RC2, RC4, DES, 3DES, finally Rijndael (AES), in 2 different incompatible implementations :-D ), and special handling of some multimedia files (the WAVPACK algorithm for WAV files, and an algorithm for lossless recompression of lossy JPG pictures). As a result, they generously messed up the ZIP standard. A “ZIP archiver” supporting all these formats and extensions is bloated, complicated and very difficult to make and keep bug-free. Even later (meaning: “too late”) the zipx file format was defined - it is just a “ZIP” with any of the aforementioned and already previously implemented extensions (except Deflate64 or ZIP64 ???). Those extensions have been mostly (not fully) implemented in the 7-ZIP archiver, and they are also slowly leaking into Info-ZIP and other archivers.

From “ http://www.winzip.com/comp_info.htm ” :

“ The PPMd compression format was introduced in WinZip 10.0 Beta, released in August 2005. The WavPack compression format was introduced in WinZip 11.0 Beta, released in October 2006. The compressed Jpeg format was introduced in WinZip 12.0, released in September 2008. In WinZip 12.1, released in May of 2009, the Zipx file was introduced. The Zipx file is a Zip file that uses any of the aforementioned compression methods or the LZMA or bzip2 compression methods as documented in the Zip file appnote.txt specification. ”

PKZIP and DOS

The latest DOS version of PKZIP is 2.5, released in 1999. It is optimized for running in faked “DOS” boxes (Win98, with LFN), supports newer CPUs (Pentium, should run faster on them), and, as an undocumented feature, it can extract files compressed with Deflate64. Unfortunately it seems to have problems with XMS/DPMI/CPUID handling - it can misdetect a Pentium as an 80486 and, even worse, crash in some situations with XMS and DPMI present - this problem should be fixed in HIMEMX 3.32, so avoid older versions, most notably the “official” FreeDOS 1.0 HIMEM 2.26, at least together with PKWARE’s tools. It cannot compress with Deflate64 and also does not improve the compression compared to version 2.04 - the new “-exx” switch has no effect on most files. Another known problem: PKUNZIP for DOS (all versions) and 2.50 for Win32 may falsely refuse to extract ZIPs created on Linux.

Other ZIP archivers, KZIP

The archivers supporting the (standard) ZIP format vary in compression performance and achieved size reduction. PKZIP is fast and has good compression, but still leaves space for improvements (see above, “Late life” of Deflate). Using 7-ZIP one can increase the compression effort (still referring to standard ZIP) and achieve better compression while keeping compatibility. There is also a product named KZIP (right: there is no “P”) written by Ken Silverman, closed source freeware. It offers probably the best PKZIP 2.xx compatible compression, at the cost of speed. It runs in DOS using the HX-DOS Extender. One more interesting product is TUNZ, an UNZIPper written in ASM (no source release yet), only 2.5 KiO in size (DOS .COM executable), 8086 compatible. It however has some limitations regarding the number of files in the archive and subdirectories, and can only extract all files of the archive together, no “selective” extract.
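
For example, a sketch of squeezing a standard ZIP harder with the 7-ZIP console version (the created archive stays PKZIP 2.xx compatible, it just takes longer to produce):

    7za a -tzip -mx=9 archive.zip files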

Other usage of the ZIP format

The ZIP format is also used for “other” file types, most notably JAR Java packages, OpenOffice ODT documents, and DOCX documents of MS Office / Word 2007 and newer, see DocumentFormatsViewers. So any archiver supporting ZIP can extract those files; still, this doesn’t mean that the result will be human-readable text, but usually you can at least extract images this way.
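
A minimal sketch in C (illustrative only) of “looking into the file”: every ordinary ZIP container - and therefore JAR, ODT and DOCX too - starts with the local file header signature “PK” 0x03 0x04.

    /* Detect the ZIP local-file-header signature at the start of a file.
       Note: empty ZIPs and self-extracting ZIPs start differently. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        unsigned char magic[4];
        FILE *f;

        if (argc < 2) { fprintf(stderr, "usage: iszip FILE\n"); return 2; }
        f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 2; }
        if (fread(magic, 1, 4, f) == 4 &&
            magic[0] == 'P' && magic[1] == 'K' &&
            magic[2] == 0x03 && magic[3] == 0x04)
            printf("%s looks like a ZIP container\n", argv[1]);
        else
            printf("%s does not start with a ZIP local file header\n", argv[1]);
        fclose(f);
        return 0;
    }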

6.3  RAR

Developed by Eugene Roshal ( RAR = Roshal’s ARchiver ) and maintained by em up to now. Started cca 1995 as “RAR” for DOS and soon moved to “Windows”, then named “WinRAR”. It has always been an innovative product: it introduced better compression than ZIP (and is still improving) at acceptable speed, one-stage solid archiving, redundancy data for recovery and strong (closed source :-( ) encryption. The algorithm is and always was proprietary and closed source, and the RAR and WinRAR products are shareware, but there is a freeware UNRAR program available for different platforms, including DOS. The UNRAR code is also open source, with the restriction that you may not use it to reconstruct the RAR algorithm from it. A minimal command-line RAR for DOS is available, also as shareware, at the same cost as WinRAR with its expensive GUI. With the free 7-ZIP available and also working in DOS, RAR has become quite obsolete.

6.4  ACE

Developed by Marcel Lemke ( ACE = Advanced Compression Engine ??? ) in cca 1996. It used to be an innovative product providing very good compression at acceptable speed and some other benefits. Versions 1.x did provide a free & open source UNACE, versions 2.x no longer do (due to a license change). This license issue, together with the rise of 7-ZIP, made ACE’s popularity sink and the product and file format obsolete.

6.5  7-ZIP

Creation

Developed by Igor Pavlov in the late 1990s and based on eir LZMA compression algorithm. After the year 2000 the product became stable and usable, and its popularity has been slowly increasing ever since. It supports its own 7-ZIP archive format as well as some other popular formats. The “standalone” console version supports: 7-ZIP, ZIP (with some of the new and obsolete extensions, like Deflate64), GZIP, BZIP2, TAR, Z (very obsolete, LZW84 algorithm, extract only). The DLL-based console version and the Win32 GUI one also support some additional archive formats, like RAR (extract only), CAB and WIM (extract only, since 4.57), FAT and NTFS (since 9.20, hard disk filesystems) and ISO (CD filesystem, partially, extract only). The Win32 GUI version provides a simple 2-panel file manager (WinZIP freaks do not like it :-D ).

Later history

The latest 3.xx version was 3.13 from 2003-Dec-11, then the 4.xx line began; the latest 4.xx version is 4.65 from 2009-Feb-03. Matching the year 2009, Igor decided to bump the major version number to 9, and the only stable version of that line is 9.20, released 2010-Nov-18. In 2015 the major version number was bumped to 15; the only stable versions are 15.12 from 2015-Nov-19 and 15.14. Meanwhile version 16 is out.

7-ZIP archive file format

The 7-ZIP archive format provides excellent compression using the LZMA algorithm (also LZMA2 since versions 9.xx), alternatively also PPMD, BZIP2 or Deflate, one-stage solid archiving (risky, not everybody likes it), strong encryption (Rijndael algorithm, 256-bit key, a large number of SHA-256 iterations for key derivation - 512 Ki of them as of version 4.58, an amount supposed to grow in the future while keeping compatibility with older versions of 7-ZIP), and support for unreasonably huge file sizes (many TiB).

7-ZIP and DOS

Unfortunately, Igor Pavlov never provided a DOS version, only a Win32 GUI and a Win32 console one. Also, the so-called “standalone” console version uses multithreading and cannot be compiled for DOS in a trivial way. But an external developer, Japheth, created the “HX-DOS extender” product, allowing many Win32 console apps to be used in DOS, even those with multithreading. 7-ZIP was one of eir privileged apps and ey made it work excellently in DOS. Another developer, Blair, also got 7-ZIP working in DOS in a different way - ey took the “p7zip” product, the POSIX (Linux and similar systems) version of 7-ZIP, performed some minor fixes in the source and recompiled it with DJGPP and its “pthreads” emulation library. The result works, there are only minor problems, like a lazy progress indicator and bloated executable size. In the past ey ported versions 4.32, 4.33beta and 4.42 (4.42 is included in the FreeDOS 1.0 distribution; all those versions are no longer (separately) available ?), later 4.55 and 4.57 became available; other people who have compiled and released some [p]7-ZIP ports for DOS are Mik & Rugxulo, see links below. Actually, since 7-ZIP / p7-ZIP v. 4.37, DJGPP is one of the “official” platforms supported by the “p7zip” project. Still, those various ports expose various problems, like incompatibility with HDPMI32 (disable DPMI 1.0 ???), not working on FreeDOS (???), and creation of ZIPs tagged as “created on Linux” (PKUNZIP refuses to extract them, other UNZIP tools are fine, other archive formats don’t expose this problem). Also the DLL-based “full” command-line version, supporting those additional formats, works in DOS using HX-DOS (seems to work, not tested too much). So far there is no benefit from 7-ZIP’s huge file size support in DOS; maybe one day file sizes up to 256 GiB (is it poor ? :-D ) will be possible in DOS, and only on FAT32+ partitions, after FAT32+ support is added to the DOS kernel (Udo Kuhnt’s EDR-DOS has it implemented since the 2006 August WIP, FreeDOS not yet) and to the HX-DOS extender (not yet done). By now the limit is 2 GiB; usage of 2 GiB…4 GiB files in DOS is sort of possible on FAT32 but problematic. Because of its support of other (than 7-ZIP) formats, the 7-ZIP archiver almost obsoletes the ZIP (PKZIP & PKUNZIP, Info-ZIP), GZIP, BZIP2 & TAR archivers, if one accepts the need for HX-DOS and the CPU requirements (down to cca 80486 - 4.58 is verified to work on an 80486 SX without FPU, no tests on 80386).

6.6  TAR, Z, GZIP, BZIP-2

These archivers originate from the Unix/Linux world and are still very popular there. Some people speak of a “Linuxed archive” when seeing such a file; however, they are in (almost) no way specific to Linux and are well usable on other systems and on DOS too. TAR does not compress files, it only composes them together; the resulting file is supposed to be compressed using Z (very obsolete, using the LZW84 algorithm), GZIP (GZ, Deflate algorithm), BZIP2 (newer) or LZMA (newest, but the format for LZMA files (unlike the LZMA compression algorithm and the 7-ZIP file format) is not yet finalized). This is 2-stage solid archiving. There are also archivers performing TAR and the compression in one pass (“piping”) without storing the huge TAR file on disk. It is also possible to compress the TAR with other archivers, like ZIP (benefit: weak “encryption” possible, while GZIP has none) or 7-ZIP (benefit: better compression, possible strong encryption). TAR, GZIP and BZIP2 all have 32-bit DOS ports compiled with DJGPP, TAR and GZIP also some 16-bit ports or clones. The current version of BZIP2 is 1.0.6 from 2010-Sep-20; however, there haven’t been any spectacular changes (compression improvements) for years, since cca 1.0.2. It requires an 80386 CPU and some MiO of RAM; at least theoretically it could also run on an 80286 or 8086 with XMS or EMS if someone ported the code to such systems. TAR and GZIP are “cheap” enough to run even on an 8086 with 512 KiO RAM. The best way to handle these archives on newer PCs (80486 and above) is the 7-ZIP archiver, which supports them all, plus ZIP and 7-ZIP additionally.
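
A piping sketch (assuming Unix-style tar and gzip; on plain DOS, without real pipes, the 2-stage variant from the “Processing steps” section is used instead):

    tar cf - mydir | gzip -9 > mydir.tar.gz      (pack and compress in one pass)
    gzip -dc mydir.tar.gz | tar xf -             (decompress and unpack in one pass)

No intermediate TAR file is ever written to disk, yet the result is an ordinary .tar.gz that any of the tools above can handle.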

6.7  Misc archivers

There is a huge number of other archivers available that never had, or no longer have, big popularity; a few examples:

  • LHA (old, DOS, had some popularity, now obsolete)
  • ARJ (old, DOS & Win32 console, had some popularity in the past, used for Arachne APM’s, see Browsers)
  • PAQ (GPL, new, very strong, experimental, slow, memory hog)
  • WinRK (commercial, new, slow, memory hog, buggy, Win32 GUI only  :-(  )
  • KGB (GPL, based on PAQ, offensive name, some additions to PAQ, especially the Win32 GUI  :-(  ), …
  • UHARC (closed source freeware, new, DOS & Win32 console, product maybe good, but “marketing” very bad)
  • SIT (commercial, mostly/originally MAC, also Windows GUI, recompressing JPG (?), see paragraph about JPG in GraphMediaTech )

6.8  Deflate optimization tools

Stay compatible and become smaller at same time

See above “Late life” of Deflate.

KZIP and PNGOUT

(see above about KZIP, GraphMediaTech about PNGOUT)

DeflOPT

Freeware, closed source. It takes already compressed data as input (ZIP, GZIP, PNG). It is a bit of a mystery what it does and how, but it does work and is lossless.

Efficient Compression Tool

Zopfli

A compression library; some binaries that create or optimize ZIP, GZIP and PNG files are available or can be compiled.


7.  I got a file of “.XXX” type - what now ?

If someone sends you a file or you find a file compressed with an obsolete or unpopular/unknown archiver:

  • Some seemingly obscure archives are actually ZIP (most notably JAR - Java archive) - look into the file (text + hex dump) before despairing or trying anything else; misnaming of ZIP files (to ARC, ARK, GIF or whatever) also occurs occasionally
  • Try to find a legal download of the extract utility (available for ARC, LHA, ARJ, …)
  • A semi-legal console UNACE for DOS is available
  • Use a trial to extract (WinRK, not available for DOS) :-(
  • Write to the sender/uploader and ask em why ey uses that archiver and if ey could send/upload a PKZIP 2.xx compatible ZIP or a 7-ZIP archive instead. Good to know: almost every archiver CAN create PKZIP 2.xx compatible ZIPs if you insist: in non-ZIP ones, one just has to deactivate their preferred format (RAR in WinRAR, ACE in WinACE), and in all of them (applies mostly to the ZIP ones, like WinZIP 9 to 15, less to the non-ZIP ones) make sure that the created ZIP file is PKZIP 2.xx compatible: no ZIP64 format, no Deflate64 or BZIP2 algorithms, no strong encryption.

8.  Executable compression

8.1  About

Executable compression is a controversial “technology”. It can make executables (possibly also DLLs, and even “drivers”) smaller (sometimes massively smaller), but there are disadvantages as well.

Pro’s:

  • Saves space on the disk (especially useful on floppies)
  • Faster loading (definitely true for floppies (except LZMA on 8086), debatable for hard disks)
  • Protects from corruption - if a packed executable gets corrupted, the built-in unpacker usually whines, revealing the problem; if the executable is not packed, there is frequently no integrity protection at all, and if such a corrupt executable is run “anything can happen” (hang, broken graphics, GPF’s & Co …)
  • Protects from cracking / patching (but not well, since everything can be unpacked, with official or unofficial unpackers or manually; also the UPX license prohibits such usage)
  • Reduces the size of packages (this assumes that the executable packer has a significantly better algorithm than the archiver, and that the archiver can’t take advantage of solid compression - may not always apply)

Con’s:

  • Floppies are dead, executable compression is obsolete
  • Hard disks are huge and the total space occupied by executables (+ DLL’s) is negligibly small
  • Slower loading (depends on hard disk and CPU performance, especially LZMA decompression on 8086 is desperately slow, may seem to hang)
  • Executable compression can increase CPU requirements (UPX by default creates decompression code for 80386, if your code is supposed to work or at least abort with an error message on 8086, the "--8086" switch is needed, otherwise it will just hang)
  • Conflicts with DEP (Data Execution Prevention) (Windows only, and some packers only, reportedly ASPACK is affected, UPX not)
  • Breaks memory management (Windows only, “broken” means increased usage of physical memory and decreased “overall system performance”, especially if you pack system DLL’s, but it still works correctly)
  • Virus complaints (false positives, and false / invalid bug reports to packer maintainers)
  • Packing doesn’t remove any bloat, it adds bloat of the unpacker, and just hides bloat of the compiler or programmer
  • Increases the size of packages - an archiver can take advantage of solid compression if there are multiple similar files (source + binary with the same texts inside, same startup/compiler overhead, multiple versions, …), OTOH if every binary is compressed separately (no way to do solid), the subsequently applied archiver will fail miserably
  • Wastes server space if the same package is provided twice with packed as well as with not packed binaries to make everybody happy
  • Licensing issues (combining the decompression stub with the program)
  • Prevents “looking” at the executable (version, texts) - finding out what it is supposed to be

8.2  UPX

A famous product, providing the possibility of decompression back to the original (“equivalent”, but still not necessarily byte-identical) file as an important official feature. Supports 16-bit DOS .COM and .EXE, .SYS drivers, .SYS/.EXE “combos”, 32-bit WATCOM/LE, DJGPP/COFF, Win32/PE and many other non-DOS formats. The UPX license prohibits (or tries to) any manipulation/cracking of the decompression stub, like hiding the usage of UPX or preventing easy decompression. Unfortunately, in PE files UPX stores some info in the PE header outside of the “official” fields, and applying PESTUB (see HX-DOS) to it (accidentally) has exactly this effect - it prevents easy decompression. The product license is “semi-GPL” - it uses a proprietary compression algorithm, NRV - alternatively, one can compile UPX oneself (on Linux only (?), very difficult), however only the weaker UCL algorithm is available then. Since version 3 (tested in 2.9xx versions), LZMA is also available - for big files LZMA is the best, for smaller ones (below cca 100 KiB) NRV is better. LZMA is available also for real mode and 8086 (don’t forget to specify the "--8086" switch), but decompression is very slow, so it’s doubtful whether this can be considered an achievement at all.
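
Typical invocations, as a sketch (switch names as documented by UPX; --best and --lzma can also be combined):

    upx --best program.exe          (strongest NRV compression)
    upx --lzma bigprog.exe          (LZMA, usually better for big files)
    upx --8086 tool.com             (keep the decompression stub 8086-compatible)
    upx -d program.exe              (decompress back to an "equivalent" file)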

8.3  APACK

APACK is a 16-bit real-mode DOS executable ( .EXE and .COM ) compressor by Ibsen Software / Jorgen Ibsen. The latest version is aPACK 0.99b from 2000-09-24. It is closed source, free for personal use (only). Some FreeDOS utils are compressed with it; there have however been hot discussions about whether it is legal / GPL-compliant or not to “link” APACK’s closed source, cca 160 byte (!!) “stub” with GPL’ed FreeDOS code, without final clarification :-D There is no official unpacker.

8.4  ASPACK

This product brings nothing good (commercial, Win32 only, no official unpacker); it is sort of “relevant” because of the troubles it causes when apps compressed with it are run using the HX-DOS Extender.

8.5  PEtite

Another Win32 PE packer (silly note: “petite” is the French word for “small”). Executables packed with it (example: “PHATCODE.EXE”) don’t work with HX-DOS, reason is unknown.

8.6  Unpackers, IUP

APACK unpacker

For .COM only, FASM source included. board.flatassembler.net/topic.php?t=7278

ASPACK-Die

Unpacker for ASPACK. Win32 GUI, does not work on DOS as of now.

IUP

Intelligent UnPacker is a generic unpacker for DOS .COM and .EXE, using the Debug/SingleStep CPU mode to trace the unpacking process, which allows saving the unpacked file afterwards. Supports PKLITE, APACK and many other DOS real-mode executable packers. Unfortunately it doesn’t work in FreeDOS (reason: wrong usage of the INT $21 / AH=$5A “create temp file” function, together with a lack of correctness of all (!) DOS kernels, “fixed” in later FreeDOS kernels by adjusting them to follow other kernels where IUP “happens to work” despite the bug), no problem in EDR-DOS. No project page, download from here:


9.  Multimedia compression

Info on Multimedia compression is available at GraphMediaTech.


10.  Faked compression

The topic of data compression fascinates many people. Among many more or less serious releases participating in the “compression race”, there have been a few that have to be called “faked compression”, either for fun or (commercially) for fraud. Those “products” pretend to achieve better compression than they actually provide, or than is even doable. The 2 ways to accomplish this are:

  • lossy compression (can “work” on multimedia data only)
  • hide the data somewhere (hidden files, unused clusters) and retrieve it from there upon “decompression”

It’s easy to check whether a compression product is “honest” or not. Lossy compression can be detected using some hash (MD5 for example; it must be the same for the original and the decompressed file); hiding data can be detected by transferring the compressed file to another computer and decompressing it there. If it decompresses happily on the same computer, but fails on the other one (reports an error, or the output is lossy or garbage), then this is strong evidence of data hiding. One example of a faked compressor of the “fun” category is BARF, and one of the “fraud” category is “Infima Archiver”.
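
The hash check, sketched with the common md5sum utility (any strong hash tool works the same way):

    md5sum original.avi
    md5sum restored.avi      (both sums must be identical, otherwise the "compressor" is lossy)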


11.  See also
