DocumentFormatsViewers

1.  About documents

Documents are exchanged in many formats. Some are simple, but many, especially the recent ones, are complicated. All notable formats are publicly documented by now, but some only incompletely, or the specification carries a license restricting its use in some ways. Usually the specification is provided in “its own” format: PDF in PDF or DOCX in DOCX. Also, many newer formats have poor or no support in DOS.

2.  Document formats

2.1  Plain text - TXT

The simplest one: the file holds just the plain text plus “end of line” characters, and sometimes also “end of page” or TAB characters. Viewable with any text editor or viewer. One can use graphical or “semi-graphical” characters to create tables or simple graphs, but such texts are a pain to maintain, and usage of “non-standard” characters can cause viewing problems (see below). In the past (1980s and early 1990s) many “important official” technical documents, like Intel processor or VESA specifications, were published in TXT format, but later the creators switched to DOC or PDF. Despite the simplicity of this format, some pitfalls apply:

  • Text documents may be ASCII/ANSI or Unicode/UTF-8 (see below)
  • TAB usage dilemma
  • LF-vs-CRLF dilemma (good viewers and editors support both or can even convert, bad example: Windows NOTEPAD)
  • TXT-vs-DOC and file extension mess: the preferred file extension for plain text files is “.TXT” . In the past also “.DOC” was common, but it causes confusion with Word DOC and similar formats (see below). Other habits observed are:
    • “.DOC” for English and “.DOK” for German text
    • “.EN”, “.DE”, “.RU”, … - language specific extensions
    • No extension at all (“BUGS”, “TODO”)
    • Usage (or abuse) of the “extension” part for whatever additional information: “README.Win32”, “README.Compilation Problems” (note the space)
  • Nice or broken graphics and fixed-vs-proportional font dilemma (proportional fonts will break tables and graphics, even if text is built from “safest standard” characters only)
  • Line length and wrapping (“to-wrap-or-not-to-wrap-dilemma”) (some editors can get extremely (!) slow when facing unwrapped text with very long lines, or they can truncate such lines or crash)
  • Termination characters ($1A or ZERO): some texts do contain such characters (not only just at the end), and some editors can silently truncate the text due to them
  • Faulty characters (ZERO, 128…159 and others)
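
Several of the pitfalls above can be detected mechanically before a file is opened in a fragile editor. The following Python sketch counts line-ending styles, TABs, termination characters and bytes in the 128…159 range. The function name `scan_text` and the report keys are made up for this illustration, and the 128…159 count is only meaningful for single-byte encodings (in UTF-8 those byte values occur legitimately inside multi-byte sequences).

```python
def scan_text(data: bytes) -> dict:
    """Count a few troublesome byte patterns in a plain text file."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # LF without a preceding CR (Unix style)
        "lf_only": data.count(b"\n") - crlf,
        # CR without a following LF (old Mac style)
        "cr_only": data.count(b"\r") - crlf,
        "tab": data.count(b"\t"),
        # Termination characters that can make editors truncate the text
        "nul": data.count(b"\x00"),
        "ctrl_z": data.count(b"\x1a"),
        # Bytes 128..159: suspicious in ANSI/Latin-1 text, normal in UTF-8
        "c1_range": sum(1 for b in data if 128 <= b <= 159),
        # Very long lines can slow down or crash some editors
        "longest_line": max((len(line) for line in data.splitlines()), default=0),
    }
```

A result with both `crlf` and `lf_only` above zero indicates mixed line endings; a non-zero `nul` or `ctrl_z` warns that some editors may silently truncate the text.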

2.2  Embedding images

Most formats (except TXT) allow embedding images. This can be done in 3 ways:

  • Place images into separate files (HTML) - risk of “loss” of the files
  • Embed the complete image files (PNG, GIF, JPEG, …) “as-is” into the document (DOC, DOCX, ODT) - extracting them is possible by Un-ZIP’ping (DOCX, ODT, see below) or “file in file” search
  • Reuse popular image compression technologies, but don’t embed complete files (PDF) - extraction is difficult and needs special tools

A common problem with images inside documents is that they can display badly or not at all in some viewers, even though they are supposed to be supported. The reason is that most document formats (including “good” ODT) support a very wide range of image formats (PNG, JPEG, … but also WMF, EMF), each of them with many flavours (see GraphMediaTech). The software creating the document accepts almost “anything” and doesn’t bother to perform any validation or standardization, so the viewing software can face uncommon formats or flavours and fail to display the image correctly or at all. Another, even dumber reason for images not showing is that complete embedded files are misnamed or mislabeled, for example “blah0000.png” actually holding a JPEG file. The software creating the document didn’t bother to check or fix this; the viewing software may either look at the file content (JPEG) and feed it into the JPEG decoder (OK), or look at the extension (PNG) and dumbly feed it into the PNG decoder, resulting in a missing image or even worse (hang, crash).
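
The misnamed-file case can be caught by looking at the first bytes of the file instead of trusting the extension. The sketch below is only an illustration: the names `sniff_image` and `extension_matches` are invented here, and the signature table covers just a handful of raster formats (WMF/EMF and the many format “flavours” are not handled).

```python
from pathlib import Path
from typing import Optional

# Leading "magic" bytes of a few common raster formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"BM": "bmp",
}

# Extensions that name the same format.
ALIASES = {"jpg": "jpeg", "jpe": "jpeg"}


def sniff_image(data: bytes) -> Optional[str]:
    """Guess the image format from the file content, or return None."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None


def extension_matches(name: str, data: bytes) -> bool:
    """True if the file extension agrees with the sniffed content type."""
    ext = Path(name).suffix.lower().lstrip(".")
    ext = ALIASES.get(ext, ext)
    return sniff_image(data) == ext
```

A viewer (or a document creation tool) following this idea would pick the decoder from `sniff_image` and treat a failing `extension_matches` as a warning only.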

2.3  Rich text format - RTF

en.wikipedia.org/wiki/Rich_Text_Format

Created in 1987 as format version 1.0 for Word 3 and updated repeatedly; the latest version is 1.9.1, updated for Word 2007 / 12. It is far less popular than Word DOC and reportedly deprecated since Word 2010. There are some (limited) Win32 GUI viewers, but none usable in DOS. A specification is available, but reportedly delayed / incomplete.

2.4  Word DOC

en.wikipedia.org/wiki/DOC_(computing)

“ http://msdn.microsoft.com/en-us/library/cc313118.aspx ”

Used by MS Word from 1995 (??) until 2003 (??). Later versions can still read and write this format, but they prefer DOCX (see below); the old DOC format no longer evolves.

  • Created by MS Word / MS Office, payware, Windows only (+ partially MAC ??)
  • Previously DOC was a (different ??) closed document format for “Word Perfect” product
  • Viewable also by official “Word Viewer”, freeware, but closed source and Windows only (+ partially MAC ??)
  • Wordpad (??)
  • Open Office (non-DOS) supports it fully too (read and write), but doesn’t prefer it
  • Version differences (major change in Word 97 ??)
  • Privacy / metadata / deleted text
  • Has “security” features (see below)
  • Text stored mostly as plain (visible)
  • Images (file-in-file search)
  • ASCII/ANSI vs Unicode
  • Very popular
  • Publicly documented since 2008, previously only “on request”. Completeness ??

2.5  Powerpoint PPT, Excel XLS

(presentation, table)

2.6  Word DOCX - OOXML

en.wikipedia.org/wiki/Office_Open_XML

MS Word / MS Office 2007 brought a big change to the users: a switch from the “trusted” format “DOC” to the new “DOCX” AKA “Office Open XML” format (note the word order: “Office” precedes “Open”). New versions can still read and write the old “DOC” format, but they prefer the new one, which older versions don’t understand (but there is a “free” addon). The new format “looks completely new”, and the “outer” file structure is very different from old “DOC”. It’s just a “standard ZIP” (same as an ODT document, see below; DOCX-unaware file format detectors will report it as a ZIP file, see also Compress), holding several files inside several subdirectories. After unzipping, you can usually find the images included in the document in JPG or PNG format, or WMF/EMF (bad), but there is no file holding the “plain” or “almost plain” text. Because of the ZIP compression, DOCX documents are usually significantly smaller than the same content in old DOC format (which has no compression of text data or “overall”); on the other hand, this prevents separate solid or stronger compression of a set of (similar) documents. The format has been publicly documented since the beginning, and adopted as an ISO standard. There has been and still is much controversy about this format; the criticism includes the following points:

  • Very complicated and messy
  • The format submitted to ISO differs from the one implemented in Word 2007
  • Recycles bad ideas from old DOC format
  • Documentation is incomplete despite 6′500 pages
  • Inventing / adopting this “new” format was not needed at all, since ODT format had been there before, and was working well

Use DOCX2HTM (see below).
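
Since the container is a plain ZIP, the embedded images can be listed with any ZIP library. The following sketch uses Python’s standard `zipfile` module; the function names and the extension list are assumptions made for this example, and nothing is assumed about where inside the archive the images are stored.

```python
import io
import zipfile

# Extensions treated as images in this sketch (WMF/EMF included on purpose).
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".wmf", ".emf")


def is_zip_container(data: bytes) -> bool:
    """True if the bytes look like a ZIP archive (DOCX, ODT, plain ZIP, ...)."""
    return zipfile.is_zipfile(io.BytesIO(data))


def list_embedded_images(container) -> list:
    """Return the member names that look like image files.

    `container` may be a file name or a seekable binary file object.
    """
    with zipfile.ZipFile(container) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(IMAGE_EXTS)]
```

Calling `list_embedded_images("report.docx")` (a hypothetical file name) returns the archive paths of the images; `zipfile.ZipFile.extract` can then unpack them one by one.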

2.7  Powerpoint PPTX, Excel XLSX

(presentation, table) upgraded with Office 2007 too (??)

2.8  Open Office - ODT

http://en.wikipedia.org/wiki/OpenDocument

… use ODT2HTM (see below) … also may contain WMF/EMF …

2.9  Portable document format - PDF

About

wikipedia.org/wiki/Portable_Document_Format

  • Created by Adobe in 1993
  • Crucial feature: Portable - looks the same everywhere (as opposed to HTML “adjusting itself” to viewer’s capabilities or preferences)
  • Intended for printing, for “final” versions of documents, not easily editable (see below)
  • A PDF document consists of one or multiple pages; the page size can be almost anything and is specified as a physical size (in units of 1/72 inch by default, not in pixels); pages within a document usually, but not necessarily, have the same size
  • Document content consists of elements (pictures or parts of them, tables, graphs, fragments of text (one char to one line at most??), …), they all have fixed positions within a page
  • Converting PDF into other formats (TXT, HTML, DOC, …) is possible, but difficult and results may be “messy” for some incoming PDF files
  • Open since beginning, but Adobe keeps the right to “maintain the standard”, and includes patented technologies (LZW84, JPEG2000, RC4, …)
  • Has “security” features (see below) since 1.3 (??)
  • Official Adobe “Acrobat” PDF creator, payware and expensive
  • Official reader freeware, but closed source, limited platform support, and bloated
  • Adobe Reader for DOS just 1 version 1.0 from 1993, supports just PDF 1.0 (now almost useless), download: acrodos1.zip (mostly historical interest, or for PDF’s from Antiword (see below))
  • Various alternative readers, open source or closed source freeware, but no “official” DOS support
  • Some plain text inside, but usually “formatting notes” only, not the content !!! Content almost always compressed using Deflate algorithm (or LZW84, especially in old documents), see Compress
  • Uncompress with PDFCLEAN and recompress with 7-ZIP to save space (the achievable saving depends on the content and varies a lot)
  • Very bad support in DOS before 2011-Feb
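
To illustrate the point about compressed content: the sketch below searches a PDF for `stream … endstream` blocks and tries to inflate each one with zlib (Deflate). It is deliberately naive and only a demonstration; a real reader would follow the object structure and use the `/Length` and `/Filter` entries of each stream dictionary, and streams using other filters (LZW, JPEG, …) or encryption are simply skipped here. The function name `inflate_streams` is made up.

```python
import re
import zlib

# Naive pattern: everything between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed bodies of all Deflate-compressed streams found."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            # Trailing end-of-line bytes end up in inflater.unused_data.
            body = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not Deflate data (or damaged): skip it
        if inflater.eof:  # keep only streams that decompressed completely
            results.append(body)
    return results
```

Printing the returned byte strings for a typical text PDF shows the page content operators, which is where the actual text fragments live.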

Format history

  • 1993 PDF 1.0 , Acrobat 1.0 (also DOS reader)
  • 1994 PDF 1.1 , Acrobat 2.0
  • 1996 PDF 1.2 , Acrobat 3.0
  • 1999 PDF 1.3 , Acrobat 4.0
  • 2001 PDF 1.4 , Acrobat 5.0
  • 2003 PDF 1.5 , Acrobat 6.0
  • 2005 PDF 1.6 , Acrobat 7.0
  • 2006 PDF 1.7 , Acrobat 8.0
  • 2008 PDF 1.7 Adobe Extension Level 3 , Acrobat 9.0
  • 2009 PDF 1.7 Adobe Extension Level 5 , Acrobat 9.1
  • 2011 PDF 1.7 Adobe Extension Level 8 , Acrobat X
  • 2011…2017 PDF 2.0 (there will be no PDF 1.8, instead PDF 2.0 specification will be released primarily by ISO, delayed ??)

Sources of PDF

A PDF document can be created by typing in text (+ creating tables, graphs, selecting fonts, …), usually through conversion from editable formats (TXT, DOC, ODT, …) rather than editing PDF “directly”, or by scanning existing documents from paper (many scanners boast “one click PDF export”). The latter process, however, usually ends up in a heavily “inferior” PDF document: the scanner outputs many pixels, rather than the “objects” that are present in “original” PDF files. Such scanned PDFs are bloated (10 or more times bigger than the original) and have inferior quality, but they vary in the OCR effort made after scanning, see below about OCR.

Searching and text extraction

Searching is similar to text extraction, as it needs the text to be “internally and temporarily” extracted. An attempt to save extracted file may however fail due to “security features” (see below). Extraction of PDF files is also used by Internet searching services like Google. Usually searching and text extraction do give sufficiently useful results, but there are some limitations (“not found” despite it’s in there !?!?) due to the PDF file format design. While it may be obvious that problems can arise from:

  • Text inside embedded screenshots
  • Text inside embedded photographic images
  • PDF files full of scanned raster images with no or little OCR done (see below)

further limitations can occur due to:

  • Excessive text fragmentation, instead of a full line in one block, the text is split arbitrarily many times, even in the middle of a word
  • Custom non-standard or messy encoding (other than ASCII or Unicode)
  • Letter decorations (“accent”, “umlaut”) provided not through “native” character codes, but through “combining accents” or even separate PDF objects
  • Extra spacing hints inside text
  • Text fragments “messed up” (left to right and top to bottom is the common but not the obligatory way to sort them inside the file)
  • There are images containing parts of content that is “supposed to be text” . This does occur not only in scanned PDF’s, but also in “native” PDF’s created from editable documents. Reasons for this are unknown, seems to be a sort of “bad optimization” or “laziness” during the creation process. Situation “image or just icon inside a table” may trigger it, but it is not that uncommon that in some PDF files all tables or even texts inside a colour frame (for highlighting) are stored as images.

Some of those “flaws” (text fragmentation) can be “repaired” during extraction, but this is difficult and not guaranteed to succeed. Storing text in images is even worse: those parts will certainly be missing from the extracted text, and the search feature found in most PDF viewers will miserably fail to find anything inside them.
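
The “repair” of fragmented text can be pictured as follows: sort the fragments by position, group those sharing (almost) the same baseline into a line, and insert a space only where the horizontal gap is large enough. This is a simplified sketch with invented names and tolerances, not the method used by any particular extractor; it assumes each fragment comes with a position and width, with `y` growing downwards, and it ignores rotated text, columns and right-to-left scripts.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float       # left edge of the fragment
    y: float       # baseline, measured from the top of the page in this sketch
    width: float   # horizontal extent of the fragment
    text: str


def join_fragments(frags, line_tol=2.0, gap_tol=1.5):
    """Rebuild lines of text from positioned fragments."""
    lines = []
    for frag in sorted(frags, key=lambda f: (f.y, f.x)):
        # Same line if the baseline is close to that of the line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    out = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A small gap means the word was split; a large one is a real space.
            text += (" " if gap > gap_tol else "") + cur.text
        out.append(text)
    return "\n".join(out)
```

The tolerances depend on font size and document, which is one reason why such repairs are not guaranteed to succeed.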

Embedded images in PDF

PDF documents can embed images in various formats. However, PDF does not embed complete image files; instead it only reuses existing compression technologies (JPEG, PNG) and integrates them into the PDF format, so that a “file-in-file” search will never find anything. Tools exist to extract images from PDF files, but results may range from “good” to “surprising” or “messy”. While “native” PDF objects (text, tables, graphs, …) can be zoomed (almost) without limit and losslessly, embedded raster images need to be resampled upon viewing, and upon creation one has to choose a size (in pixels) for the stored image. This is a dilemma, as no size is actually “right”, and often PDF creation tools and their users don’t bother at all to pick at least a “somewhat sane” size. This may result in poor-looking documents if the size is far too small, or bloated documents if it’s too big. The latter problem may become prominent after extraction if an icon “suddenly becomes” a large image. But there are further “absurd” problems that may show up when extracting images and text:

  • A seemingly “single image” extracts into several fragments, which means that it was actually stored that way, for obscure reasons. Even worse, the fragments may overlap or have mismatching sizes. To work around this, one can render the complete page as an image at a high resolution, cut out the image, and scale it down to the preferred size. This will be a painful task for a PDF with many images.
  • There are duplicate or unused images.
  • The images are “too big” - they contain areas not visible in the PDF, cropped away upon displaying the page.
  • There are images containing parts of text (see above)

Editing PDF

PDF documents are difficult to edit. This is not a design goal, but a (potentially welcome) side effect of the design as a “final document” ready for printing. While there are some programs for Windows (“Foxit PDF editor”) showing what can be done and how, and exposing the “internal secrets” of documents, there is no known solution for DOS. If, for example, you insert a word, you (or your PDF editor) will have to “manually” and painfully move the remaining content (every single object) to make space for this word. Converting PDF into another format (HTML, TXT, …), editing it and creating a PDF again is possible but may cause various difficulties and “loss”.

PDF to HTML conversion

It is usually possible to convert PDF into HTML format, but it’s rather difficult and results are usually suboptimal. This is due to “inherently incompatible” designs of those 2 document formats. Actually, it is much easier to convert (a static document) HTML into PDF than PDF into HTML. Several solutions do exist:

  • Google has had a built-in PDF to HTML converter for years, otherwise it could not search inside PDF files. For some found documents it offers a HTML version called “Cached”, for many other PDF documents not, nobody knows why.
  • PDFTOHTML by meshko is obsolete and buggy, better use PDFTOHTML from XPDF 3.04
  • XPDF (see below) 3.04 contains a PDFTOHTML tool which works with HX

A common problem of such converted HTML files is that they are full of CSS and absolute positioning (while “native” HTML files almost never use absolute positioning), and display badly in DOS browsers. All three converters (Google’s, PDFTOHTML by meshko and PDFTOHTML from XPDF) exhibit this problem.

2.10  HTML and derivatives (MHT)

Arachne … images in separate files … messy CSS and Javascript … MHT used by Office and Opera (??)

2.11  Wiki

  • MediaWiki vs WakkaWiki
  • Also a document format - the “source” type
  • Local viewers ??

2.12  CHM

  • “Compiled HTML”, introduced in Windows 98
  • Set of HTML pages and additional data, compressed using LZX algorithm into 1 file, 7-ZIP can extract it, see Compress
  • In Windows viewed by the IE
  • Some viewers exist for Linux ??
  • http://www.nongnu.org/chmspec/latest/ Unofficial spec ??
  • FreePASCAL IDE can view it ??

2.13  TEX

en.wikipedia.org/wiki/TeX

A “source” type document format supporting text styles and mathematical formulas (strongly focusing on latter), can be found in some source code packages as the “source code” of the documentation, can be “compiled” into other formats like PDF, see also LaTeX .

2.14  FBHELP

Set of pages linked to each other, used for FreeBASIC only so far, DOS viewer exists, “compiled” from “WakkaWiki” format.

3.  Document security (DOC, DOCX, PDF & Co)

Some “advanced” formats (DOC, DOCX, PDF, PPT/XLS/PPTX/XLSX, ODT (??)) do provide “security” features. This includes actually 2 different things:

  • [1] Encryption of the document - a password is needed already to view it. This feature encrypts substantial parts of the document (text, images), and without the password, it is not decryptable and thus inaccessible. This is an alternative to encryption of the file using a separate program (ZIP, 7-ZIP). The benefit is “automatic” encryption and decryption just in the editor or viewer, and avoiding the need to store a decrypted file somewhere (risky), the flaw is, that various “metadata” (timestamps, author, size, amount of pages, …) may be not encrypted. This could be very secure if done properly (on developers side: good algorithm, correct implementation, sufficient key length), however many complaints do exist about the opposite (poor algorithm or implementation bugs (some Office versions only), crippled key length (40 bits, old versions of both office and PDF)). Such encrypted documents are for “internal use only”, you will hardly find any for download.
  • [2] Restriction management - the document author can specify what users may do with the document and what not. The restrictions can affect printing, editing (complete document or even per paragraph), extraction of text, images or other information, converting into other formats or filling in form fields or adding notes. Viewing is still possible, and a password exists allowing the author to bypass, change or remove the restrictions. The viewer or editor looks up permission bits and allows the user to proceed with some action or not. This “technology” is technically absolutely dysfunctional and useless. If the file format is known and the document can be displayed, then it can be decoded and decrypted (or isn’t encrypted at all) for any other purpose too, without any “external” password. The PDF file format has always been open, Word DOC was intended to be secret for years, but a popular publicly used format cannot be kept closed reliably and permanently. So anyone is free to develop an alternative viewer or editor that will simply ignore the restriction information and allow anything, or a tool that locates and removes it from the file. To make this a bit harder, the file formats are very complex (“security by obscurity”) and official documentation usually carries a license prohibiting such usage. The PDF documentation does mention this problem very well; nevertheless, for both DOCX and PDF, this “feature” is still getting enhanced, and the companies continue advertising it and selling an illusion of security for real money to customers not reading technical documentation or thinking about doability. Most publicly available documents don’t have such restrictions; a small minority does, typically documents that the public can view and print, but not edit or extract or convert, or such that one can’t even print. Somewhere in the document there will probably be a note pressuring the user to buy a CD (holding the same document but with printing “enabled”) or a printed book.
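
How little there is behind the “permission bits” of [2] can be seen by decoding them. In PDF the restrictions are stored as one integer of flags (the `/P` entry of the encryption dictionary). The bit numbers below follow the standard security handler as described in the PDF 1.7 reference, as far as recalled here; treat the table as an assumption and check it against the specification. The function and key names are invented for this sketch, which only reads the flags.

```python
# Bit positions (1 = lowest bit), assumed from the PDF 1.7 reference.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict:
    """Translate the signed 32-bit /P value into named yes/no flags."""
    value = p & 0xFFFFFFFF  # view the signed value as unsigned 32 bits
    return {
        name: bool((value >> (bit - 1)) & 1)
        for name, bit in PERMISSION_BITS.items()
    }
```

For example, `decode_permissions(-44)` reports printing and copying as allowed but modifying and annotating as denied. Whether a program then honours these flags is entirely up to that program.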

There exist tools (mostly Windows only, payware, and made in Russia) that simply remove restrictions [2], attempt to break the encryption [1] (good chance if the password or implementation is poor), or even try to find the (useless) password set to “protect” the restrictions [2]. They prominently advertise this capability (but the license allows usage on the author’s own documents only … ).

PDFCLEAN from MUPDF package does simply and silently remove possible restrictions from a PDF file it processes. This effect is not advertised anywhere. If the file is really encrypted [1], one can specify a password to decrypt it.

XPDF does respect possible restrictions, and allows specifying a password to bypass them. As documented on the project page, the author doesn’t appreciate mail on this subject and won’t help (“do-it-yourself-it’s-open-source”).

SumatraPDF (non-DOS viewer) respects restrictions too.

To “improve the security”, Adobe also introduced “LiveCycle PDF DRM technology” (unknown when exactly), using a sort of “online validation” when displaying the document. There is no way to view the document except by using the original Adobe reader on Windows with an open Internet connection. This one is harder to crack than the original permission bits; there were reportedly successful attempts to do so, but there is still no ready-to-use solution. Fortunately, such documents are very rare, example of an “evil” document: HTTPS://aps.avowsystems.com/…SampleCertifiedPDF_RM.pdf .

4.  Unicode and UTF-8

4.1  About

  • “very large and complicated thing”
  • invented in the 1980s as a replacement for ASCII in order to support all languages of the world
  • originally designed as 16-bit encoding, this turned out to be insufficient, now Unicode defines abstract codes only (integer without size), and there are several encodings for Unicode, UTF-8 is the most popular one
  • compatible (referring to abstract codes, not necessarily with encoding) with ASCII for codes 32…127 and with ISO-8859–1 for codes 160…255
  • over 100′000 characters defined and many control codes (“RTL” - right to left text, wide range of spaces with various sizes and extra behaviours, …)
  • after a long and painful evolution process, major OS’es do support Unicode to a large degree (filenames, console, copy&paste, …)
  • missing characters are a problem inherent to the design - while good old ASCII fonts almost always contained all characters, this is not true for Unicode fonts … various fonts contain various subsets of the “complete” (actually still growing) amount
  • used in the Internet (UTF-8 HTML), mainstream editable document formats, PDF files, and also “plain” TXT files

Many recent applications (web browsers, document converters) allow saving “plain text”, but it’s Unicode by default, or they even support Unicode (UTF-8) only. Most standard characters should still be readable by Unicode-unaware viewers, but there will be some “strange” characters too. Some documents even use “non-standard” flavors of common characters like dashes or quotes for no good reason; such characters will be “broken” too.
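
The effect can be reproduced in a few lines of Python. The sketch below shows what a Unicode-unaware viewer makes of UTF-8 bytes (every byte is shown as one character of a single-byte codepage), and offers a small replacement table for the typographic quotes and dashes mentioned above. Both function names and the choice of replacements are assumptions made for this example.

```python
# Typographic characters that often replace plain ASCII ones.
ASCII_FALLBACK = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
    "\u00a0": " ",                  # no-break space
})


def as_seen_by_legacy_viewer(text: str, codepage: str = "latin-1") -> str:
    """Show UTF-8 encoded text the way a single-byte viewer would display it."""
    return text.encode("utf-8").decode(codepage, errors="replace")


def ascii_fallback(text: str) -> str:
    """Replace common typographic characters by their plain ASCII cousins."""
    return text.translate(ASCII_FALLBACK)
```

Pure ASCII text passes through `as_seen_by_legacy_viewer` unchanged, an accented letter turns into two “strange” characters, and a curly quote into three.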

4.2  Support and problems in DOS

  • no support in the DOS kernel
  • can’t be displayed to the console
  • can’t be used for filenames (DOSLFN allows to use some/many extra characters, but it relies on codepages rather than Unicode, FAT file system with LFN itself supports the original 16-bit Unicode design)
  • no support by most text viewers and editors
  • UTF-8 HTML supported by recent versions of Arachne (but only the most common characters)
  • Blocek text editor (see Editors) supports UTF-8 “plain” TXT
  • PDFTOTEXT supports UTF-8, but needs the “-enc UTF-8” option
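
Where no Unicode-aware DOS viewer is at hand, a UTF-8 text can be transcoded into a DOS codepage beforehand (on another system, or in any environment where Python runs). The sketch below is one possible approach, not a tool mentioned elsewhere on this page: `to_dos_codepage` is an invented name, codepage 850 is just an assumed default, and characters missing from the codepage become “?” and are counted.

```python
def to_dos_codepage(utf8_bytes: bytes, codepage: str = "cp850"):
    """Transcode UTF-8 text to a DOS codepage.

    Returns the encoded bytes and the number of characters that had no
    equivalent in the codepage (those are replaced by "?").
    """
    text = utf8_bytes.decode("utf-8")
    encoded = text.encode(codepage, errors="replace")
    lost = sum(
        1 for ch in text
        if ch != "?" and ch.encode(codepage, errors="replace") == b"?"
    )
    return encoded, lost
```

A non-zero loss count shows the price of this route: unlike UTF-8, a codepage covers only 256 characters, so the choice of codepage has to match the language of the text.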

5.  OCR - Optical Character Recognition

OCR is the process opposite to printing: it converts an image (array of pixels) into (editable) text. OCR is much harder than printing, but doable. While poorly readable handwritten text can be read by (some) humans only, sufficiently big and good-quality printed text, or even well-readable handwritten text, can be processed by OCR with sufficient success. This means that most of the text will be correct, but it will still contain errors, because OCR programs usually can’t understand the “full context” of doubtful characters; their “understanding” is limited to a list of known words (and derivative forms of them) at best. Sources of doubt include: letter “O” vs zero “0”; the large amount of “letter decoration” existing in many languages (it can occur even in English text, in names of persons or locations); various text highlighting methods and habits (italic, bold, underline, …); additional letters and symbols (Cyrillic or Greek letters, math symbols, …); and the question of whether an area is text, an image, a table or just dirty paper (??). Optimal results can be achieved using interactive OCR only: the program processes the text and asks the user about doubtful characters or other decisions, the user sees the progressing result and may override any decision of the program, and the program may also be able to “learn” to some degree.

Simple (relatively) OCR programs may output just plain text (ignoring styles and sizes, very limited possibilities to preserve the layout), more sophisticated programs may want to “reconstruct” the layout and text styles too, so they need an output format allowing to hold such information (RTF, DOC, DOCX, ODT, PDF, …).

Many programs prefer or support only PDF, because this format is easier to re-create than editable formats like TXT or DOC. The PDF format itself is very complicated, but due to its design (absolute positioning of objects within pages) it’s indeed “very suitable” as output format for OCR, it is not “the professional choice”, but rather “the cheap choice for half-baked-OCR” in this context. The created PDF may look sufficiently well, but attempts to edit it, extract text or convert into other formats will most likely produce rather unusable results. Existing PDF files (found on the Internet) and scanning software do vary in the OCR efforts done after scanning:

  • No OCR at all. The PDF contains just images of pages, usually lossy (de-facto JPG); the PDF document serves as a container only. There is no text inside the document (or just a few bytes of garbage (??)), and the document is very bloated.
  • Partial OCR, automatic. The document does contain some text, but it’s full of errors, or incomplete. It looks OK when displayed (text fragments “merged” with images, or covered by them, text is “invisible” (??)), but still bloated, and text extraction produces an unusable mess, and image extraction may produce intact pages, but also broken/messy pages with some characters missing (!!).
  • If someone takes the pain of doing full OCR (interactive, manually fixing all errors, helping to recover the document structure, …), then the process will end up in a “proper” / “native” PDF document equivalent to the original.

6.  Programs

6.1  View

http://www3.sympatico.ca/mdrury/freeware.html

A universal (document) file viewer. The latest version is 17.2 from 2002–02–22, by Malcolm Drury; formerly shareware, freeware since 2004, but no source code.

A DOS based viewer for popular word processor and other file formats: Wordperfect, Word for Windows (all versions up to Word 2000 supported ??), DOS Word, Mac Word, WWW HTML-formatted files, ASCII and ANSI, Ami Pro, Wordstar, Windows Write / Wordpad, XYWRITE, Rich Text Format, Microsoft Publisher, and ClarisWorks. VIEW converts to and from ASCII / ANSI / UNIX, including save to UNIX format. VIEW has various text search and print functions. Text search parameters can be maintained across multiple files. Print and save to file functions include entire file, selected block, or just lines containing (or omitting) specified text.

Flaws:
  • Bad support for Word 2003 and later (program abandoned since 2002, in 2004 only subminor docs changes), and recent RTF flavors (try [CTL]-[F10])
  • [F3] and [F4] in directory view “broken”
  • View will be truncated for large documents
  • Text mode only, just one text size and style, no images
  • No support for PDF, DOCX, ODT

6.2  Antiword

Antiword is an open source converter for “old” Microsoft Word DOC documents by Adri van Os. It can convert documents from Microsoft Word version 2 (release date ??), 6, 7, 97, 2000, 2002 and 2003 (some are missing ??) into plain text (it even tries to reconstruct tables using the “wall” “|” character) or PDF (useless before 2011–02, now very useful, font styles and text formatting are partially preserved, some bugs left). The latest version is 0.37 from 2005-October-21. It is written in C, and besides support for many other platforms there are 16-bit DOS binaries and 32-bit DJGPP binaries. Both are compiled from the same source, but because of 16-bit limitations, the 16-bit version is inferior and can fail (refuse to work, or pretend success but give a partially broken result) on some bigger or more complex documents.

6.3  CATDOC & Co

http://www.wagner.pp.ru/~vitus/software/catdoc/

catdoc xls2csv catppt - converters for DOC, RTF, XLS and PPT - source code and 16-bit DOS binaries provided, abandoned in 2005 or 2006

catdoc is a program which reads one or more Microsoft Word files and outputs the text contained in them to standard output. It therefore does the same job for .DOC files as the unix cat command does for plain ASCII text files.

It is now accompanied by xls2csv, a program which converts Excel spreadsheets into comma-separated value files, and catppt, a utility to extract textual information from Powerpoint files.

6.4  Martha

RTF <-> HTML converters

6.5  DOCX and ODT converters by Ron

Ron ’s experimental converters (slow, using 7-ZIP):

http://www.ausreg.com/files/docx2htm.zip http://www.ausreg.com/files/docx2htm.txt http://www.ausreg.com/files/odt2htm/

6.6  XPDF

A project including a PDF viewer/reader (Linux only, not usable in DOS), as well as commandline tools for PDF info, conversion and extraction (portable).

http://foolabs.com/xpdf

ftp://ftp.foolabs.com/pub/xpdf (only FTP, and only latest version, author doesn’t provide any older versions for download)

ibiblio.org/…/xpdf mirror of DJGPP binaries of 3.03 and 3.02pl5 3.02pl1, no 3.04

en.wikipedia.org/wiki/Xpdf not much info

See far below for BTTR forum threads.

By Derek Noonburg AKA Glyph & Cog, LLC , open source, dual licensing (GPL 2 or 3 or commercial), written in C++ . Apparently an original work developed by the author from scratch using the PDF specification.

XPDF version history (see also foolabs.com/xpdf/CHANGES) :

  • 0.2 1995-Dec-12
    • earliest public release
  • 3.02 2007-Feb-27
    • had subminor versions “pl1” to “pl5”
    • last version with DJGPP binaries (excluding PDFTOPPM) released by author
    • added mostly support for PDF 1.6 and PDF 1.7
  • 3.03 2011-Aug-15
    • dropped DJGPP support and completely removed DOS from project page (but there are DJGPP binaries (excluding PDFTOPPM) compiled by someone else - Rugxulo)
    • added Win64 binaries (besides Win32 ones)
    • added PDFDETACH tool
    • implemented 256-bit AES decryption
    • console Win32 binaries are compiled with MSVC8/2005 and don’t depend on any MSVC*.DLL files, they do run with HX and on P1 processor (they do run on ME/98 too)
    • PDFINFO and PDFTOTEXT work well
    • PDFTOPPM does “run” with HX but outputs garbage (see below)
  • 3.04 2014-May-28
    • “new” text extractor, extra tuning options, removed the useless “simple HTML” feature
    • added PDFTOHTML tool
    • added PDFTOPNG tool
    • console Win32 binaries do run with HX and on P1 processor (they do NOT run at all on ME/98)
    • “new” PDFTOPNG tool just duplicates the previous “PDFTOPPM” and suffers from the very same problem (system fonts), it’s useless with HX
    • PDFINFO and PDFTOTEXT work well
    • PDFTOPPM and PDFTOPNG do “run” with HX but output garbage (see below)

Included tools:

  • PDFDETACH can extract subdocuments from a PDF file, OK with HX.
  • PDFTOPPM and PDFTOPNG (create raster images from PDF pages) have no DJGPP binaries and have never had any, and the Win32 binaries are not usable with HX (use MUPDF viewer or PDFDRAW instead).
  • PDFINFO and PDFTOTEXT are OK in DOS (HX or DJGPP port). Both support the “-enc UTF-8” option; the default is “Latin1” (AKA “ISO-8859-1”, “iso-ir-100”, “csISOLatin1”, “l1”, “IBM819”, “CP819”). Characters that cannot be converted into the output encoding are silently skipped (bad). PDFTOTEXT can output plain text (default) or (only 3.03, rather useless) simple HTML.
  • PDFTOHTML was added in version 3.04 and works with HX despite complaints, but the result does not look good in any of the DOS browsers, so it’s not that useful in the end (see above “PDF to HTML conversion”)
  • PDFIMAGES is OK with HX, but not very verbose and output may be “not as expected” (see above “Embedded images in PDF” why, problem is inherent to PDF document format and PDF creation tools)
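
The effect of the “Latin1” default described for PDFTOTEXT can be imitated without XPDF. This is only a sketch of the behaviour, not XPDF code: Python’s own encoder with errors="ignore" stands in for the “silently skipped” handling, and the sample string and function names are made up.

```python
def to_latin1_lossy(text):
    """Encode text as ISO-8859-1, silently dropping whatever does not fit."""
    return text.encode("latin-1", errors="ignore")

def lost_characters(text):
    """List the characters a Latin1 output would lose (code points above 0xFF)."""
    return [ch for ch in text if ord(ch) > 0xFF]

sample = "Größe 10 € – naïve"
print(to_latin1_lossy(sample))   # umlauts survive, euro sign and dash vanish
print(lost_characters(sample))   # what went missing without any warning
print(sample.encode("utf-8"))    # UTF-8 output keeps everything
```

A check like lost_characters on a UTF-8 extraction shows whether a document is safe to extract with the Latin1 default at all.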

Document restrictions (see above) are respected.

6.7  Ghostscript

http://www.ghostscript.com/Ghostscript.html

http://en.wikipedia.org/wiki/Ghostscript

http://www.glennmcc.org/apm/gs705dos.zip Ghostscript DJGPP binary (obsolete)

6.8  MUPDF

MUPDF is a derivative work of Ghostscript for viewing and processing PDF files, dual licensing: GPL 3 or commercial license.

Project page: mupdf.com

Download archive: http://mupdf.com/downloads/archive

See far below for BTTR forum threads.

MUPDF version history (see also mupdf.com/news) :

  • 0.6 2010-Jun-02
    • ??
  • 0.7 2010-Aug-24
    • has DOS ports by RayeR and Georg
    • slightly inferior image rendering quality compared to 0.8
    • several “crashing and exiting” bugs
    • some memory leaks; needs more memory than necessary and more than later versions do. The impact is moderate on most PDFs, but on a few documents the memory waste is rather extreme, and the DOS port may crash while the Win32 binary doesn’t (or at least not immediately)
  • 0.8.165 2011-Apr-29
    • slightly better image quality (some fonts and raster images)
    • fixed those “crashing and exiting” bugs; the program tries to “do the best” even on broken PDF files, for example it outputs a page with one image missing rather than an empty page or a crash
    • mostly fixed the memory leaks
  • 0.9 2011-Sep-05
    • Win32 PDFDRAW binary works on DOS with HX
  • 1.0 2012-Apr-24
    • ??
  • 1.9 2016-Apr-18 and 1.9a 2016-Apr-26
    • latest, more bloat, untested

In 2011-02 RayeR ported MUPDF 0.7 (the latest version at that time) to DOS, providing a DJGPP binary of a true reader with GUI output (VGA or VESA), the first one since the desperately outdated official Acro 1.0. Later Georg compiled another (probably a bit better) binary.

So far there are 2 DOS viewer binaries, both based on MUPDF 0.7:

  • RayeR’s one, compiled 2011-Feb using raw VESA
  • Georg’s one, compiled 2012-May using the Nano-X AKA NXlib GUI library nanox…mupdf.zip

Georg’s binary is only marginally bigger; both are about 3.5 MiB UPX’ed and 7.5 MiB plain, which is good compared to other recent PDF viewers. Note that MUPDF does NOT depend on “system fonts”, unlike for example XPDF or SumatraPDF.

Observations common to RayeR’s and Georg’s binaries:

  • Crashes or aborts on some pages: the program reliably crashes on a particular page of a particular document, which simply means that the affected page cannot be viewed at all with this version, not even when requested directly from the commandline. The “risk” of a crash depends on the document content and on the tool used to create it, and it can be pretty painful. The official MUPDF 0.7 Win32 binaries have this problem too; it is fixed in 0.8.165. MUPDF 0.9 does not suffer from this bug, so preferably use PDFDRAW from MUPDF 0.9 and a picture viewer to access those pages.
  • Crashes (rarely) on some documents or pages of those, while the official MUPDF 0.7 Win32 binary does not; this nevertheless seems to be a problem of the 0.7 codebase rather than of the DOS ports (see above “MUPDF version history”)
  • No search (“/” command in Win32 versions, missing in RayeR’s binary and desperately broken in Georg’s one, use PDFTOTEXT from XPDF)
  • Loading a file may take a very long time with no progress indicator. This is not a problem for 99% of sane PDF files on the internet, but for maybe 1% of files (rather big ones) loading takes far longer than one might guess from the file size or page count. Later switching between pages and “goto” (Georg’s binary) are usually not slowed down
  • A few pages of a document may take much longer to display than the others; the program seems to hang while it actually doesn’t
  • Slightly inferior picture quality compared to MUPDF 0.9
  • Page number can be specified on the commandline: MUPDFD BLAH.PDF 100
  • Use arrow keys to scroll within a page and [PGUP] and [PGDN] keys to switch between pages, [+] and [-] to zoom, [F1] minimal help, no mouse control
  • Zoom range is restricted to 54…300 dpi (some other viewers have “no limit” but crash if you try)
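
To get a feeling for the zoom limits: PDF page sizes are given in points (1/72 inch), so the pixel size of a rendered page grows linearly with the dpi value and the memory for the bitmap grows with its square. A minimal sketch, assuming an A4 page of 595 x 842 points, an uncompressed 24-bit RGB bitmap and the 54…300 dpi range quoted above; the function names are invented for this example and are not part of MUPDF.

```python
POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch

def clamp_dpi(dpi, low=54, high=300):
    """Force a requested resolution into the viewer's zoom range."""
    return max(low, min(high, dpi))

def page_pixels(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)

def rgb_bytes(width_pt, height_pt, dpi):
    """Size of the page bitmap, assuming 3 bytes per pixel (an assumption)."""
    w, h = page_pixels(width_pt, height_pt, dpi)
    return w * h * 3

for dpi in (54, 72, 150, 300):
    w, h = page_pixels(595, 842, dpi)
    kib = rgb_bytes(595, 842, dpi) // 1024
    print(dpi, "dpi:", w, "x", h, "pixels,", kib, "KiB")
```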

Observations specific to RayeR’s binary:

  • RGB-vs-BGR-BUG (not noticeable if the document contains just black text)
  • No horizontal scroll (this can be pretty painful, [ARROW-LEFT] and [ARROW-RIGHT] keys duplicate [PGUP] and [PGDN])
  • No “search”
  • No “goto” (use commandline)
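
The RGB-vs-BGR bug listed above means that the red and blue channels end up swapped on screen. A sketch of the effect, assuming 24-bit pixels stored as three bytes each (the actual framebuffer layout used by the port is not stated here, and swap_red_blue is a made-up helper):

```python
def swap_red_blue(pixels):
    """Swap byte 0 and byte 2 of every 3-byte pixel (RGB <-> BGR)."""
    if len(pixels) % 3:
        raise ValueError("expected 3 bytes per pixel")
    out = bytearray(pixels)
    # both right-hand slices are read from the unmodified input
    out[0::3], out[2::3] = pixels[2::3], pixels[0::3]
    return bytes(out)

red = bytes([255, 0, 0])
black = bytes([0, 0, 0])
print(swap_red_blue(red))     # pure red comes out as pure blue
print(swap_red_blue(black))   # black, white and grey are unaffected
```

This is why the bug goes unnoticed on plain black-on-white text and only shows up on coloured figures or photos.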

Observations specific to Georg’s binary:

  • No docs included and binary does not print any version or port info (RayeR’s one does)
  • An environment variable like “SET NANOSCR=1024 768 565” can be set to tune the screen resolution, but the program runs even if it is not set
  • There is a mouse arrow but program can’t be controlled by mouse, also mouse arrow does not react while program is loading a “tough” document or processing a “tough” page (be patient!)
  • Can scroll the page horizontally and vertically (use arrow keys)
  • Page can be scrolled “too far away” on right and bottom (not a big problem)
  • Program leaves behind broken screen if it fails to display given file for whatever reason (file not found, file is not PDF, file is corrupt, file has “security features”), or after some of the page-specific crashes (see above)
  • If file contains pages with different sizes (rare), then the viewer can leave behind garbage on the screen after page switch (not malicious, and scrolling or zooming will clean it out)
  • “search” desperately broken
  • “goto” works (type decimal number (not visible anywhere) and then “g”)

From version 0.9 on, all commandline tools work in DOS, but nobody has got the GUI viewer working. In versions 1.2 and 1.3 the commandline tools got merged into one; minimal tests in DOS were done (seems to work). Version 1.9 is out (untested).

  • MUPDF.EXE - viewer with a minimal GUI
  • PDFDRAW.EXE - convert PDF into pixel images of full pages (OK with HX) (pdfdraww.zip). Can also extract text, but this is buggy (version 0.9, memory leak ??), so better don’t drop PDFTOTEXT (from the XPDF package, see above)
    • Syntax: DPMILD32 PDFDRAWW -o xx%03d.png xx.pdf 1-20
    • Pitfalls:
      • “.png” is case sensitive, so “.PNG” won’t work
      • page numbers used by the program usually don’t match the numbers printed on the pages
    • Further interesting options:
      • “-r” resolution / size
      • “-m” progress indicator
  • PDFCLEAN.EXE - rewrite PDF, remove restrictions, fix broken PDF (??), optimization features (buggy in 0.9), decompress
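
The “-o xx%03d.png” pattern in the PDFDRAW syntax line is a printf-style template: the page number is filled in as a zero-padded 3-digit decimal. A small sketch of the file names to expect for a page range, assuming PDFDRAW expands the pattern the way C printf does (the helper output_names is made up for this example):

```python
def output_names(pattern, first, last):
    """Expand a printf-style pattern for every page in first..last."""
    return [pattern % page for page in range(first, last + 1)]

names = output_names("xx%03d.png", 1, 20)
print(names[0], "...", names[-1])
print(len(names), "files")
```

Since the numbers come from the position of the page in the file, they will not match the page numbers printed in a document that has, for example, an unnumbered cover or a roman-numbered preface.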

6.9  APM’s for PDF (hacky and slow, obsolete)

http://www.glennmcc.org/apm/ Many APM’s (some do address PDF)

http://www.glennmcc.org/apm/ara-pdf.apm

http://www.glennmcc.org/apm/gs705dos.zip Ghostscript DJGPP binary

http://www.glennmcc.org/apm/pdfhtm.apm

6.10  OCR programs

OCR (see above) in DOS is a potential problem, because:

  • OCR programs are frequently bundled with scanners and are “part of the scanner driver”, making them unusable in DOS
  • For good results interactive OCR is needed, and interactivity needs a GUI

Nevertheless there are at least 2 somewhat working and useful OCR programs for DOS (well, ported to DOS). They are both commandline, non-interactive (there are tuning options), not very fast and they output plain text (TXT).

7.  Online converters (any usable among them ??)

  • Free without registration, free with registration, or paid
  • Obviously they need HTML form file upload (not supported by Arachne and DOSLynx)
  • Many need also Javascript or even Flash
  • Can convert DOC or DOCX or PDF into TXT, DOC, DOCX, HTML, Wiki or remove “security” restrictions (see above)

8.  External resources

BTTR forum threads:

  • id=9473 “PDF readers for DOS” 2011-Feb (very long, XPDF and MUPDF)
  • id=9558 “PDF DOC DOCX ODT RTFM - document formats please help to comp” 2011-Feb (about documents)
  • id=11498 “XPDF 3.03 (DJGPP)” 2012-May (PDF, XPDF 3.03 DJGPP port, Georg’s MUPDF port, crashes)

9.  See also

Page last modified on August 03, 2016, at 07:22 AM