Text Filter Programs
The word filter, as I use it here, means a program that translates a file in one format into a file in a different format. Usually both of these files are in text format.
One filter I wrote is called mll. It does something I've never seen another program for (though it is fairly simple to write a short script to do it). The program measures the length (in characters) of every line in a file and reports the longest line and various other pieces of information. I use it because I want the programs I write to be readable, so I like to keep the lines shorter than the width of a page.
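The core measurement behind a tool like mll can be sketched in a few lines of C. This is only an illustration, not the actual mll source: the function name longest_line and its interface are assumptions made for the example, and it counts bytes, which matches characters only for single-byte encodings.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from mll itself.
 * Returns the length of the longest line in the stream, not counting
 * the newline. If line_no is not NULL it receives the 1-based number
 * of the first line with that length, or 0 if the stream is empty. */
long longest_line(FILE *in, long *line_no)
{
    long longest = 0;   /* best length seen so far */
    long where = 0;     /* line on which it occurred */
    long line = 1;      /* number of the line being read */
    long current = 0;   /* length of the line being read */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c != '\n') {
            current++;
            continue;
        }
        /* end of a line: keep it if it is the first or a new maximum */
        if (where == 0 || current > longest) {
            longest = current;
            where = line;
        }
        line++;
        current = 0;
    }
    /* a final line without a trailing newline still counts */
    if (current > 0 && (where == 0 || current > longest)) {
        longest = current;
        where = line;
    }
    if (line_no != NULL) {
        *line_no = where;
    }
    return longest;
}
```

Calling longest_line(stdin, &n) would report the longest line of whatever is on standard input. A tab counts as one character here, so tabs would have to be expanded first if the on-screen width is what matters.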
Another example is ctab which, like a thousand other programs, converts tab characters to an equivalent number of spaces. I wrote it because I couldn't find any conversion program that did everything I wanted, such as:
- Stopping conversion at the first non-whitespace character on a line.
- Converting tabs to spaces except inside quoted strings.
- Converting from spaces to tabs.
(Unlike conversion the other way, this process is not deterministic; there are many ways it can be done. This is described in the program's help text and options are provided to give the user some control over how the conversion is done).
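The tab-to-space direction, including the option to stop at the first non-whitespace character, might look roughly like the sketch below. It is not ctab's code: the function name expand_tabs, its parameters, and the fixed-interval tab stops are all assumptions made for illustration, and the quoted-string and spaces-to-tabs options are not covered.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from ctab itself.
 * Expands the tabs in one line of text (`in`, without its newline)
 * into `out`, which holds at most outsz bytes including the NUL.
 * Tab stops fall every `tabstop` columns; values below 1 are treated
 * as 8. If leading_only is nonzero, tabs after the first
 * non-whitespace character are copied unchanged.
 * Returns the number of characters written, not counting the NUL. */
size_t expand_tabs(const char *in, char *out, size_t outsz,
                   int tabstop, int leading_only)
{
    size_t ts = (tabstop < 1) ? 8 : (size_t)tabstop;
    size_t n = 0;        /* characters written, also the output column */
    int in_text = 0;     /* set once a non-whitespace character is seen */

    if (outsz == 0) {
        return 0;
    }
    for (; *in != '\0' && n + 1 < outsz; in++) {
        if (*in == '\t' && !(leading_only && in_text)) {
            /* pad with spaces up to the next tab stop */
            size_t pad = ts - (n % ts);
            while (pad-- > 0 && n + 1 < outsz) {
                out[n++] = ' ';
            }
        } else {
            if (*in != ' ' && *in != '\t') {
                in_text = 1;
            }
            out[n++] = *in;
        }
    }
    out[n] = '\0';
    return n;
}
```

With a tab stop of 4, the input "ab<TAB>c" becomes "ab", two spaces, then "c": the tab only fills the distance to the next stop rather than always inserting four spaces.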
I developed the filter framework this way:
Several years ago I realized I needed to do a lot of conversions of text files from one form to another. These were tasks like changing all the <TAB> characters in a file to a visually equivalent number of spaces, or finding the length of the longest line in a file so I could be sure that no line would be longer than the screen width or the width of the paper on which it was printed.
There were already programs (such as expand on *nix) to do some of these things, but I wanted these filters to run on operating systems other than *nix. In the case of expand I also wanted a text filter that was a little more versatile.
As I wrote more of these I noticed that the processing in all the filters followed a similar model:
- parse the arguments on the command line
- evaluate the option arguments
- loop through the remaining arguments
- treat the remaining arguments as text files
- loop through the text files
- open each file
- possibly open an output file
- process the file in a filter-specific way
- close the file
Except for actually processing the input, all the filters were doing the same thing.
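The command-line half of that model, separating option arguments from the file names that follow, can be sketched as below. The text does not say which option syntax the filters use, so the conventions here (options start with '-', and "--" ends the options) and the name first_file_index are assumptions for illustration only.

```c
#include <string.h>

/* Hypothetical helper: returns the index in argv of the first file
 * argument, or argc if there are none.
 * Assumed conventions: an option starts with '-', a lone "-" is an
 * ordinary argument, and "--" marks the end of the options. */
int first_file_index(int argc, char **argv)
{
    int i = 1;   /* argv[0] is the program name */

    while (i < argc && argv[i][0] == '-' && argv[i][1] != '\0') {
        if (strcmp(argv[i], "--") == 0) {
            return i + 1;   /* everything after "--" is a file name */
        }
        /* a real filter would evaluate the option argv[i] here */
        i++;
    }
    return i;
}
```

Everything from the returned index up to argc - 1 is then treated as a text file and handled one at a time.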
I decided to design and abstract the common actions into support routines so that for each filter I would only have to write a routine that actually processed the file.
I ended up with a model where the designer of a filter would only have to write two routines:
- the main routine
- a routine to actually process the file
The two routines have these responsibilities:
The main routine does not have much to do. It just receives the command line arguments (argc and argv) from the user's command line and passes them directly on to another routine called, oddly enough, filter().
The main routine actually passes three arguments to filter(). The first two are the original argc and argv and the third is a pointer to a user-written routine that processes an individual file.
The filter() routine arranges things so that the main routine does not have to open or close the file or do any other file system operations other than reading and writing.
The user routine just needs to process the file, usually line by line. In most filters it writes a file that is a transform of the input. In a few filters it does something different, such as determining the longest line in a file.
The user routine does not have to open or close any files. To make things easier the filter() routine opens each file specified on the command line and assigns it to the standard input (stdin). This way the user routine does not have to worry about file pointers or other file specifics. Each time the user routine is called it reads from stdin and writes to stdout. Stdin has already been redirected to reference the input file specified on the command line.
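One way this arrangement could be built in standard C is with freopen(), which reattaches stdin to a named file. The sketch below is not the actual filter() source. The callback type file_proc, its no-argument signature, the sample routine upcase, and the decision to skip every argument that starts with '-' are all assumptions made for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical type for the user-written routine: it reads stdin,
 * writes stdout, and returns 0 on success. */
typedef int (*file_proc)(void);

/* Sketch of a filter() driver. Option evaluation is omitted: any
 * argument starting with '-' is simply skipped (an assumption). */
int filter(int argc, char **argv, file_proc process)
{
    int status = EXIT_SUCCESS;
    int i;

    for (i = 1; i < argc; i++) {
        if (argv[i][0] == '-') {
            continue;
        }
        /* Reattach stdin to the next input file. */
        if (freopen(argv[i], "r", stdin) == NULL) {
            perror(argv[i]);
            /* stdin is closed after a failed freopen, so stop here */
            return EXIT_FAILURE;
        }
        if (process() != 0) {
            status = EXIT_FAILURE;
        }
    }
    return status;
}

/* Example user routine: copy stdin to stdout in upper case. */
int upcase(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        putchar(toupper(c));
    }
    return ferror(stdin) ? 1 : 0;
}

/* With these two pieces the main routine of a filter reduces to:
 *
 *     int main(int argc, char **argv)
 *     {
 *         return filter(argc, argv, upcase);
 *     }
 */
```

The user routine never sees a file name or a FILE pointer of its own; swapping upcase for a different routine gives a different filter with no change to the driver.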