Virtual OS/2 International Consumer Education
VOICE Home Page: http://www.os2voice.org
September 2004


editor@os2voice.org


(PMM)BogoFilter - Customizable SPAM Filtering

By Walter Metcalf © September 2004

What is (PMM)BogoFilter?

BogoFilter is a spam-filtering system based on statistical analysis. The statistical method it relies on goes back to Thomas Bayes, an eighteenth-century English Presbyterian minister and mathematician. His theorem shows how to revise the probability of a hypothesis in light of observed evidence; applied to email, it gives the probability that a message is spam given the words it contains. To calculate this, one first needs observations based on a subset of the larger set. (This translates into the so-called training all Bayesian filtering systems require.) For the complete theorem and a discussion thereof, see http://en.wikipedia.org/wiki/Bayes'_theorem#Statement_of_Bayes.27_theorem. Please note that I am not a student of statistics, so the above attempt at paraphrasing Bayes' theorem may not be completely accurate!
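
For reference, the theorem is commonly written in the following form when applied to spam filtering. The symbols are chosen here purely for illustration: S stands for "the message is spam", H for "the message is ham", and W for "the message contains a given word".

```latex
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}
```

The quantities on the right-hand side are the ones that have to be estimated from mail that has already been classified.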

The development of modern anti-spam tools based on Bayes' theorem owes much to the work of Paul Graham. Graham has written several papers and given conference talks on the subject of spam. If you are at all interested in the methodology of spam and spam-killing, you should read his articles A Plan for Spam and Better Bayesian Filtering. These articles are not hard to read, and I found them fascinating.

Eric S. Raymond, co-founder and President of the Open Source Initiative, wrote a Unix program called BogoFilter based on Paul Graham's work, and contributed it to SourceForge.net. (The complete project URL is bogofilter.sourceforge.net.)

Fortunately for us, BogoFilter was ported to OS/2 by Yuri Dario, and then Doug Bissett developed a procedure which automatically installs BogoFilter into, and customizes it for, PMMail/2. The result, called PMMBogoFilter, is a sophisticated Bayesian anti-spam tool that runs automatically as you use PMMail/2 to read your email.

How is BogoFilter Different?

There are two general types of anti-spam filtering systems: Bayesian and rule-based. PMMBogoFilter is an example of a Bayesian system, and JunkSpy is an example of a rule-based system.

PMMBogoFilter differs from rule-based systems in many ways:

  1. PMMBogoFilter is free.

  2. With PMMBogoFilter (as with all Bayesian systems) every word in the message is used to help determine whether the message is SPAM. Rule-based systems, on the other hand, limit their analysis of the message to looking for the key words or phrases specified in their database.

  3. PMMBogoFilter uses statistics to determine which messages are SPAM. In a nutshell, this is done by keeping two lists:¹ one of words usually seen in SPAM, and one of words usually seen in HAM (i.e. messages that are not SPAM). As email messages are received, each word is compared with both the SPAM wordlist and the HAM wordlist. BogoFilter uses the relative counts for each word to compute a score between 0 and 1 (called SPAMICITY) and to set a Yes/No switch (called BOGOSITY). When the message has been completely processed, the BOGOSITY and SPAMICITY values are used to determine whether the message is SPAM.

  4. PMMBogoFilter, as with other Bayesian filters, requires training.

    1. I'll elaborate on this a little later. For now, it is sufficient to state that after you install PMMBogoFilter, you must train it as to what you consider good mail (HAM) and what you consider SPAM.

    2. This has two very important corollaries:

      1. To train PMMBogoFilter, you need to run a utility over large samples of email, and then run another utility periodically once it is installed. Thanks to PMMail's advanced features, and to Doug Bissett's clever design of the installation program, most of the periodic maintenance can be automated using freeware programs like Cron or commercial programs like Relish from Sundial Systems.

        • Even when the maintenance is automated, some routine manual maintenance and monitoring is still required to get the best results. Therefore, PMMBogoFilter may not be the best choice for neophytes or people who only receive a small amount of email and very little SPAM.

      2. PMMBogoFilter can be customized so that its database is closely matched to the kind of email you receive, as well as your needs and tastes.

        • By way of contrast, rule-based systems are usually sold by software houses that provide (sometimes for a fee) regular updates of the database. (This is necessary because rule-based e-mail filters have no system of built-in learning.) The great weakness of this approach is that the database and any updates must apply equally well to all customers of the company manufacturing the product.² Consequently, the database and its updates are highly unlikely to perfectly suit any single customer, such as you.

        • PMMBogoFilter, on the other hand, has a sophisticated built-in learning or self-training procedure. Together with the periodic training discussed above, it allows PMMBogoFilter to produce more accurate results than most rule-based systems by itself, without having to depend on generic updates from an outside source.
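
To make the two-wordlist idea and the need for training more concrete, here is a minimal sketch in Python. It is not BogoFilter's actual algorithm or code: the class name, the method names, the tokenizer and the way the per-word values are combined are all assumptions made for illustration, loosely following the approach described in Paul Graham's articles.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class WordLists:
    """Two word lists: how many SPAM and HAM messages each word was seen in."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        """Add one already-classified message to the matching word list."""
        words = set(tokenize(text))  # count each word once per message
        if is_spam:
            self.spam.update(words)
            self.spam_msgs += 1
        else:
            self.ham.update(words)
            self.ham_msgs += 1

    def word_probability(self, word):
        """Estimate how strongly a word points to SPAM; unseen words are neutral."""
        s = self.spam[word] / max(self.spam_msgs, 1)
        h = self.ham[word] / max(self.ham_msgs, 1)
        if s + h == 0:
            return 0.5
        # Clamp so that no single word is ever absolutely decisive.
        return min(0.99, max(0.01, s / (s + h)))

    def spamicity(self, text):
        """Combine the per-word values into one score between 0 and 1."""
        log_odds = 0.0
        for word in set(tokenize(text)):
            p = self.word_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)
```

Before any call to train(), every word is neutral and every message scores exactly 0.5, which is why a Bayesian filter is of no use until it has been shown examples of both kinds of mail.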

Installation of PMMBogoFilter

The definitive installation instructions are found in the main documentation file, PMMBogoFilter.html, which accompanies the program. (This document is also available on the Internet at http://www3.telus.net/public/bissett1/PMMBogoFilter.html.) The installation procedure automatically places a clickable icon for this document on your Desktop. Therefore, I will only provide a survey of the most important points of the installation to illustrate the setup. If you decide to install PMMBogoFilter, you must read PMMBogoFilter.html in its entirety at least once before trying to use the program.

  1. Requirements

    1. OS/2 or eCS.

    2. Sufficient hardware to run your version of OS/2 or eCS.

    3. A connection to the Internet, and a POP (i.e. not web-based) email account.

    4. Classic Rexx installed.

    5. PMMail for OS/2 v1.96a or v2.x must be installed.

    6. WarpIN, v1.0.1 or higher must be installed. This can be obtained from the XWorkplace.org web site at http://xworkplace.org/proj_warpin_download.html.

    7. The zipped archive of one of the specified versions (currently 0.15.3.2 or 0.92.4) of BogoFilter must be located in the same directory as the PMMBogoFilter file you plan to install.

    8. To use the latest version of BogoFilter (0.92.4), you must have a 32-bit version of TCP/IP installed.³ Yuri Dario is working to lift this restriction.

  2. Procedure

    1. If everything is set up as described above, then you can start the installation process by double-clicking on the WarpIN archive of the PMMBogoFilter you wish to install.

    2. If PMMail was correctly installed by its installer, then everything should go as planned, in which case PMMBogoFilter will be installed both on your hard drive and into PMMail correctly. However, there are several situations that will cause the install to skip a step. This is usually done to avoid any chance of data corruption, but it means you must check that everything is in place. Here is a summary of the changes to PMMail so that you can easily check them.

    3. Three types of objects are installed by the installer. They are:

      1. The program and data files in the PMMail directory and on the Desktop. There is usually no need to look at these.

      2. The entry in the REXX tab of the Settings | Rexx page of the account notebook.

      3. A set of four filters in the Settings | Filters page of the account notebook.

        Note that these items are installed once for every existing account (unless the installer encounters anomalies in your setup, as noted above).

    4. The following tables are reproduced for your convenience (with permission) from the documentation file at http://www3.telus.net/public/bissett1/PMMBogoFilter.html. They contain the information that should be in each PMMail account notebook.

      1. This REXX entry causes PMMBogoFilter to be invoked whenever mail is read.

        • If you want to see PMM_BOGOFILTER as it is running, check Execute Script in Foreground. However, due to a quirk in PMMail, this doesn't always work.

      2. This filter checks if BogoFilter rates the e-mail as almost certainly SPAM (a spamicity of 0.8 or higher)

        Description:     Bogo SPAM
        Type/Complex:    Complex
        Tests:           ((header="spamicity=1")
                         |(header="spamicity=0.9")
                         |(header="spamicity=0.8")
                         )
        Type:            Incoming and Manual
        Actions:         Delete Message, Local copy

      3. This filter checks if BogoFilter THINKS the e-mail is SPAM

        Description:     Bogo Filter
        Type/Complex:    Simple
        Search:          <Header>
        For:             X-Bogosity: Yes
                         No Connective
        Type:            Incoming and Manual
        Actions:         Move message to the BOGO SPAM folder
                         (or to any other folder you wish; the folder must exist before you can select it. If the installer did not create the Bogo Spam folder, you must create it yourself in each account, using PMMail.)

      4. This filter trains BogoFilter that a message is SPAM

        Description:     Train as SPAM
        Type/Complex:    Complex
        In text box:     header.subject="a"|!(header.subject="a")
        Type:            Manual
        Actions:         User hook (background)
                         (or, foreground to use DEBUG=1)
        In Command box:  D:\bsw-inc\user_tools\pmm_train_spam.cmd
                         (change to match your installation)

        Optionally, you can add a second (or more) action(s) to this filter (This is now the default):

        Delete Message, Local copy (puts it in the TRASH folder)

      5. This filter trains BogoFilter that a message is NOT SPAM

        Description:     Train as NOT SPAM
        Type/Complex:    Complex
        In text box:     header.subject="a"|!(header.subject="a")
        Type:            Manual
        Actions:         User hook (background)
        In Command box:  D:\bsw-inc\user_tools\pmm_train_no_spam.cmd
                         (change to match your installation)

        Optionally, you can add a second (or more) action(s) to this filter:

        Move Message, Inbox (or, anywhere else that you like)

        See the PMMBogoFilter.html document for details on how to fix any of the above statements that are missing or incorrect.
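
The logic of the four filter tests above can be sketched in Python as follows. This is only an illustration, not PMMail code: it assumes that a PMMail header test is a plain substring match and that the filters are tried in the order listed, with the first match deciding the outcome. The function names and the sample header lines used with them are made up.

```python
def header_contains(headers, needle):
    """Plain substring test against the raw header text."""
    return needle in headers


def bogo_spam_matches(headers):
    # Mirrors the "Bogo SPAM" tests: a spamicity starting with 1, 0.9 or 0.8.
    return any(header_contains(headers, "spamicity=" + prefix)
               for prefix in ("1", "0.9", "0.8"))


def bogo_filter_matches(headers):
    # Mirrors the "Bogo Filter" search for "X-Bogosity: Yes".
    return header_contains(headers, "X-Bogosity: Yes")


def train_filter_matches(subject):
    # Mirrors header.subject="a"|!(header.subject="a"): "A or not A".
    contains_a = "a" in subject
    return contains_a or not contains_a


def action_for(headers):
    """Return the action the first matching incoming filter would take."""
    if bogo_spam_matches(headers):
        return "delete"
    if bogo_filter_matches(headers):
        return "move to Bogo Spam"
    return "leave in Inbox"
```

Two points stand out. Because the match is on a prefix of the number, "spamicity=0.8" also catches 0.81, 0.85 and so on, so the three tests together cover every score of 0.8 and above. And the test used by the two training filters is true for every possible subject line, which is simply a way of making those filters apply to whatever messages you have selected.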

Use of PMMBogoFilter

If you have checked over the installation, and made any adjustments necessary, PMMBogoFilter will run automatically, and should require only minimal attention.

You should go now to the PMMail main folder, and highlight one or more messages in an account for which you have installed PMMBogoFilter. Right-click on the highlighted message(s), and note that there is now a new entry on the pop-up menu: Apply Manual Filters. Left-click on the arrow, and you will see a group of selectable items.



These new options provide the user with the ability to manually run the filters described above. The most common use of these options, however, will be to train PMMBogoFilter.

Depending on the exact settings of your filters, PMMBogoFilter will deposit messages in the Bogo Spam folder (usually created automatically by the installer), in the Trash, or in the Inbox. You can also have known SPAM deleted immediately, but this is not recommended.

Before letting PMMBogoFilter run automatically on your system, you need to train it on both SPAM and HAM (good email) you already have on your hard drive. The more email you use to train PMMBogoFilter the better (within reason), but the BogoFilter documentation states that 500 good and 500 bad messages are required for proper functioning. Select a group of messages you want classified as bad, and select the Train as SPAM option. Then do the same with a group of messages you want classified as good, and select Train as NOT SPAM.

You should check the Bogo Spam and Trash folders periodically for messages that PMMBogoFilter did not process correctly or to your satisfaction. If you find valid messages in either the Bogo Spam or Trash folders, then you should highlight the message(s), and select the Train as NOT SPAM option. If you find SPAM messages in the Inbox or other folders, you should highlight them, and select the Train as SPAM option. Doing this regularly will keep PMMBogoFilter running accurately and efficiently.

One last item you need to take care of, and I found this particularly important, is the word list (or database). Left to run automatically without attention, PMMBogoFilter will let the word list grow without limit. This slows down email processing, and can even lead to errors. Large word lists are mainly caused by redundant information. Doug Bissett has provided a utility, pmm_bogo_maint.cmd, to alleviate this problem. To keep the word list in optimal condition, this utility needs to be run on a regular basis. The frequency will depend to some extent on your own situation, but I have found it needs to run daily on my system. (You should try running it monthly or weekly and monitor both the size of the word list and any degradation of the email processing.) The best way to ensure the utility is run regularly is to use a scheduling program such as Cron or Relish. Cron is a freeware program that can be downloaded from Hobbes, and Relish is a sophisticated time management program available from Sundial Systems.
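
Keeping an eye on the size of the word list can itself be scripted. The following Python sketch is purely illustrative and is not part of PMMBogoFilter: the function name is made up, and both the path of the word list file and the size at which maintenance becomes necessary are assumptions you would have to supply for your own installation.

```python
import os


def wordlist_status(path, limit_bytes):
    """Return (size_in_bytes, needs_maintenance) for a word list file."""
    size = os.path.getsize(path)
    return size, size > limit_bytes
```

Run from the same scheduler that runs the maintenance utility, a check like this would show whether the interval you have chosen is short enough to keep the word list from growing between runs.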

The good news is that after training PMMBogoFilter properly, setting the maintenance of the word list to the optimal frequency, and giving PMMBogoFilter time to get used to my email stream, I am now getting far better SPAM-filtering results than I did with my previous commercial rule-based anti-spam tool. My PMMBogoFilter setup also runs with fewer problems than my previous tool did.


¹In the later versions of PMMBogoFilter both word lists are kept in one file.

²Some rule-based anti-spam systems include tools whereby individual users can make limited customizations of the database.

³OS/2 Warp and OS/2 Warp 4 were both shipped with 16-bit versions of TCP/IP (v4.0 or lower). If you have one of these OSes, and need to use BogoFilter v0.92.4, you must purchase the 32-bit version of TCP/IP (v4.1 or higher) from IBM. Visit Alex Taylor's web site at http://eddie.cis.uoguelph.ca/~alex/os2/fixpaks/netfixes.html for more detailed information.

If you have eComStation, you probably already have a 32-bit version. If you're not sure, then type SYSLEVEL at an OS/2 command line. If TCP/IP 4.1 is not installed, then download and install the latest convenience pack from http://www.ecomstation.com.

References:

Thomas Bayes: http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Bayes.html
Bayes' theorem: http://en.wikipedia.org/wiki/Bayes'_theorem#Statement_of_Bayes.27_theorem
Paul Graham: http://www.paulgraham.com/paulgraham/antispam.html
"A Plan for Spam": http://www.paulgraham.com/spam.html
"Better Bayesian Filtering": http://www.paulgraham.com/better.html
Eric S. Raymond: http://www.catb.org/~esr/
Sourceforge: http://sourceforge.net/
Bogofilter Project at Sourceforge: http://bogofilter.sourceforge.net
JunkSpy: http://www.junkspy.com
Cron: http://hobbes.nmsu.edu/pub/os2/util/schedule/cron.zip
Sundial Systems: http://www.sundialsystems.com
PMMBogoFilter documentation: http://www3.telus.net/public/bissett1/PMMBogoFilter.html
XWorkplace.org: http://xworkplace.org
WarpIN download: http://xworkplace.org/proj_warpin_download.html
Networking fixpaks: http://eddie.cis.uoguelph.ca/~alex/os2/fixpaks/netfixes.html
eComStation: http://www.ecomstation.com

