Multiplying Goalposts
Some years ago I stopped following the USENET group news.admin.net-abuse.email. However, like many people who do what I do, I find it somewhat necessary to follow for work-related reasons. So, I’m reading it again.
There is an interesting discussion going on now that is rather telling for people in the industry. It centers around what words mean. This isn’t a case of “I say ‘tomato’ and you say ‘tahmato’”, but rather that some people are calling apples oranges and others are calling apples cherries.
The root cause of this is a big fight currently going on between Al Iverson and Matthew Sullivan. Al is compiling stats of DNSBL accuracy rates. One of the groups that didn’t turn out as well as they thought they should have was Matthew’s SORBS list.
Let’s start with the definition of “spam.” NANAE is currently sporting at least 4 different definitions of spam.
We start, of course, with Al’s definition of spam. Al seems to define spam as “mail that comes to my spamtrap addresses which was unasked for”. We can generalize this to “unsolicited bulk email”. Under this definition, mail sent to the spamtrap address is, by definition, unsolicited and bulk (since the address is not in use for 1-to-1 email) and is counted as spam with some minimal processing to remove backscatter.
Matthew Sullivan, on the other hand, defines spam as “anything that is considered spam under the [Australian] Spam Act 2003 is spam” (given that Matthew is not an attorney, barrister, or solicitor, we should hasten to add “in his opinion”). We can generalize this to “mail which violates some legal standard”. This can quickly become sticky as a blocking mechanism open for use across countries. What violates the legal standard of the Spam Act of 2003 does not necessarily also violate the legal standard of the CAN-SPAM Act here in the United States.
Then we have another definition of spam stated by Laurence F. Sheldon, Jr., as all mail which comes from “a source most blacklist users would identify as spam-source under the ‘Boulder Pledge’ or a similar notion.” We’ll call that the Justice Potter Stewart “I know it when I see it” standard. Please note that Mr. Sheldon also states elsewhere that spam is “unsolicited bulk email” so we need to combine these two into “the group knows it when the group sees it”. This has the benefit of possibly allowing for some form of “bulk” to be recognized, but has the drawback of one person within the group making a claim which is then recognized as authoritative by the entire group even with no one else’s concurring experience. And then there is the problem of defining who is in “the group”.
Finally, we see this (inverse) definition of spam given by Chris Lewis in response to this quote: “It’s mail that he signed up for, making it solicited and thereby not spam.” Chris says: “Which is not representative of the email that users want.” So, we can call this definition of spam “mail my users (as a group) do not want.” For the end user, this transforms into “mail that I do not want”.
Now, is one of these definitions any better than the others? That’s not really for me to say, although I cut my teeth with “unsolicited, bulk email”. But, as a professional in the space, it’s important that we take time to understand what people are talking about when they use certain terms. When someone clicks a “This is spam” button, are they saying that this is “unsolicited, bulk email” or are they saying that this is “mail which I do not want (anymore)”? For me, this is a critical consideration. If it means “unsolicited, bulk email” then I have a client with a real problem on their hands. If it means “mail which I do not want” then all that’s called for is unsubscribing the user.
This fundamental problem of using different dictionaries means that we will never find a solution to “the spam problem” as long as we can’t decide on what “spam” really is. It’s a problem that my friend Laura once called “moving goalposts”. It’s an apt description, but I think that we may need to change that to “multiplying goalposts.” I’ve pointed out above that in a single thread in a single newsgroup we can identify at least four different, contemporary definitions for the same word. The goalposts haven’t really moved at all. It’s just that there is now a new set out there in addition to the old ones.
We can also look at the definition of “false positive” in the same thread. A false positive is a medical term generally defined as “A result that is erroneously positive when a situation is normal.”
There are two definitions for false positive given in this post: “That which is listed, but doesn’t meet the list’s criteria” and “mail that I wanted which got blocked”. Al Iverson’s DNSBL stats site defines “false positive” for his usage as “false positive would be something I likely did sign up for and then forgot about”. Huey Callison, on the other hand, gives “a nonspam mail blocked by a spam filter” as the definition.
Again, we have four contemporaneous definitions within the same thread. They’re all in use. So, what, exactly, is a “false positive”? It’s going to depend on who you are talking to and perhaps the context of the discussion.
Ultimately, if the “spam problem” is going to get fixed, we’re going to have to decide on a single set of definitions (goalposts) for whatever it is that we are talking about.
MickC @ November 8, 2007



Many ISPs have accepted that “mail our users don’t want” correlates very strongly to spam, but I agree that using that as a definition fails in both directions – as one large ISP staffer has said, “Regardless of whether your mail is confirmed opt-in or not, if my users have indicated that they don’t want it, I am disinclined to argue with them”.
The redefinition of ‘false positive’ as anything other than “nonspam mail blocked as if it were spam” is what I think bothers me the most out of any of this little bit of linguistic imprecision, because effectively no one is using SORBS for the purpose of determining whether or not it’s functioning according to its published listing criteria – they’re using it to block spam. Whether it is run dishonestly, or incompetently, or out of spite or malice – all of those things are completely immaterial. What matters is “Does it block spam, and not block not-spam?”
I certainly don’t deny them the use of whatever metric they choose. And I am also inclined to agree with not arguing about the level of opt-in used to send email to people who express some preference about not wanting it anymore.
What bothers me is calling your (generic) redefinition “spam”. Our problem is that we now have competing definitions for the same term, and people are using (all of) them carelessly.
That’s also what bothers me about redefining “false positive.” People have a certain expectation about using a DNSBL (high “true positive”/low “false positive”). But, unless they’re happy with Matthew’s definition of “false positive” (which I take to be “is listed but does not meet listing criteria”), that may (and probably will) not meet that expectation.
To some degree we have to live with this. But, in the long run, unless we can agree on definitions, we can’t really do anything more than a rearguard action. We won’t make any progress, and will continue to slowly lose ground.
What bothers me is calling your (generic) redefinition “spam”.
Er, no. I think the only sensible definition of spam is email that is both unsolicited and bulk, although I understand why large ISPs have chosen to focus their attention on “mail our users don’t want” – because that’s the only obvious useful feedback they have.
Short of adding another step to confirmed opt-in, involving cryptographically secure tokens and a trusted third party [1], there’s no way to determine affirmative permission after the fact, only the lack of permission. And even then, only in obvious cases, e.g., mailer-daemon obviously did not solicit your newsletter. So, while you can detect ‘bulk’, you can’t really detect ‘unsolicited’ most of the time, so you go with the next best thing: email that a significant number of users have clicked “this is spam” on.
[1] This will never happen. If it does, no one will use it.
We’re not in disagreement.
I meant that as a generic “you”, not to say that you have redefined the term. That is, when someone redefines the term, I have a problem with them continuing to call it what it really no longer is. That’s why we are where we are today.
And ISPs have to have operational definitions of things. The problem is when the operational definition isn’t the actual definition.
We need to have a single, clear definition of spam. Maybe we even settle on someone’s operational definition (so long as that isn’t “mail my users don’t want”). Then we need to stick with it.
“False positive” is not a medical term. It’s a statistical term – a more memorable description for what’s also known as “type 1 error” or an “alpha error”.
The context in which it can be used is when you’re trying to decide between two exclusive hypotheses for each of a number of items by performing a test on each item. You define one hypothesis, the “null hypothesis”, as the default state that you assume an item is in, unless your tests suggest that the other hypothesis, the “alternative hypothesis”, applies.
For each item you then apply a test. The test may return “negative”, meaning that the test suggests that for the item the null hypothesis applies, or it may return “positive”, meaning that the test suggests that the alternative hypothesis applies.
In the case of spam filtering the items being categorised are individual emails, the null hypothesis is that the mail is “not spam” and the alternative hypothesis is that the mail is “spam”.
The important thing here is that the test is performed for each email. So any attempt to use the term “false positive” in any sense other than for one individual email is probably meaningless, and certainly outside the standard usage of the term.
In the case of source IP based blacklists that means that the test needs to be performed on each email. For each email the test may return positive (“spam”) or negative (“not spam”). It is meaningless to talk of an entry in a blacklist as being a positive or a negative test result, so it’s meaningless to talk of an entry being a false positive. “This listing is likely to lead to false positives”, sure, that’s a perfectly sensible statement, but “this listing is a false positive” is completely meaningless.
The only thing that can be meaningfully looked at is the result of the test as applied to an email. If the email is wanted, or requested or solicited or however you choose to define “not spam”, and the email was categorised as “spam” by the blacklist, then that is a false positive. Even if the source IP address sends a huge amount of “spam” and tiny amounts of “non spam”, when the test is applied to a “non spam” email and returns a positive result, that’s a false positive.
(It’s certainly possible to attempt to redefine “spam” such that any email from a given IP address is always “spam” or always “not spam”, regardless of the content of the email, whether it was requested or not, whether it was wanted or not (a variant of this is to define “spam” as “mail from any IP address listed on this blacklist”). But doing so is redefining the term so far from what it is normally taken to mean that it steps well outside the bounds of reasonable definitions to dishonest logic-chopping.)
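To make the per-email framing concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name tally, the sample messages, and the IP addresses (taken from the ranges reserved for documentation) do not come from anyone's real data or from any DNSBL's actual interface. The test is simply "is the source IP on the list?", applied once per email, and each email's "spam" label is whatever ground truth you have chosen.

```python
from collections import Counter

def tally(messages, listed_ips):
    """Count per-email outcomes of an IP-blacklist test.

    messages: iterable of (source_ip, is_spam) pairs; is_spam is the ground
    truth under whichever definition of "spam" you have settled on.
    listed_ips: set of IPs on a hypothetical blacklist.
    """
    counts = Counter()
    for source_ip, is_spam in messages:
        positive = source_ip in listed_ips  # the test result for this one email
        if positive:
            counts["true_positive" if is_spam else "false_positive"] += 1
        else:
            counts["false_negative" if is_spam else "true_negative"] += 1
    return counts

# Made-up sample: the listed IP sends mostly spam plus one non-spam email.
messages = (
    [("192.0.2.10", True)] * 4
    + [("192.0.2.10", False)]
    + [("198.51.100.7", False)] * 3
    + [("198.51.100.7", True)]
)
listed_ips = {"192.0.2.10"}

counts = tally(messages, listed_ips)
# False positive rate: share of non-spam emails that the test flagged.
false_positive_rate = counts["false_positive"] / (
    counts["false_positive"] + counts["true_negative"]
)
print(dict(counts), false_positive_rate)
```

In this sample the listed address sends four spam emails and one non-spam email. The listing may well be justified, yet the one non-spam email still counts as a false positive, because the count is taken over emails and not over list entries.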
So, looking at the phrases you give above that people have presented as possible definitions of a “false positive from a spam filter”
Huey’s definition (“a nonspam mail stopped by a spam filter”) is clearly correct in general.
Al’s definition seems entirely accurate within the bounds of his experimental design.
“mail that I wanted which got blocked” is clearly correct, if you define “non spam” mail as “mail that I wanted” – which is tricky to measure mechanically, but does match most people’s usage of the term.
“That which is listed, but doesn’t meet the list’s criteria” is clearly wrong, for the reasons discussed above. It’s talking about false positives in the context of blacklist entries, not individual emails. It would be a perfectly good definition of “bad listing”, but it cannot be used as a definition of “false positive” in a spam filtering sense – and if you try, and try to reason from there you’ll end up with contradictions and meaningless results.
Oh, I forgot to say that the quote you have from Al is not talking about false positives in spam filters at all. It’s talking about non-spam email making it into his list of emails that is supposed to be 100% spam. So his use of it is accurate in the context of his experimental design, but not directly relevant to false-positives-in-spam-filters.
The problem I am having is this:
I don’t see how anything useful can come from reporting statistics about “false positives” and so on if the design-intent of the blacklist in question is not considered.
(And by the way, the Wikipedia entry resembles what I thought I learned in statistics about Type I and Type II errors.)
If a blacklist is designed to stop email with a certain characteristic (one that does not mention content, desirability, “requestedness” or anything else but that one named characteristic), then a performance report that does not include (as a minimum) that characteristic has to be at best meaningless and at worst misleading.
And I hope we are not burning all this energy on something meaningless.
I don’t see how anything useful can come from reporting statistics about “false positives” and so on if the design-intent of the blacklist in question is not considered.
The most common application of a DNSBL is to score, tag, filter, or outright block spam. The intent of the DNSBL operator is meaningless, as well as effectively untestable. Even if it weren’t, virtually no one is using DNSBLs to test if they do what they say they do, they use them to block spam.
In that context, the only meaningful evaluation of that DNSBL in a vacuum is “does it block spam, and not block not-spam?”. Next to other DNSBLs, there may be cumulative effects – perhaps one DNSBL blocks certain types of spam that other DNSBLs miss. But very few people give a toadshit whether or not a DNSBL is operating in congruence with its public policies, quite so much as they care whether it blocks spam and doesn’t block nonspam.
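As a rough illustration of that evaluation, the sketch below scores two made-up lists against the same labelled mail, counting spam blocked, non-spam blocked, and spam that only that list caught (the "cumulative effect" case). The list names list_a and list_b, the evaluate function, and the sample data are all hypothetical; this is one possible way to tabulate the question, not how any existing stats site does it.

```python
def evaluate(messages, blacklists):
    """Score each blacklist on the same labelled mail.

    messages: list of (source_ip, is_spam) pairs.
    blacklists: dict mapping a (hypothetical) list name to its set of IPs.
    """
    report = {}
    for name, listed in blacklists.items():
        # Every IP listed by any of the *other* blacklists.
        others = set().union(
            *(ips for other, ips in blacklists.items() if other != name)
        )
        report[name] = {
            "spam_blocked": sum(
                1 for ip, spam in messages if spam and ip in listed
            ),
            "nonspam_blocked": sum(
                1 for ip, spam in messages if not spam and ip in listed
            ),
            # Spam this list stops that no other list would have stopped.
            "unique_spam_blocked": sum(
                1 for ip, spam in messages
                if spam and ip in listed and ip not in others
            ),
        }
    return report

# Made-up sample mail and lists (documentation-range IPs).
messages = (
    [("192.0.2.10", True)] * 3
    + [("192.0.2.10", False)]
    + [("198.51.100.7", True)] * 2
    + [("203.0.113.5", False)] * 2
    + [("203.0.113.5", True)]
)
blacklists = {
    "list_a": {"192.0.2.10"},
    "list_b": {"192.0.2.10", "198.51.100.7"},
}

report = evaluate(messages, blacklists)
for name, row in report.items():
    print(name, row)
```

Note that nothing here consults either list's published criteria; the only inputs are which emails were blocked and whether each one was spam.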
Let me retreat to what I hope will be a more emotion-free argument.
I am trying to put together a system (implies more than one component) to deliver my inbound mail to my electric “inbox” as quickly and as clutter-free as possible.
In measuring (we used to say “benchmarking”) the various candidate components, some need to be measured with a ruler that is defined in terms that report on “quickly” (how fast the transfer rates are, how many operations per second, how big (maybe), and so on).
Other components need to be measured in terms of how fast they learn (for a Bayesian filter, perhaps), how well they stop email from well-known spam sources, how well they stop email from sources that should never send email, how well they stop spam and so on, all of which will be functions of what the particular component is designed to do.
Measuring against what they should have been designed to do seems suboptimal since I don’t have useful definitions for “should” and all I can do is measure what they do.
If somebody else would like for me to use their report, it will have to report in useful terms. If somebody wants me to believe that a candidate component is “bad” I’d like the report is in terms that make sense to me.
“If somebody else would like for me to use their report, it will have to report in useful terms. If somebody wants me to believe that a candidate component is “bad” I’d like the report is in terms that make sense to me.”
That ‘graf certainly has something wrong with it. I’ll try again.
If somebody else would like for me to use their report, it will have to report in terms useful to me.
If somebody wants me to believe that a candidate component is “bad” I’d like the report to use terms that make sense to me. (Terms that make sense to the reporter are not necessarily sensible to me.)
As a long-time participant in NANAE, please let me add my 0.02 worth to the definition of “spam”.
Emails I consider “spam” are almost always a solicitation of some kind from a sender not known to me.
Most emails of this nature are sent with the intention that the recipient will purchase a product or service, or visit a website (where the owner of the site has something for sale or is getting paid by advertisers on a per-exposure and/or per-click basis).
Direct sales isn’t the only thing; some spammers (often they don’t realize they’re spamming) want to push their point of view on some ideological topic, though I seldom see any of those. Even then, if you drill down deep enough into such solicitations by visiting the web site or calling the phone number in the email, you’ll likely find requests for donations and/or books, tapes, CDs and DVDs offered for sale.
Thus when all is said and done, “spam” is largely in the eye of the beholder, and it’s going to be a long time (if ever) before anyone can design and implement a spam-control system that stops all spam while allowing all wanted (ham) email.
-Herb