Rethinking SpamGourmet's Approach

Postby ghost » Wed Jun 30, 2004 4:32 am

I love spamgourmet. It works wonders, and within its concepts lies the key to truly making spam unprofitable. So, on our quest to make spam unprofitable, let's ensure that we don't go broke in the meantime!

Right now, if a spam message is going to be 'eaten', we have to accept the message first. Then it is analyzed and dropped. Well, what if we could perform this analysis BEFORE the message is even sent?
Using a sendmail milter, we can analyze the address provided to the RCPT TO: SMTP command and see if this address is allowed any more e-mail. If it is, the spamgourmet milter will accept it as it always has. If the address is no longer valid or has expired, we return an SMTP error code stating that "the user account no longer exists." This will tell the sender the account is dead. By doing this, we have most likely removed the address from that particular spammer's list, so we won't get asked again. On top of that, the message was never sent to us: we blocked it before the SMTP DATA command could be issued, saving loads of bandwidth and processor time. (We didn't have to parse any message!)
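
A rough sketch of that RCPT-time decision, in Python for brevity rather than the C or Perl proposed here. The address table, the sample addresses, and the function name are all made up for illustration; a real milter would query the sg database. Unknown addresses are accepted on the assumption that they may be brand-new disposable addresses.

```python
# Hypothetical stand-in for the sg address table:
# address -> number of messages the address may still receive.
REMAINING = {
    "shop.3.alice@example.com": 2,
    "old.1.alice@example.com": 0,
}


def rcpt_reply(address, remaining=REMAINING):
    """Return the (code, text) SMTP reply for one RCPT TO address."""
    count = remaining.get(address.lower())
    if count is not None and count <= 0:
        # Rejected before DATA, so the message body is never transferred.
        return (550, "User account no longer exists")
    # Live address, or one we have never seen (assumed to be new).
    return (250, "OK")
```

The point is only that the decision needs nothing but the recipient address, which is already known before DATA.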

I'd like to write this milter up in C. A Perl version is just as feasible, but a C one is native and would be much more efficient. It'd give us a lot of flexibility, and save a lot of bandwidth from being needlessly wasted.

What are your opinions!? :)

Posts: 6
Joined: Wed Jun 30, 2004 4:25 am

Postby maratheamit » Wed Jun 30, 2004 1:29 pm

Something similar has been on my mind for some time. Two things need to be considered however:
1. Before implementing this we need a detailed breakdown of the spam eaten by SG to decide whether writing/testing this feature is worthwhile. If most of the messages are being eaten only after database lookups, it is probably not worth spending time on trying to reject incoming mail at the MTA.
2. This makes SG less portable, which is not an immediate concern. But long-term, Josh would like other sites to run their own service with the SG code. To preserve portability we should maintain redundancy in the spameater Perl script, i.e., it should work regardless of whether some messages are being rejected at the MTA.
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Postby ghost » Wed Jun 30, 2004 1:59 pm

Well, if a milter was doing filtering at the MTA level, and you continued to run the spameater script as is, you'd run into trouble. Suppose the milter recognized a new address and marked it in the database for, say, 3 messages. If the spameater script then got the message and said "this address already exists, let me decrement it," that one message would count as 2. That could get very complicated.

You're right about the loss of portability: sg as a milter would only work on fairly recent sendmail distros. That would be one downside to the milter approach, BUT on the upside, we wouldn't be silently eating mail either. We would be actively telling the sender, "hey, this account no longer exists." That is a good thing. The sender would have to take yet another address off of their lists, reducing future spam to the disposable email address and saving even more bandwidth. The bandwidth savings aren't just for sg either; anyone who runs the sg code will be saving bandwidth as well.

Anywho, there are benefits and pitfalls in either approach. Hence why public discussion is important :)

Posts: 6
Joined: Wed Jun 30, 2004 4:25 am

Postby josh » Wed Jun 30, 2004 4:29 pm

my pipe dream for this would be to

a) continue refactoring the sg mail handler code
b) use one of the Perl Milter modules to wrap the code for in-process sendmail deployment while maintaining the ability to run stand-alone

does that make sense?
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm

Postby ghost » Wed Jun 30, 2004 5:54 pm

Well then, what would make the most sense is to rewrite the current functionality of sg into a single Perl module. Then we can write an SG milter and an updated spameater: two separate programs, but both using the same underlying module to provide database access and parsing routines. The milter and spameater script would simply be a means to access the services provided by the shared module. How's that sound?
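
As a sketch of that split (in Python rather than Perl, with every class and function name invented for illustration), the shared module could be the only code that touches address counts. The milter front end only reads; the spameater front end is the one place a message is counted, which sidesteps the double-decrement problem mentioned earlier.

```python
class SpamGourmetCore:
    """Hypothetical shared module: the only code that touches address counts."""

    def __init__(self, initial_count=3):
        self.initial_count = initial_count
        self.remaining = {}  # address -> messages still allowed

    def is_expired(self, address):
        """Read-only check; safe to call at RCPT TO time."""
        return self.remaining.get(address, self.initial_count) <= 0

    def consume(self, address):
        """Create the address on first use, then count one message against it."""
        left = self.remaining.setdefault(address, self.initial_count)
        if left <= 0:
            return False  # message would be eaten
        self.remaining[address] = left - 1
        return True  # message would be forwarded


def milter_on_rcpt(core, address):
    """Front end 1: runs inside the MTA and never decrements."""
    return "550 no such user" if core.is_expired(address) else "250 ok"


def spameater_on_message(core, address):
    """Front end 2: stand-alone delivery script, decrements exactly once."""
    return "forward" if core.consume(address) else "eat"
```

Either front end can run without the other, so a site that cannot use a milter still gets the same behaviour from the stand-alone script.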

Posts: 6
Joined: Wed Jun 30, 2004 4:25 am

Postby maratheamit » Thu Jul 01, 2004 1:48 pm

I second the suggestion for a module which can then be called from sendmail or an independent process. The original impression I had was that the milter would perform only the most simple rejections (like messages coming to non-existent usernames or expired addresses). But after some thought, I agree that if we modularize the SG filtering functionality it should be quite simple to incorporate it into a sendmail milter. That will definitely save us bandwidth by rejecting messages before getting the DATA command.

The only concern I have is efficiency. We have been talking for some time about daemonizing the spameater process by having it process messages only every few minutes (and not as they come in). The milter idea is less efficient in that respect since it will launch multiple perl interpreters. But I guess that is no less efficient than what we are doing now.
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Postby ghost » Thu Jul 01, 2004 2:14 pm

The C milter API runs everything as threads, so there is only one process. I wasn't sure how the Perl milter modules work internally though. However, after some googling I found this on the Sendmail::Milter readme page.

"With this module, you can define and register Perl callbacks with the Milter
engine. This module calls your perl callbacks using interpreters from a
threaded persistent interpreter pool."

So there will actually be A LOT less overhead going this route, since we won't have to spawn new interpreters for each request; they will be pooled and called on as needed. That's more like the way Apache's mod_perl works, if I understand it all correctly.
Posts: 6
Joined: Wed Jun 30, 2004 4:25 am

Postby josh » Thu Jul 01, 2004 5:04 pm

yeah, that was my understanding of how the Perl Milter modules (at least some of them) worked - for me, that's the primary attraction of going to this model in our current production environment.

The "bounce vs. eat" debate has flared up at different times here and there are concerns on both sides independent of bandwidth -- for instance, many sites will deactivate an account if the email address associated with it appears invalid (e.g., I have online shopping accounts that I re-activate by refilling addresses on sg every Christmas season). For this reason, we probably couldn't start bouncing wholesale without negatively affecting many users (now if we could figure out a way to not bounce and still not accept the message body... I can't think of a way to do that, though). Perhaps we could bounce "hidden" addresses, or let the users set the option, etc.
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm

Postby ghost » Thu Jul 01, 2004 5:30 pm

This is a take on the greylisting concept: rather than saying "this account no longer exists," and instead of eating the message, we could delay indefinitely. Greylisting tells the sending server "yeah, I'll accept that, just come back and ask me in X minutes," so we'd just keep telling the remote server to come back in 2^32 minutes.

That's probably not very nice to do though.

Could we potentially return an error code that won't result in a bounce, but will prevent the DATA command from being executed?

Could we drop the connection?

Could we just break protocol, and not allow an expired address to be used with a DATA command? So when the remote side sends "DATA" we just return "command not recognized"? We could probably test whether this creates a bounce.

Posts: 6
Joined: Wed Jun 30, 2004 4:25 am


Postby nsomos » Fri Jul 09, 2004 3:47 pm

I like the idea of letting users decide for themselves the bounce
versus eat question.

I must admit unfamiliarity with the SG code, or I wouldn't ask the
following question.

Do you currently track the number of first-time eats per address
as opposed to the number of second and subsequent eats?

If not, it would be good to distinguish between these two.

The reason is, provided the percentage of second and subsequent
eats is high enough, there are techniques which can speed up what
would otherwise be an expensive database lookup.

There are ways to implement near-perfect hashes using both
forward and reverse 32-bit CRCs with a varying number of buckets.
Once an address has been eaten for the first time, it can be
added to the hashtables. Prior to any database lookup, the
hashtables are referenced to determine if this address is one
which is being eaten. I can go into greater detail later. If people
'reload' an address, then a similar operation removes the entry
from the hashtables based on address being 'reloaded'. I am
aware of the potential for false positives and know how to
prevent them. I can provide more details later if it is called for.

None of this is worth the bother unless the percentage of second
and subsequent eats is high enough.
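
One possible reading of the hashtable idea, sketched in Python. Treating "forward and reverse" as the CRC-32 of the address bytes and of the same bytes reversed is an assumption, as are the class name and bucket count. Keeping the full address next to the CRC pair is one simple way to rule out false positives, not necessarily the method meant above.

```python
import zlib


class EatenCache:
    """In-memory pre-filter for addresses already known to be eaten."""

    def __init__(self, buckets=1024):
        self.buckets = [dict() for _ in range(buckets)]

    def _locate(self, address):
        """Return (bucket, key, normalized address) for an address."""
        normalized = address.lower()
        data = normalized.encode("utf-8")
        forward = zlib.crc32(data)
        reverse = zlib.crc32(data[::-1])  # CRC of the reversed bytes
        bucket = self.buckets[forward % len(self.buckets)]
        return bucket, (forward, reverse), normalized

    def add(self, address):
        """Record an address after it has been eaten for the first time."""
        bucket, key, normalized = self._locate(address)
        bucket.setdefault(key, set()).add(normalized)

    def remove(self, address):
        """Forget an address, e.g. after its owner reloads it."""
        bucket, key, normalized = self._locate(address)
        entries = bucket.get(key)
        if entries is not None:
            entries.discard(normalized)
            if not entries:
                del bucket[key]

    def is_eaten(self, address):
        """True only on an exact match, so CRC collisions cannot misfire."""
        bucket, key, normalized = self._locate(address)
        return normalized in bucket.get(key, ())
```

A hit answers "eat this" without touching the database; a miss falls through to the normal lookup.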

Another thought that comes to mind ... if you are going to eat a
message, would it make sense to delay the response to the
RCPT TO, to some minutes? If you can make sending spam
more expensive or slower, it reduces its worth to spammers.
If you are going to eat messages anyway, they really are likely
spam. Perhaps there are other techniques which will increase
the cost to spammers (or those that facilitate spam) which
could be employed only in those cases where messages will
be eaten.
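
A minimal sketch of the delayed RCPT TO reply, in Python. The function name, the default delay, and the injectable `sleep` parameter (there so the logic can be exercised without waiting) are all illustrative assumptions.

```python
import time


def rcpt_with_tarpit(will_be_eaten, delay_seconds=120.0, sleep=time.sleep):
    """Answer RCPT TO, stalling first when the message would be eaten anyway."""
    if will_be_eaten:
        # Hold the sender's connection open before replying.
        sleep(delay_seconds)
    return "250 ok"
```

The trade-off is that every stalled connection also occupies a thread or socket on the receiving side, and a sender may give up if the delay exceeds its own RCPT timeout.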
Posts: 10
Joined: Wed Jun 23, 2004 3:06 pm

Postby ghost » Fri Jul 09, 2004 4:00 pm

I like your idea for tarpitting eaten addresses, so people sending mail to an address marked as 'eaten' get stuck for some period of time. It's not necessarily a bad thing for people to view us as hostile to spam :-D

I agree with your hashtable description; that's what I planned on doing. I'd think keeping the milter very small, and letting a portable backend process do more of the grunt work, is the ideal way to do this.

Posts: 6
Joined: Wed Jun 30, 2004 4:25 am


Postby nsomos » Fri Jul 09, 2004 7:31 pm

In those cases where messages will wind up being eaten,

perhaps the reply to the RCPT TO could
(in addition to being tardy by a couple minutes)
be something like ...

452 Requested action not taken: insufficient system storage
..... or .....
552 Requested mail action aborted: exceeded storage allocation

This also gives plausible deniability for why the response
is so sluggish. The system is clearly overloaded from SPAM!

This might also work well for those people who reload shopping
addresses for the holiday season, but let them lie fallow otherwise.

This may be the sort of thing you might expect if a mailbox
or system is jammed full of spam.
Posts: 10
Joined: Wed Jun 23, 2004 3:06 pm

Any progress?

Postby henrik » Wed Nov 02, 2005 5:07 pm

It's been a while since this topic was actively discussed.
Did anybody try to implement it?

I would really like to see such a feature on spamgourmet.

Preferably with a per-address selection of whether you want
to get those surplus emails eaten, or rejected with a 4xx or 5xx code.
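
A sketch of what such a per-address setting might look like, in Python. The enum, its member names, and the function are hypothetical; the 452 text is the one quoted earlier in the thread and the 550 text is the standard SMTP wording.

```python
from enum import Enum


class Surplus(Enum):
    """What to do with mail sent to an address that has run out."""
    EAT = "eat"            # accept the message, then silently drop it
    TEMPFAIL = "4xx"       # temporary failure: the sender queues and retries
    REJECT = "5xx"         # permanent failure: the sender gives up


def surplus_reply(policy):
    """Return the (code, text) RCPT TO reply for a surplus message."""
    if policy is Surplus.TEMPFAIL:
        return (452, "Requested action not taken: insufficient system storage")
    if policy is Surplus.REJECT:
        return (550, "Requested action not taken: mailbox unavailable")
    # EAT: look like a normal delivery; the message is dropped after DATA.
    return (250, "OK")
```

Defaulting to EAT would keep today's behaviour for anyone who never touches the setting.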
Posts: 1
Joined: Wed Nov 02, 2005 2:46 pm

Postby milkbadger » Sat Feb 23, 2008 8:45 pm


wondering if SMTP rejection is still being considered as a feature. i personally would like to see it -- prior to becoming a spamgourmet user, i used /etc/aliases on my (now defunct) home mail server to achieve the same effect. obviously it may have performance implications -- i'd be willing to offer my own account for any analysis.

Posts: 16
Joined: Sat Feb 23, 2008 8:26 pm
