captchagen

Discussion re sg development. You don't have to be a developer.

Postby Guest » Sat Nov 01, 2003 5:02 pm

Syskoll,
I just had a chance to play with captchagen -- pretty cool!!

I added a subdirectory called web to the captcha subdirectory in the test home dir. This directory can be browsed at test.spamgourmet.net/tmp

I'm working out how to plug it in. We'll need to move the account signup form onto its own page so that we don't need to have one of those images on every page -- alternatively, we could leave the form where it is, then add the challenge as a last step (probably moving the password part there, so we don't have to send it back to the browser for resubmission from the challenge page).

I also have the challenge of not having root on the webserver machine, and so not being able to easily install Image::Magick or File::Temp (I can work around those problems eventually, of course).

Here's one approach --

a) when the challenge page is created, it calls captchagen with a safe random filename (perhaps a hash of the proposed username) that creates the file in the sg web folder. It receives the word (probably lc()'s the word), and uses a private hash algorithm to create a hash that is included as a hidden input (I guess this could be the filename, for that matter) -- the image is included on the page

b) when the user submits the challenge form, the user typed word (probably lc()'ed) is hashed using the private algorithm to see if it matches the hidden input. If so, create the account.

The only wrinkle if we have to run captchagen on the mail server machine (where we have root) is that the call from the web code to captchagen will have to be remote.

This was just my first stab at it -- there may be a cleaner way. I'd love to not have to store the images, particularly if we're on the webserver machine (they're pretty big), which would involve causing captchagen to return the png straight to STDOUT instead of the word. I haven't been able to work out an implementation strategy for that, though.
Guest
 

Captchagen and web server on 2 different machines?

Postby SysKoll » Sun Nov 02, 2003 5:10 am

Guest wrote:We'll need to move the account signup form onto its own page so that we don't need to have one of those images on every page -- alternatively, we could leave the form where it is, then add the challenge as a last step (probably moving the password part there, so we don't have to send it back to the browser for resubmission from the challenge page).


Agreed. We definitely need a separate registration page.

Guest wrote:I also have the challenge of not having root on the webserver machine, and so not being able to easily install Image::Magick or File::Temp (I can work around those problems eventually, of course).


The simplest solution would be to convince the HE sysadmins to install on the web server machine the modules required by captchagen. However, if that's not possible, we could put captchagen on the mail server and put a link from the registration web page to the CAPTCHA image, but that leaves the problem of the quizzword (the word that the user has to read in the CAPTCHA), which is returned by the routine: how do we get it in this scenario? I see at least two ways:

1. A simple solution that comes to mind is to have the mail server export a certain directory through NFS and have the web server mount it. This way, the quizzword returned by captchagen can be saved in an NFS-exported local file and that file can be read by the web server. Not very clean, but extremely simple.

2. Barring that, we can have a simple service running on the mail server. The mail server does not export anything. When the registration routine running on the web server wants to validate the quizzword, it sends both the typed word and the hash or ID of the captcha to that service on the mail server. The service retrieves the correct info and replies VALID/INVALID. More complex.

As for your approach, I agree with one reservation (see below).

Guest wrote:a) when the challenge page is created, it calls captchagen with a safe random filename (perhaps a hash of the proposed username) that creates the file in the sg web folder. It receives the word (probably lc()'s the word), and uses a private hash algorithm to create a hash that is included as a hidden input (I guess this could be the filename, for that matter) -- the image is included on the page

b) when the user submits the challenge form, the user typed word (probably lc()'ed) is hashed using the private algorithm to see if it matches the hidden input. If so, create the account.


Actually, we cannot send to the user a hash of the quizzword. Remember, some day, the SG code will be public, and everyone will be able to see what hash method you use for producing that hidden input. Then, considering that captchagen's dictionary has only 14,000 words, it will be very simple to hash all the dictionary words and compare their hashed value to the one sent by the registration page. And voila, Spammy got himself an SG registration script. Double plus ungood.
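To make the attack concrete, here is a minimal sketch in Python (for illustration only; SG itself is Perl). It assumes the hidden input is a plain MD5 of the lowercased word, and the four-word list is a hypothetical stand-in for captchagen's roughly 14,000-word dictionary.

```python
import hashlib

def hash_word(word):
    """Unsalted hash of the lowercased quizzword (assumed scheme, for illustration)."""
    return hashlib.md5(word.lower().encode("utf-8")).hexdigest()

def crack(hidden_hash, dictionary):
    """Return the dictionary word whose hash equals hidden_hash, or None."""
    for word in dictionary:
        if hash_word(word) == hidden_hash:
            return word
    return None

# Hypothetical stand-in for the real dictionary.
dictionary = ["apple", "gourmet", "spam", "zebra"]

hidden_input = hash_word("Gourmet")      # what the registration page would send
recovered = crack(hidden_input, dictionary)
```

With only a few thousand candidates, the loop finishes almost instantly, so the image never needs to be read at all.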

Incidentally, the attack described above is the reason why /etc/passwd on Unix doesn't contain hashed passwords anymore. Dictionary attacks on these hashes were way too common.

What we can do is send (e.g., as a hidden input) the name of the local server-side file that contains the quizzword. Since the name is unique and the file is short-lived, there is no risk of attack here.

So your approach becomes (I italicized the diffs):

a) when the challenge page is created, it first verifies that the proposed user name is not already taken. Then it calls captchagen with a safe random filename (perhaps a hash of the proposed username and of the localtime() timestamp, or we can use File::Temp if available) that creates the file in the sg web folder. It receives the quizzword (probably lc()'s the word), saves the quizzword in a temp file and puts that temp file name in the web page as a hidden input (I guess this could be the filename, for that matter) -- the image is included on the page

b) when the user submits the challenge form, the user typed word (probably lc()'ed) is compared to the word stored in the server-side file. If they match, create the account.
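Steps a) and b) can be sketched roughly as follows, in Python for illustration rather than SG's Perl. The function names, the `quiz_` prefix and the single-use deletion are assumptions made for the sketch, not part of captchagen.

```python
import os
import tempfile

def issue_challenge(quizword, store_dir):
    """Save the lowercased quizzword server-side; return the file's base name as the token."""
    fd, path = tempfile.mkstemp(prefix="quiz_", dir=store_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(quizword.lower())
    return os.path.basename(path)

def check_answer(token, typed_word, store_dir):
    """Compare the typed word with the stored one; the token is single-use."""
    # Reject tokens that could point outside the store directory.
    if os.path.basename(token) != token or not token.startswith("quiz_"):
        return False
    path = os.path.join(store_dir, token)
    try:
        with open(path) as handle:
            expected = handle.read()
        os.remove(path)  # a token cannot be replayed once checked
    except OSError:
        return False
    return typed_word.strip().lower() == expected
```

The token returned by `issue_challenge` is what would go into the hidden input; it reveals nothing about the word itself.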

The determining factor will be whether or not you can run captchagen on the web server. Please let us know ASAP.
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

Postby josh » Wed Nov 05, 2003 2:14 am

again - sorry about the delay (real busy at $DAYJOB)

I will make captchagen work on the web server somehow, I decided.
josh
 
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm

Great decision

Postby SysKoll » Wed Nov 05, 2003 5:04 am

Josh,

I am relieved. This decision will make our job way easier.

If only lawyers at my $DAYJOB were so resolute... (I am waiting for our IP Legal to tell me whether or not it's OK to publish a bloody Java paper and it's been weeks already!)
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

Postby maratheamit » Thu Nov 06, 2003 12:24 am

why not move the web server to the dedicated machine (on which josh has root permissions)?
maratheamit
 
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Postby josh » Thu Nov 06, 2003 2:26 am

maratheamit wrote:why not move the web server to the dedicated machine (on which josh has root permissions)?


The two reasons I have for not moving are:

1) I *think* it's a better bandwidth deal - the communication b/t the webserver and mailserver is pretty light, and the webserver occasionally lights up in terms of html and png. I believe bandwidth overruns are cheaper on the webserver. I haven't run these numbers, I'm just estimating.

2) The stupid mistake I made by using the local crypt() system call to encrypt passwords, and the fact that the local libraries are different on the two machines. Since I caught that, I switched over to MD5, both for new accounts and for anyone who logs in. By now, most have probably been converted, but there's probably still thousands that haven't. They do have the auto-password reset thing now, though.

3) I don't think the website is very CPU intensive, but I do believe general performance is better when it's not on the mail box. Also the mail server gets slammed by scripts a few times a week (I'm not sure how long the episodes last, but it's definitely long enough to lose a few web surfers), and the web box can still gracefully dish out pages to people who aren't logged in because it uses lazy db calls for the stats and other dynamic content.

oh, that's three
josh
 
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm

Postby josh » Thu Nov 06, 2003 3:00 am

Josh:
I will make captchagen work on the web server somehow, I decided.


The machine:
Perl 5.005 required--this is only version 5.00401, stopped at [...]/File/Temp.pm line 122.


I've been getting my butt kicked by computers all day -- I'm starting to get used to it :)
josh
 
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm

Postby SysKoll » Thu Nov 06, 2003 5:46 am

Josh,

josh wrote:I've been getting my butt kicked by computers all day -- I'm starting to get used to it


You're using Windows at work and the patch of the patch is not working? :P

josh wrote:Perl 5.005 required--this is only version 5.00401, stopped at [...]/File/Temp.pm line 122.


So File::Temp is not working, eh? Shoot. OK, I'll replace it with a hash of the PID and the current time stamp. Can you please copy somewhere in ~sgtest the MD5 code that you're using on the web server? I'd like to take a look at it and use that hash since it's provably working on the web server.
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

Postby maratheamit » Thu Nov 06, 2003 5:48 am

We don't need to create two files (as Syskoll is suggesting). What we can do is have the captchagen script create the distorted image in a temporary file. This file can be named after the desired username (No need for File::Temp). The word returned by captchagen can be appended to a secret key (which can live in a config file on the web server). The concatenation of the captchagen word and the secret key can be hashed and this hash can be included on the challenge page as a hidden argument.

The check after the user submits the response is simple: append the word returned by the user to the secret key and compare the hash of this concatenated string to the hidden argument. If they match, register the user. Otherwise, delete the image file. There also needs to be some mechanism to delete image files for users who never respond to the challenge. And there is a potential race condition when two new users ask for the same username: the best way to handle this is for captchagen to create the image file and to throw an exception when the file already exists.
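A minimal sketch of this keyed-hash check, in Python for illustration. The key value and function names are hypothetical; MD5 is used only because it is the hash already mentioned in this thread, and the key would be read from a config file rather than hard-coded.

```python
import hashlib
import hmac

SECRET_KEY = "change-me"  # hypothetical; would live in a config file on the web server

def challenge_hash(word, secret_key=SECRET_KEY):
    """Hash of secret key + lowercased word, for the hidden form field."""
    data = (secret_key + word.lower()).encode("utf-8")
    return hashlib.md5(data).hexdigest()

def verify(typed_word, hidden_hash, secret_key=SECRET_KEY):
    """True if the typed word reproduces the hash sent with the challenge."""
    expected = challenge_hash(typed_word.strip(), secret_key)
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(expected.encode("utf-8"), hidden_hash.encode("utf-8"))
```

No per-challenge state is kept on the server: the word is never stored, only recomputed from what the user typed.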

Also, is there a pressing reason to restrict ourselves to dictionary words? I am uncomfortable with the small space of 14K words. Why not a random 8 letter word? That will also get us away from having to worry about inappropriate/offensive words in the dictionary.
maratheamit
 
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Secret hash no good if dictionary

Postby SysKoll » Thu Nov 06, 2003 5:10 pm

Amit,

You're right. We can build a string by concatenating a secret key and the quizword returned by captchagen, create a hash, and send the hash to the user as a hidden value field. Without the secret key, the chances of finding the right hash are pretty slim. Why didn't I think of that? The code is public (or will be) but the key can be a secret picked by the site owner.

Josh, this should simplify programming.

As you suggest, I'll replace the File::Temp name with a file name derived from the proposed username and maybe a timestamp. I'll not throw an exception because this will generate an error if you reload the page. We'll need a cron job to clean up the temp dir for users who never complete the registration.
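The cleanup job could look something like this sketch (Python for illustration; the function name and the age threshold are assumptions). A cron entry would simply run it against the temp dir at a regular interval.

```python
import os
import time

def purge_stale(directory, max_age_seconds, now=None):
    """Delete regular files older than max_age_seconds; return the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Only touch plain files whose last modification is too old.
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

For example, `purge_stale("/path/to/captcha/tmp", 3600)` would drop images abandoned for more than an hour; the path here is a placeholder.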

Finally, regarding dictionary vs. random quizwords, there is no compelling technical reason. A captcha containing a dictionary word is more user-friendly than a random string, that's all.

You are worrying about the number of possible dictionary words (keyspace size = 14,000) as opposed to the keyspace size of random words and letters. But look: when we show the captcha to the user, we also send him that hash generated with a secret key. Assume the secret key is 8 chars. An 8-char key (letters + digits) is (26+10)^8 = 2.8 * 10^12 possibilities, a very generous key space. Then we have the quizword itself, another factor of 14,000, for a total of 2.8 * 10^12 * 14000 = 4*10^16. So I think we're safe.
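The arithmetic can be checked in a few lines (Python, for illustration; the variable names are mine):

```python
key_alphabet = 26 + 10      # letters plus digits, as assumed above
key_length = 8
dictionary_size = 14000

key_space = key_alphabet ** key_length        # possible 8-char keys
total_space = key_space * dictionary_size     # key combined with quizword

print(f"{key_space:.1e}")     # 2.8e+12
print(f"{total_space:.1e}")   # 3.9e+16
```

The product comes to about 3.9 * 10^16, which rounds to the 4*10^16 quoted above.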

With such a large key space, I think we can afford to use dictionary words for better user-friendliness.

But just to make it safe, I'll implement both possibilities
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

Re: Secret hash no good if dictionary

Postby maratheamit » Fri Nov 07, 2003 3:16 am

SysKoll wrote:As you suggest, I'll replace the File::Temp name with a file name derived from the proposed username and maybe a timestamp. I'll not throw an exception because this will generate an error if you reload the page. We'll need a cron job to clean up the temp dir for users who never complete the registration.

I had not thought about users reloading the page. Throwing an exception would be a bad idea for that reason. And a cron job is the simplest solution for cleaning up abandoned image files.

But we still need to account for the race condition of two new users asking for the same username. In such a case, captchagen should not overwrite the existing image file. And this case should be handled differently from the reload case. To captchagen they will look identical but maybe the cgi code can make the distinction. Any ideas? This is a rare enough case that we may decide not to do anything about it but we should at least expend the effort to see whether there is a simple fix.
maratheamit
 
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Re: Secret hash no good if dictionary

Postby maratheamit » Fri Nov 07, 2003 3:41 am

SysKoll wrote:Finally, regarding dictionary vs. random quizwords, there is no compelling technical reason. A captcha containing a dictionary word is more user-friendly than a random string, that's all.

You are worrying about the number of possible dictionary words (keyspace size = 14,000) as opposed to the keyspace size of random words and letters. But look: when we show the captcha to the user, we also send him that hash generated with a secret key. Assume the secret key is 8 chars. An 8-char key (letters + digits) is (26+10)^8 = 2.8 * 10^12 possibilities, a very generous key space. Then we have the quizword itself, another factor of 14,000, for a total of 2.8 * 10^12 * 14000 = 4*10^16. So I think we're safe.

With such a large key space, I think we can afford to use dictionary words for better user-friendliness.

But just to make it safe, I'll implement both possibilities

The secret key exists just to simplify programming on the server side. After all, we could maintain the association between the image and the captcha word in a separate file. What I am thinking about is someone attempting a dictionary attack on our scheme by collecting and storing the captcha image files.

If someone writes a bot to register 14000 unique usernames he can collect 14000 image files and, going through them at a rate of a few hundred per day, infer a large fraction of our dictionary in a matter of days. That still does not get him all the way to an automated registration script because the captcha background changes randomly, but I don't know how much of a deterrent that is to a good programmer.

I know I am being paranoid here, but it helps to think about how our scheme can be broken. And I agree with the user-friendliness argument Syskoll makes. So we can stick to dictionary words. However, we should do at least one of two things:
1. change the captcha background every few weeks.
2. include a 4 digit random number in the captcha image (this would be the equivalent of salting the password file on unix).
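Option 2 could be sketched like this (Python for illustration; the function names and the two-word list are hypothetical). Each extra digit multiplies the number of distinct challenges by ten.

```python
import secrets

def make_quizword(dictionary, n_digits=4):
    """Pick a dictionary word and append n_digits random decimal digits."""
    word = secrets.choice(dictionary)
    digits = "".join(secrets.choice("0123456789") for _ in range(n_digits))
    return word + digits

def keyspace(dictionary_size, n_digits):
    """Number of distinct word+digits combinations."""
    return dictionary_size * 10 ** n_digits

# Hypothetical stand-in for the real dictionary.
sample = make_quizword(["spam", "gourmet"], n_digits=4)
```

With 14,000 words and 4 digits that gives `keyspace(14000, 4)`, i.e. 140 million combinations instead of 14,000.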
maratheamit
 
Posts: 82
Joined: Fri Aug 29, 2003 2:35 pm

Changing background + numbers

Postby SysKoll » Fri Nov 07, 2003 5:05 am

Amit,

I like your thinking. You're paranoid and you don't trust either man or machine. Good. :-)

The attack you describe -- storing thousands of random captcha images -- would not help considering the random background. Suppose you get around the randomness and can identify a captcha image and match it with a previously loaded one. So what? You're looking for the word inside the image, which would require some very clever, not-yet-existent form of OCR. So that attack wouldn't help much. Unless you're suggesting that Spammy would actually go and catalog these images, entering the quizword in a database? That's a lot of work. If Spammy were that much of a worker, he'd have an honest job.

Nevertheless, it doesn't hurt to apply the extra safety measures you talked about. (See, I told you I was right with you paranoia-wise). I have space for 3 digits in the image (a 4th one would require a larger rectangle), so I'll add them at the end of the words. As for the background image, it's really just a static image. I am confident one of us could come up with a way to write a Gimp or ImageMagick script doing what I did to create the image, with different random factors. So that can be done. If we have the script, we can put it in a cron job or even upload the new image manually every few weeks. Any takers for the Gimp script?

maratheamit wrote:But we still need to account for the race condition of two new users asking for the same username. In such a case, captchagen should not overwrite the existing image file. And this case should be handled differently from the reload case. To captchagen they will look identical but maybe the cgi code can make the distinction. Any ideas? This is a rare enough case that we may decide not to do anything about it but we should at least expend the effort to see whether there is a simple fix.


I was thinking that when the image file name is generated (which should be done by the caller of the captchagen() subroutine), two elements can be concatenated to the user name: a timestamp, down to the second, and a random number (say 4 more chars). That should guarantee uniqueness. Then, of course, when the response to the captcha challenge is received and verified, whoever hits the database first wins.
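A sketch of such a name generator, in Python for illustration. The separator, the `.png` extension and the character set of the suffix are assumptions; the username is sanitized so it cannot introduce path separators.

```python
import re
import secrets
import string
import time

def image_name(username, now=None):
    """Build 'user-YYYYMMDDHHMMSS-xxxx.png' from a username, a timestamp and 4 random chars."""
    safe_user = re.sub(r"[^A-Za-z0-9]", "_", username)
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(4))
    return f"{safe_user}-{stamp}-{suffix}.png"
```

Strictly speaking the random suffix makes a collision within the same second very unlikely (1 in 36^4, about 1.7 million) rather than impossible.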

Note that we all assume here that the file name is available when we call captchagen(). That might not be the case. Josh proposed a separate page for registration that includes the usual entry fields (for username, password, forwarding address) and the captcha. What do we do in this case? I propose to just give the file a name made of a random string + timestamp.
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

New version of captchagen

Postby SysKoll » Mon Nov 17, 2003 5:52 am

Josh,

I am uploading a new version of captchagen. It contains a work-around for the absence of the tmpnam() routine: a Perl module called sgutils.pm contains a tempfile() routine that uses only sysopen() to create a temporary file. I believe this works even on the oldest Perl. It works well and supports multiple instances of the routine running simultaneously. The sysopen() call assumes Unix, let me know if you ever want to port SG to, say, an Atari ST. :-)
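For readers who have not seen the idea, here is a rough Python analogue (not the sgutils.pm code). It assumes the routine relies on an exclusive create, O_CREAT combined with O_EXCL, so that two simultaneous callers can never get the same file; the naming from PID and timestamp follows what was proposed earlier in the thread.

```python
import os
import time

def tempfile_excl(directory, prefix="cap", attempts=100):
    """Atomically create a new file; return (file descriptor, path)."""
    for attempt in range(attempts):
        name = f"{prefix}{os.getpid()}_{int(time.time())}_{attempt}"
        path = os.path.join(directory, name)
        try:
            # O_EXCL makes the open fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            continue  # name taken, try the next candidate
        return fd, path
    raise RuntimeError("could not create a unique temporary file")
```

The caller is responsible for closing the descriptor with `os.close(fd)` and for deleting the file when done.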

Also, as suggested by Amit, the CAPTCHA word now has 3 digits appended to it.

Take a look at /home/sgtest/captcha/captchagen-1.1.
-- SysKoll
SysKoll
 
Posts: 893
Joined: Thu Aug 28, 2003 9:24 pm

Postby josh » Mon Nov 24, 2003 7:54 pm

OK - I'm definitely getting there. Image Magick is not installed on the web server, but that doesn't bother me too much.

I've copied over the convert binary, and now it can't find libtiff.so.3 -- so I copied that over, too. I should already know this, but can anyone tell me how to make it findable by convert without putting it in /usr/lib (or some other directory that I don't have rights to)?
josh
 
Posts: 1371
Joined: Fri Aug 29, 2003 2:28 pm
