Project! Surprise!
Apr. 24th, 2004 06:57 pm (Zip continued)
So here's the deal on zipping. Though they probably weren't the first to have this idea, Dario Benedetto and a couple other Italians published this article in, of all places, Physical Review Letters in 2002. This basically said, "hey! If compression algorithms can compress more the more repetition there is, then a text by author X tacked onto the end of another text by X could probably be squeezed smaller than it could be if it were tacked onto the end of something written by Y!" Or, if you prefer,
"Where L(x) indicates the length in bits of the zipped file X. ... We shall look for the text Ai for which the difference L(Ai + x) - L(Ai) is minimum." ("Language Trees and Zipping", Benedetto, Dario et al, Physical Review Letters 88:4, 2002, 048702)
This was obviously the easiest method of all to test, and it was actually the one I had the most hope for, given that it combined an idea that made sense with an implementation that even I would find hard to screw up. So there's an Excel table. Unfortunately, it's one that I can't open on my machine, because I can't open Excel again until I find my Office XP CD (raggafrackin'...). But it exists. I printed it. It has results for Zip, RAR, and gzip. There are 8 rows (for the "known" text from each author) and 17 columns -- one for the compressed size of the "known" file alone, then one for the known file and each of the 8 test files zipped together, then one each for the delta (that is, L(225+376) - L(225)).
Don't see that there's much else to say about that, other than to track down zip algorithm details.
3. SVM with de Vel's measurements
Oh, man. Where do I start? Okay. A Support Vector Machine (SVM) is a kind of kernel machine (I've got references galore on what those are, but it's a pattern-recognition thing) and it seeks to minimize generalization error -- which is to say, you give it a bunch of training vectors and it tries to find the best separating hyperplane to distinguish the "yes" examples from the "no" examples. The distinguishing feature (I BELIEVE; BETTER VERIFY) is that giving it a huge number of features (like, as you'll see below, the two-hundred-some de Vel suggests) doesn't necessarily lead to it developing a really detailed rule which will only recognize exactly the training examples, and not other things which should fall into their class (overfitting); instead, it figures out which are the "support vectors", or basically the really important ones -- "the number of free parameters used in the SVM depends on the margin that separates the data and does not depend on the number of input features".
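For the record, here's roughly what that hyperplane business means, as a toy sketch -- this is a bare-bones linear SVM trained by subgradient descent on the hinge loss, emphatically not SVMlight, and the data is made up:

```python
import random

def train_linear_svm(data, labels, lam=0.01, epochs=300, seed=0):
    # Subgradient descent on the regularized hinge loss.
    # labels must be +1 ("yes" author) or -1 ("no").
    rnd = random.Random(seed)
    dim = len(data[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    idx = list(range(len(data)))
    for _ in range(epochs):
        rnd.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            score = sum(wj * xj for wj, xj in zip(w, data[i])) + b
            shrink = 1.0 - eta * lam
            w = [wj * shrink for wj in w]  # weight decay from the margin penalty
            if labels[i] * score < 1:
                # Point is on or inside the margin -- these are the ones
                # that actually shape the plane (the "support vectors").
                w = [wj + eta * labels[i] * xj for wj, xj in zip(w, data[i])]
                b += eta * labels[i]
    return w, b

def classify(w, b, x):
    # Which side of the hyperplane w.x + b = 0 does x fall on?
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Note that only the points with margin < 1 ever move the weights -- which is the intuition behind that "free parameters depend on the margin, not the number of features" quote.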
O. de Vel and his peeps wrote this paper, "Mining E-mail Content for Author Identification Forensics", which I considered to be the closest thing to my discussion group study, cause, you know -- online communication, spelling errors, casual nature, blah-de-blah. So I pretty much followed their lead. I used the same kind of software, SVMlight, developed by Thorsten Joachims. I used the Matlab interface to SVMlight developed by Anton Schwaighofer (http://www.cis.tugraz.at/igi/aschwaig/software.html).
In order to simulate de Vel et peeps' work, I used as many of their style marker attributes as were relevant. Those which distinguished tab spaces from other white-space characters were unnecessary since tab spaces were not present in my files, and the structural attributes (greeting, signature text, etc.) were also pretty specifically e-mail-related. Still, that left, oh, lots of features, which I tabulated using yet another cool Perl script.
(A note from not our sponsor at all, but a nifty sidenote -- thinkgeek.com has t-shirts that say "Perl Gerl". It's a bad color, but I like the t-shirt. If only they still had the "Geek" work shirts. That's what I really want.)
The attributes I used were as follows (M refers to the number of words in the document; V to the number of types, or different words; C to the number of characters; "hapax legomena" is the number of types that occur only once in the text; function words are those like "the", "if", "to" -- a list can be found by googling for "function words"):
- avg. sentence length
- avg. word length (in chars.)
- vocab richness (V/M)
- Total # of function words -- EXPLAIN
- Function word frequency distribution
- # short words (three or fewer letters)/M
- count of hapax legomena/M
- count of hapax/V
- Total # alphabetical chars/C
- total # upper-case chars/C
- total # digit chars/C
- total # white-space chars/C
- total # punctuation chars/C
- word length freq. distribution/M
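The Perl script itself isn't reproduced here, but a sketch of tabulating a few of those markers would look something like this (the names are mine, not de Vel's, and I'm leaving out the sentence-length, function-word, and distribution features):

```python
import re
import string

def style_markers(text):
    # Word tokens: alphabetic runs (apostrophes allowed), case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    M = len(words)              # number of word tokens
    V = len(set(words))         # number of types (distinct words)
    C = len(text)               # number of characters
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    hapax = sum(1 for w in counts if counts[w] == 1)  # types occurring once
    return {
        "avg_word_len":      sum(len(w) for w in words) / M,
        "vocab_richness":    V / M,
        "short_words_per_M": sum(1 for w in words if len(w) <= 3) / M,
        "hapax_per_M":       hapax / M,
        "hapax_per_V":       hapax / V,
        "alpha_per_C":       sum(ch.isalpha() for ch in text) / C,
        "upper_per_C":       sum(ch.isupper() for ch in text) / C,
        "digit_per_C":       sum(ch.isdigit() for ch in text) / C,
        "space_per_C":       sum(ch.isspace() for ch in text) / C,
        "punct_per_C":       sum(ch in string.punctuation for ch in text) / C,
    }
```

Each document then becomes one vector of these numbers, and those vectors are what get fed to the SVM.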
Now. Are we having fun yet? I think we are! You know why I think we are? Because I've written enough to justify taking a break to go kill things again! Here I come, Wizardry 8!