helvirago | Project. Yes, some more.

(Cusum, continued)

Samples of text ranging in size from 10 to 100 sentences were taken from the data using a Perl script. Gargle. Um. See, here's the deal. I finally wrote code that would measure sentence length & 23l & ivw, get the mean, get the deviation from the mean, and do the cumulative sum, which I imagine I'll need to provide an example of. Then I had to figure out how to put together the files to plot and analyze.

So, using information from this other person's Master's project, I decided to use four "chunks" of text -- one "known", one "unknown", and then two more "known" -- so that the chunk to be attributed was in the first half, 'cause that's what he said. So how big were the chunks? Well, that's what I'm saying -- anywhere from 10 sentences each, for a file length of 40, to 100 each, for a file length of 400. Didn't really make any difference.

What I ended up with, trying to minimize the influence of fluctuations in style over time, was to have the chunks actually be made up of 50 sentences, 10 from each Natter file -- so you'd have 50 sentences over the first five Natters by user 376, then 50 from Natters 6 through 10 from the "unknown" file, then 120 from Natters 11 through 22 by user 376. If you see what I mean.
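Since an example of the cumulative sum is going to be needed anyway, here is a minimal sketch of the arithmetic, in Python rather than the original Perl. The function names, the crude sentence splitter, and the sample numbers are all made up for illustration; the same `cusum` would apply to the habit-word counts as well as to sentence length.

```python
import re
from itertools import accumulate

def sentence_lengths(text):
    """Count words per sentence, splitting on . ! ? (a crude assumption)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def cusum(values):
    """Running total of each value's deviation from the mean of all values."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))

lengths = [4, 6, 5, 9, 6]   # made-up sentence lengths; the mean is 6
print(cusum(lengths))       # [-2.0, -2.0, -3.0, 0.0, 0.0]
```

The last value always comes back to zero (give or take rounding), because the deviations from the mean cancel out over the whole sample.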

Then I smacked those puppies into Matlab and plotted 'em, then scaled 'em -- yeah, see, you've gotta scale, or you get nothing, even though it might seem like scaling would make the whole thing unreliable. But no, scaling it is. So that was the method for Cusum.
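The scaling step can be sketched too, though this is only a guess at one reasonable way to do it, not necessarily what the Matlab version did: stretch the habit-word cusum so its peak-to-trough range matches the sentence-length cusum, so the two lines can be laid over each other. The name `scale_to` and the numbers are hypothetical.

```python
def scale_to(reference, series):
    """Rescale `series` so its peak-to-trough range matches `reference`.

    Assumption: range matching is the scaling rule; other rules are possible.
    """
    ref_range = max(reference) - min(reference)
    ser_range = max(series) - min(series)
    if ser_range == 0:
        return list(series)          # flat line, nothing to stretch
    factor = ref_range / ser_range
    return [v * factor for v in series]

length_cusum = [-2.0, -2.0, -3.0, 0.0, 0.0]   # made-up values
habit_cusum = [-0.5, -1.0, -1.5, -0.5, 0.0]   # made-up values
print(scale_to(length_cusum, habit_cusum))    # [-1.0, -2.0, -3.0, -1.0, 0.0]
```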

1a. Weighted Cusum

Based on an article by some guys who I don't feel like looking up right now, I tried a modified version of cusum -- which is to say I took the same measurements, but then did numerical manipulation instead of graphical, um. Manipulation. Blah. Deal is, instead of measuring both sentence-length and 23ivw against the number of sentences, I expressed the 23ivw in terms of the sentence length. Yeah, it's kinda weird. Basically, they said... fucked if I know what they said, it's in my immense stack o' papers somewhere and I can't find it. Basically, they introduced the idea of a weighted cusum, so that

BLAH BLAH BLAH

Clearly I'll have to explain it at some point, but I really don't remember. I may just end up quoting.

So yeah, I used the same files as for the usual cusum, and then just did the math, and the math is kinda funky -- you find, like, "weight of text A - weight of text B" and divide it by the square root of "variance of A divided by number of words in A + variance of B divided by number of words in B" and then check that against a standard t-table to see if you've got enough evidence to contradict your null hypothesis, which is that the whole file is written by the same author. Yeah, see? Confusing. And that doesn't even get into how you figure out the variance of a list of cumulative deviations. Confusing as all hell.
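Written out as code, the statistic looks less funky. This is only a sketch of the formula as described above, reading the first term under the square root as "variance of A over number of words in A" by symmetry with the B term. It takes the weights and variances as given numbers and says nothing about how to get the variance in the first place. The function name and every input value are made up.

```python
import math

def weighted_cusum_t(weight_a, var_a, n_words_a, weight_b, var_b, n_words_b):
    """t statistic: difference in weights over its estimated standard error."""
    standard_error = math.sqrt(var_a / n_words_a + var_b / n_words_b)
    return (weight_a - weight_b) / standard_error

# Hypothetical numbers, purely to show the arithmetic.
t = weighted_cusum_t(0.50, 0.04, 400, 0.44, 0.05, 500)
print(round(t, 2))   # 4.24
```

The resulting `t` is what gets checked against the t-table; a large enough value is evidence against the null hypothesis that the whole file has one author.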

But the method part is pretty straightforward -- I did this, I used these formulas, I did the math. Moving on.

2. Zipping

Ah, yes, zipping. This is also somewhat controversial. Isn't that great? Care reported fair results for gzip and perfect results for Cabinet Manager, but since I can't afford Cabinet Manager we're going to ignore those difficult-to-believe results and concentrate on what we can do with freely available software. Like... what I already have on my computer or what's installed on the University server. Which is to say, Zip, RAR, and gzip. Or... see, this is where I'm a little confused, 'cause maybe they use different algorithms or maybe they all use the same one, and if they do there's no point in citing them all, but if there's any chance they don't it'll look good for me to say, "hey look, I used all these different ones..." Anyway.
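A rough sketch of the zipping idea as I understand it: compress a known author's text, compress it again with the unknown text stuck on the end, and see how many extra bytes the unknown text cost. Fewer extra bytes suggests a closer match. Python's `zlib` is DEFLATE, which is the algorithm behind gzip and the default method in Zip files, so those two are not really independent; `bz2` is a different algorithm and stands in here for "something that isn't DEFLATE". The function name and the toy texts are made up.

```python
import bz2
import zlib

def extra_bytes(known, unknown, compress=zlib.compress):
    """Bytes added to the compressed size when `unknown` is appended to `known`."""
    base = len(compress(known.encode("utf-8")))
    combined = len(compress((known + unknown).encode("utf-8")))
    return combined - base

# Toy stand-ins for two authors and one text of unknown authorship.
author_a = "the quick brown fox jumps over the lazy dog. " * 20
author_b = "colorless green ideas sleep furiously tonight. " * 20
mystery = "the quick brown fox jumps over the lazy dog. "

for name, fn in [("zlib", zlib.compress), ("bz2", bz2.compress)]:
    print(name, extra_bytes(author_a, mystery, fn), extra_bytes(author_b, mystery, fn))
```

With real text the differences are far smaller than in this toy case, which is presumably part of why the method is controversial.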

More after I pick up some drugs and food. Yay!

Anybody have any experience with driving cross-country for fun? Like how long it takes, how much time one should allot for driving per day to avoid falling asleep or having one's butt permanently fall asleep, how to figure out how much money, and (hardest of all) what one should do if one does not possess a car? Like, is it worth looking for somebody moving cross country, or might one be better off renting a car?