Friday, April 30, 2010

Entering the Twitterverse

Tagging handles in earnest indeed. In the meantime, I may tweet a few eureka moments about code or trolls.

While I was at it, I added an RSS feed.


Monday, April 26, 2010

Still Distracted but Turning a Big Corner

In the last post I was all jazzed about an opportunity to make the data collection from my "select sources" run much faster.

Well, I'm pretty much at the other end of it and I'm quite happy about it. I did a major rework of the code. It's much better than before and my only regret is that I wish I'd thought of this speedup sooner.

The old code base accomplished the collecting (almost 8,000 articles and 415,000 comments) but it did the job kind of slowly and that was even with some good speedups I had spotted early. But if I had had what I've got now, it would have gone much faster. Also I would have totally minimized data collection from HA itself. Oh well. Better late than never.

In the last post I mentioned that I used code that would automate a web browser to navigate the web and collect HA articles from select sources. I also mentioned that by bringing up one browser instance and feeding all the requests to it, you get a big gain in data collection performance.

The obvious speedup of course is to distribute requests across more browser instances - make it scale. But now we're talking about forking off objects from classes that encapsulate the browser-driving code. And once you create sub-processes you need a means to communicate requests to and retrieve results from those forked instances.

The tried-and-true method to accomplish this communication is to use a message queuing daemon. In a previous Ruby project I'd used a pretty slick one with a handy Ruby support library. No problem, mon - it's just a matter of finding the right places to insert this code. It was a lot of work but it made me take a hard look at the entire code base and throw away a lot of stuff that just wouldn't be needed anymore. And as I separated out code into Ruby modules by function, I was forced to come to a better understanding of instance variables. In my Repo class I totally overused class methods and was passing way too many parameters from method to method. Why? Way back in the beginning I thought it would be easier to test if I did it that way. Totally wrong. Thanks to this rework, the code in this project and my other projects will be so much better.
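Here's a rough sketch of that request/result pattern. It is not the project's code: Ruby's built-in Queue and a few threads stand in for the message queuing daemon and the forked browser instances, and fetch_article, collect and the worker count are all made-up names for illustration.

```ruby
require "thread"

# Hypothetical stand-in for the browser-driving code; returns fake content.
def fetch_article(article_number)
  "article #{article_number}"
end

# Distribute article requests across several workers and gather the results.
def collect(article_numbers, worker_count: 3)
  requests = Queue.new   # requests going out to the workers
  results  = Queue.new   # results coming back from the workers

  workers = Array.new(worker_count) do
    Thread.new do
      # Each worker pulls requests until it sees the :done marker.
      while (number = requests.pop) != :done
        results << [number, fetch_article(number)]
      end
    end
  end

  article_numbers.each { |n| requests << n }
  worker_count.times { requests << :done }   # one stop marker per worker
  workers.each(&:join)

  # Drain the results queue into a hash of article number => content.
  collected = {}
  collected.store(*results.pop) until results.empty?
  collected
end
```

With a real daemon the two queues would live outside the process, but the shape is the same: push requests in one side, pop results off the other.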

One of my select sources in the past really made me tear my hair out. I had only collected 4 or so articles from it because of all the trouble it gave me but now that's all changed. I've gotten much more than a handful from that source for March 2010 alone and I expect it to only get better. And there's yet another select source out there that I didn't even bother to try because it was so weird and different from the others. I'm feeling pretty confident now. I will give it a shot but I have to do a couple things first.

I'm going to start tagging handles in earnest. This is really critical to separating the trolls from the heroes and then being able to make big picture analyses of troll behavior over the months and the election cycles.

Then I've got to scratch an itch about current month activity. So far I'm caught up to the end of March 2010. Normally I wouldn't add a month to my collection until 10 days into the next month. Why? Goldy doesn't shut down comments to an article until 8 days or so after the article is published. So if I'm curious about some interesting activity in the current month, I have to develop a way to create a snapshot database with an eye on merging various snapshots into the real thing to save time when a month finally closes.

Wednesday, April 7, 2010

Opportunities, Distractions

One goal of this project was to minimize, as much as possible, data collection directly from HA itself. I would have been a lot farther along much sooner if I didn't care about this. Why is this important?

I not only want to slice and dice troll behavior but I also want to learn all these awesome Ruby-based software libraries. They've been a total joy to learn about and work with.

The data collection problem and the composition of its solution into classes has been pretty simple so far. There's a Session class whose job is to collect one month of HA's existence. The most important instance method of this class is add, which at its heart is a for loop that starts at the first article of the month and ends at the last. Each article of course contains the comments, which are the gold I'm after. But it's within the loop that the fun really begins.

I said I wanted to minimize data collection from the well. To do this I constructed a least cost route of sources from which to draw articles, with HA being the highest cost, the last resort. So I have a Repo class (short for repository) that's blessed from a Session instance. I feed it an article number and a source and it tries to get the article. If it comes up short then the Session instance tries the next source in the route.
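A minimal sketch of that Session/Repo arrangement, under some loud assumptions: the source names in ROUTE are invented, the Repo here looks articles up in a hash instead of touching the network, and the Session is handed its Repo rather than creating it. Only the class names, the add method and the "try the next source in the route" idea come from the description above.

```ruby
# Hypothetical sources, ordered cheapest first; :ha is the last resort.
ROUTE = [:mirror_a, :mirror_b, :ha].freeze

class Repo
  # `available` maps a source to the article numbers it can serve.
  # It stands in for real fetches in this sketch.
  def initialize(available)
    @available = available
  end

  # Returns the article text, or nil if this source comes up short.
  def fetch(article_number, source)
    return nil unless @available.fetch(source, []).include?(article_number)

    "article #{article_number} from #{source}"
  end
end

class Session
  attr_reader :articles

  def initialize(repo, route = ROUTE)
    @repo = repo
    @route = route
    @articles = {}
  end

  # Walk the month's articles; for each one, try the sources in route order
  # and stop at the first source that delivers.
  def add(first_article, last_article)
    (first_article..last_article).each do |number|
      @route.each do |source|
        text = @repo.fetch(number, source)
        next unless text

        @articles[number] = text
        break
      end
    end
    self
  end
end
```

An article that no source can deliver is simply left out of articles, so the gaps are easy to spot afterwards.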

So far pretty simple. Not rocket science. Fun? Only for a geek, right? Well it gets even better. Some of my sources require calling a simple HTTP GET, which is available from lots of Ruby HTTP libraries. This is how I get stuff, again as a last resort, from HA. Butt simple and too boring. But a few sources, I call them select sources, require something a little more sophisticated. They require driving a web browser to go and get the article. Without getting into too much detail:

Making this go was just too much freaking fun for words.

So, after fist pumping and watching my code drive a web browser to navigate the web and get the stuff I wanted from it I quickly realized I'd done something a little dumb:

My code was opening and closing the web browser for each request to the web. Ugh. It's a lot faster, Sherlock, to just leave one browser instance up, feed it all the requests you want and then close it when the session is concluded. Another hour or two of searching for the right spot in the code to insert the needed changes and I had a big increase in performance.
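The difference is easy to show with a toy. FakeBrowser below is a hypothetical stand-in that only counts how many times it gets launched; it is not the browser automation library the project uses, and both fetch methods are invented for the comparison.

```ruby
# Hypothetical stand-in for a browser driver; it only counts launches.
class FakeBrowser
  @launches = 0
  class << self
    attr_accessor :launches
  end

  def initialize
    self.class.launches += 1
  end

  def get(url)
    "page at #{url}"
  end

  def close; end
end

# Slow: a fresh browser is opened and closed for every request.
def fetch_each_time(urls)
  urls.map do |url|
    browser = FakeBrowser.new
    begin
      browser.get(url)
    ensure
      browser.close
    end
  end
end

# Faster: one browser serves the whole session and is closed at the end.
def fetch_with_one_browser(urls)
  browser = FakeBrowser.new
  urls.map { |url| browser.get(url) }
ensure
  browser&.close
end
```

Both methods return the same pages; the second one just pays the browser startup cost once instead of once per request.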

Which is all a long winded introduction to the point of this post - there's yet ANOTHER opportunity (and yep, distraction from my initial goal) to make this puppy run even faster. Next post. I'm gonna code.

Tuesday, April 6, 2010

QA winds down. HA's most commented posts.

To kind of, sort of check on the quality of the data collection process, I whipped up a report of the most commented articles of each month of HA's existence. Then here and there I went to the well to see if what I have and what HA has match up.

I've found tiny variances in, strangely enough, both directions. I'll definitely have to look more thoroughly at those months where I seem to have MORE than what Goldy has. But so far the variances seem to be made mostly of spam.

Once I had that report working, I tweaked it to select one article at random from each month. Again, things seem to match up just fine. Even better than the most commented articles sample.
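Both versions of the report boil down to a group-by on the month. A sketch, assuming the data has already been reduced to rows of [year_month, article_number, comment_count]; the row layout and both method names are my own for this example, not the project's.

```ruby
# Rows are [year_month, article_number, comment_count].

# The most commented article of each month, in month order.
def most_commented_per_month(rows)
  rows.group_by { |year_month, _, _| year_month }
      .map { |_, month_rows| month_rows.max_by { |_, _, count| count } }
      .sort_by(&:first)
end

# One article picked at random from each month, in month order.
# Pass a seeded Random to make the pick repeatable.
def random_article_per_month(rows, random: Random.new)
  rows.group_by { |year_month, _, _| year_month }
      .map { |_, month_rows| month_rows.sample(random: random) }
      .sort_by(&:first)
end
```

Either list can then be checked by hand against what HA itself shows for those articles.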

So confidence is high.

Anyhow the most commented posts merit further mention. Here's a pretty picture:

The first column is the year/month, the second is the article number and the third is the total number of comments for the thread. The most commented HA blog post, in March of 2005, was a knock-down, drag-out thread about Terri Schiavo - a major body-blow for the extreme right wing. Between trying to weaken Social Security, using Terri Schiavo's vegetative body as a prop, letting New Orleans drown, corruption/sex scandals, torture, dollar black hole wars of choice and the final, ugly meltdown of the economy - let us never forget what it means to have the right wing in control of this country.

Monday, April 5, 2010


One half of one percent of commenter handles, 80, belong to an elite group. They are handles each associated with over 1000 comments posted throughout the lifetime of HA. I posted the top 10 a while back.

Browsing the list of 80, all the names are quite familiar to me, troll and hero alike, and as the picture above shows, they account for over 62 percent of the comments posted. They represent HA's hard-core community of political junkies and hangers-on and they include the most dedicated and unrepentant of right wing trolls.

A somewhat broader community of 306 handles have posted somewhere between 100 and 999 comments each but they only account for a little over 22 percent of all comments. Quite a few of them may one day soon pass the 1000 mark. Many of them enjoy participating in the HA brouhaha but are either fairly new to it, have dropped away or HA is far from a priority for them.

A modestly more numerous bunch have posted between 10 and 99 comments. I speculate that many of them are again new to the community or they may have started, felt lost in the crowd (or disgusted) and then quickly dropped away.

Lastly there's a group that's very interesting to me. They account for over 88 percent of the 15,078 handles recorded and yet each handle accounts for fewer than 10 comments over the life of the blog. Quite a few of them, if not most of them, are surely just spammers. Many others are just hit and run commenters, responding to a provocative blog post that's gotten some traction in the wider blogosphere and elsewhere.

But this group also contains some of the most vicious, nasty and vindictive trolls of all.

This group includes the HNMT and members of The Hit Squad.

Late thought: the big lesson to draw is that if I concentrate on 1,748 handles and tag them, then I've in turn categorized over 94 percent of the comments. Pretty useful little report. And I just picked the brackets on a hunch.

Bit of an update: The 88 percent group of handles mentioned above also includes a lot of people making fun of the trolls by doing a variation on their handle. Goldy calls his comment threads "the cesspool". Sue me, I'm a part of it. But this 88 percent group of handles which accounts for less than 6 percent of all the comments I call "the swamp". There's some fun to be had there but only as long as you don't wade in for too long.
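For the curious, the bracket report amounts to something like the sketch below. The four brackets are the ones used in this post; the input format (a hash of handle to total comment count) and the names BRACKETS and bracket_report are assumptions made for the example.

```ruby
# Brackets as in the post: 1000+, 100-999, 10-99 and under 10 comments.
BRACKETS = {
  "1000+"    => (1000..),
  "100-999"  => (100..999),
  "10-99"    => (10..99),
  "under 10" => (0..9)
}.freeze

# `counts` maps a handle to its total number of comments.
# Returns, per bracket, the number of handles and their share of all comments.
def bracket_report(counts)
  total = counts.values.sum.to_f
  BRACKETS.to_h do |label, range|
    members = counts.select { |_, n| range.cover?(n) }
    share = total.zero? ? 0.0 : (100 * members.values.sum / total).round(1)
    [label, { handles: members.size, percent_of_comments: share }]
  end
end
```

Changing the hunch is just a matter of editing the ranges in BRACKETS and running it again.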

Saturday, April 3, 2010

Tagging Handles

Trolls and Heroes adopt many aliases over the course of their on-line existence. To separate them from one another and make searching and analysis easier I'm going to stamp each comment with a tag. No, not all umpteen thousands of them one by one - I'll just associate the 15,078 handles (and growing) with a more coherent tag, e.g. puddybud for all the versions of puddybud's handle, and that will in turn stamp his legacy of over 26 thousand comments.

The tags will live in a separate file, a hash whose key is the tag and whose value is a category:

h - hero
t - troll
ul - a lefty who hurts the cause by getting too much like a righty.
rr - a right winger whose tone is reasonable most of the time. (Very rare at HA.)
s - spammer
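Roughly what I have in mind, as a sketch. The tags and handle variants below are made up for illustration (no real handles are categorized here), and the second hash, which maps each handle variant to its tag, is an assumption about how the handles get associated with tags.

```ruby
# tag => category code (h, t, ul, rr or s). Sample tags only.
TAGS = {
  "sample_hero"  => "h",
  "sample_troll" => "t"
}.freeze

# handle variant => tag. Every alias of a commenter points at one tag.
HANDLE_TO_TAG = {
  "SampleHero"     => "sample_hero",
  "Sample Troll"   => "sample_troll",
  "sample_troll_2" => "sample_troll"
}.freeze

# Stamp each comment (a hash with a :handle key) with its tag and category.
# Comments from untagged handles get nil for both.
def stamp(comments)
  comments.map do |comment|
    tag = HANDLE_TO_TAG[comment[:handle]]
    comment.merge(tag: tag, category: tag && TAGS[tag])
  end
end
```

Tagging a new alias is then a one-line addition to the handle hash, and every comment under that alias picks up the category for free.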

In other news I'm torn a bit about using Sinatra over Rails for the QA portion of this project. Rails 3 is worth a look because it's new and much more flexible. I'll probably use both for portions of the remainder of this project.

After all, most of the motivation and benefit of this project has been from learning and applying all these wonderful Ruby-based software development frameworks.

Late update: Do you believe in synchronicity? I just found this out minutes before I published this post. The author of the command/tasking framework that I use to develop my reports on this project is heavily motivated by the concept of machine tagging which is used by Flickr and other projects. For now I'm sticking with the simple tagging implementation I've sketched here but I may have to revisit this more thoroughly later.

Thursday, April 1, 2010

MTR's blog

When bet-welshing troll Mark the Redneck arrived on the scene, he bragged he had a blog. I remembered him crowing about it and I had even read it once or twice but I forgot the URL and could never find it again through Google searches.

Well here comes our database yet again to the rescue. The blog probably went quiet shortly after MTR's boast but its ghost haunts our dim memory through the wayback machine:

For the most die-hard troll aficionados only.