Monday, April 26, 2010

Still Distracted but Turning a Big Corner

In the last post I was all jazzed about an opportunity to make the data collection from my "select sources" run much faster.

Well, I'm pretty much at the other end of that effort now and I'm quite happy with the results. I did a major rework of the code. It's much better than before, and my only regret is that I didn't think of this speedup sooner.

The old code base got the collecting done (almost 8,000 articles and 415,000 comments), but it did the job kind of slowly, and that was even with some good speedups I had spotted early on. If I'd had what I've got now, the run would have gone much faster, and I would have pulled far less data from HA itself in the process. Oh well. Better late than never.

In the last post I mentioned that I use code that automates a web browser to navigate the web and collect HA articles from select sources. I also mentioned that bringing up one browser instance and feeding all of the requests to it is the way to get the most data collection performance out of a single browser.
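
To make that concrete, here is a minimal sketch of the "one browser, many requests" idea. It assumes the Watir gem for the browser driving (the actual library isn't named here), and ArticleFetcher, urls, and save_article are just placeholder names:

    require 'watir'

    # One browser, started once, reused for every article request.
    class ArticleFetcher
      def initialize
        @browser = Watir::Browser.new :firefox
      end

      def fetch(url)
        @browser.goto(url)
        @browser.html        # hand the raw page back to the parsing code
      end

      def shutdown
        @browser.close
      end
    end

    # Feed a whole batch of article URLs through the single instance.
    fetcher = ArticleFetcher.new
    urls.each { |url| save_article(fetcher.fetch(url)) }
    fetcher.shutdown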

The obvious speedup, of course, is to distribute requests across more browser instances - make it scale. But now we're talking about forking off sub-processes, each running its own instance of the class that encapsulates the browser-driving code. And once you create sub-processes, you need a way to send requests to those forked instances and get results back from them.
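
In Ruby terms the scaling part is just forking workers, each with its own browser. Here is a rough sketch that builds on the ArticleFetcher placeholder above (the worker count is made up); the missing piece is how each worker gets its URLs and hands back its results, which is exactly the communication problem:

    WORKER_COUNT = 4

    pids = WORKER_COUNT.times.map do
      fork do
        fetcher = ArticleFetcher.new   # each worker owns one browser
        # ...pull article requests from somewhere, push results back somewhere...
        fetcher.shutdown
      end
    end

    pids.each { |pid| Process.wait(pid) }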

The tried and true method for this kind of communication is a message queuing daemon. In a previous Ruby project I'd used a pretty slick one with a handy Ruby support library. No problem mon - it's just a matter of finding the right places to insert this code (a rough sketch of what that plumbing looks like follows below).

It was a lot of work, but it made me take a hard look at the entire code base and throw away a lot of stuff that just wouldn't be needed anymore. And as I separated the code into Ruby modules by function, I was forced to come to a better understanding of instance variables. In my Repo class I had totally overused class methods and was passing way too many parameters from method to method. Why? Way back in the beginning I thought it would be easier to test that way. Totally wrong. Thanks to this rework, the code in this project and my other projects will be so much better.
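
Here's that sketch. The daemon and gem aren't named above, so purely as an illustration this uses beanstalkd with the beaneater gem; the tube names and the JSON payload shape are made up:

    require 'beaneater'
    require 'json'

    queue    = Beaneater.new('localhost:11300')
    requests = queue.tubes['fetch-requests']
    results  = queue.tubes['fetch-results']

    # Parent process: enqueue one job per article URL.
    urls.each { |url| requests.put({ 'url' => url }.to_json) }

    # Inside each forked worker: reserve a request, drive the browser
    # (fetcher is the worker's ArticleFetcher from the earlier sketch),
    # push the result back, and delete the finished job.
    loop do
      job = requests.reserve
      url = JSON.parse(job.body)['url']
      results.put({ 'url' => url, 'html' => fetcher.fetch(url) }.to_json)
      job.delete
    end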

One of my select sources really made me tear my hair out in the past. I had only collected 4 or so articles from it because of all the trouble it gave me, but now that's all changed: I've gotten much more than a handful from that source for March 2010 alone, and I expect it to only get better. And there's yet another select source out there that I didn't even bother to try because it was so weird and different from the others. I'm feeling pretty confident now, so I will give it a shot, but I have to do a couple of things first.

I'm going to start tagging handles in earnest. This is really critical to separating the trolls from the heroes, and then to being able to make big-picture analyses of troll behavior over the months and the election cycles.

Then I've got to scratch an itch about current-month activity. So far I'm caught up through the end of March 2010. Normally I wouldn't add a month to my collection until 10 days into the next month. Why? Goldy doesn't shut down comments on an article until 8 days or so after the article is published. So if I'm curious about some interesting activity in the current month, I have to develop a way to create a snapshot database, with an eye toward merging the various snapshots into the real thing to save time when a month finally closes.
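
Just to sketch the merge idea: assuming the data lives in SQLite (the file and table names below are made up), attaching a snapshot and copying over only the new rows could look like this. The OR IGNORE relies on the tables having primary keys, so anything already collected gets skipped:

    require 'sqlite3'

    db = SQLite3::Database.new('ha_main.db')
    db.execute("ATTACH DATABASE 'ha_snapshot_2010_04.db' AS snap")

    # Copy over only what the main database hasn't seen yet.
    db.execute('INSERT OR IGNORE INTO articles SELECT * FROM snap.articles')
    db.execute('INSERT OR IGNORE INTO comments SELECT * FROM snap.comments')

    db.execute('DETACH DATABASE snap')
    db.close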
