Wednesday, April 7, 2010

Opportunities, Distractions

One goal of this project was to pull as little data as possible directly from HA.org itself. I would have been a lot farther along much sooner if I didn't care about this. Why is this important?

I not only want to slice and dice troll behavior, but I also want to learn all these awesome cool Ruby-based software libraries. They've been a total joy to learn about and work with.

The data collection problem and the composition of its solution into classes have been pretty simple so far. There's a Session class whose job is to collect one month of HA.org's existence. The most important instance method of this class is add, which at its heart is a for loop that starts at the first article of the month and ends at the last. Each article of course contains the comments, which are the gold I'm after. But it's within the loop that the fun really begins.

I said I wanted to minimize data collection from the well, HA.org. To do this I constructed a least-cost route of sources from which to draw articles, with HA.org being the highest cost, the last resort. So I have a Repo class (short for repository) that's blessed from a Session instance. I feed it an article number and a source and it tries to get the article. If it comes up short, then the Session instance tries the next source in the route.
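For the curious, here is a rough sketch of how a least-cost route like that could be wired up. It is not the real code: only Session, Repo, and add come from the description above. The Source struct, the cost numbers, the hits counter, and the in-memory stand-in sources are all made up for illustration, and nothing here touches the network.

```ruby
# Hypothetical sketch. Source, its fields, and the cost values are
# assumptions; only the Session/Repo/add names come from the post.
Source = Struct.new(:name, :cost, :fetcher)

class Repo
  # Try to get one article from one source.
  # Returns the article body, or nil if the source comes up short.
  def get(article_number, source)
    source.fetcher.call(article_number)
  rescue StandardError
    nil # a failing source counts as a miss, so the next one gets tried
  end
end

class Session
  attr_reader :articles, :hits

  def initialize(sources)
    @route = sources.sort_by(&:cost) # cheapest first, HA.org last
    @repo = Repo.new
    @articles = {}
    @hits = Hash.new(0)              # how many articles each source supplied
  end

  # Walk from the first article to the last, trying each source in
  # route order and stopping at the first one that has the article.
  def add(first_article, last_article)
    (first_article..last_article).each do |number|
      @route.each do |source|
        body = @repo.get(number, source)
        next if body.nil?
        @articles[number] = body
        @hits[source.name] += 1
        break
      end
    end
    self
  end
end

# Stand-in sources backed by hashes instead of real fetches.
cached   = { 1 => "cached 1" }
mirrored = { 2 => "mirror 2" }

cache  = Source.new(:local_cache, 1,   ->(n) { cached[n] })
mirror = Source.new(:mirror,      5,   ->(n) { mirrored[n] })
origin = Source.new(:ha_org,      100, ->(n) { "origin #{n}" })

session = Session.new([origin, mirror, cache]).add(1, 3)
```

In this toy run only article 3 falls all the way through to the stand-in for HA.org, which is the whole point of ordering the route by cost.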

So far pretty simple. Not rocket science. Fun? Only for a geek, right? Well, it gets even better. Some of my sources only require a simple HTTP GET, which is available from lots of Ruby HTTP libraries. This is how I get stuff, again as a last resort, from HA.org. Butt simple and too boring. But a few sources, I call them select sources, require something a little more sophisticated. They require driving a web browser to go and get the article. Without getting into too much detail:


Making this go was just too much freaking fun for words.

So, after fist pumping and watching my code drive a web browser around the web and fetch the stuff I wanted, I quickly realized I'd done something a little dumb:

My code was opening and closing the web browser for each request to the web. Ugh. It's a lot faster, Sherlock, to just leave one browser instance up, feed it all the requests you want, and then close it when the session is concluded. Another hour or two of searching for the right spot in the code to insert the needed changes and I had a big increase in performance.
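The shape of that fix, roughly, looks like the sketch below. Everything in it is a stand-in: FakeBrowser is a dummy class that just counts how many times it gets launched, and BrowserSession and the URLs are hypothetical names, not anything from a real browser-driving library. The idea is simply to open the browser once, run every request through it, and let an ensure clause close it at the end.

```ruby
# Hypothetical sketch. FakeBrowser stands in for a real driven browser
# so the example runs anywhere; it only counts launches.
class FakeBrowser
  @launches = 0
  class << self
    attr_accessor :launches
  end

  def initialize
    self.class.launches += 1 # each new instance is one (expensive) launch
    @open = true
  end

  def fetch(url)
    raise "browser is closed" unless @open
    "<html>#{url}</html>"
  end

  def close
    @open = false
  end
end

class BrowserSession
  # Launch one browser, hand a session to the block, and always close
  # the browser afterwards, even if the block raises.
  def self.open(browser_class = FakeBrowser)
    browser = browser_class.new
    yield new(browser)
  ensure
    browser&.close
  end

  def initialize(browser)
    @browser = browser
  end

  def get(url)
    @browser.fetch(url)
  end
end

urls = %w[article/1 article/2 article/3]

# One launch serves all three requests.
pages = BrowserSession.open do |session|
  urls.map { |url| session.get(url) }
end
```

The slow version would have called FakeBrowser.new inside the map block, one launch per URL.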

Which is all a long winded introduction to the point of this post - there's yet ANOTHER opportunity (and yep, distraction from my initial goal) to make this puppy run even faster. Next post. I'm gonna code.
