SQL New Blogger Challenge Weekly Digest
Watching all of the tweets as people posted their first entries in the SQL New Blogger Challenge earlier this week, I quickly realized that keeping up was going to be a challenge of its own. Fortunately, there are ways to reign it in.
My first stop was IFTTT (If This Then That). IFTTT allows you to create simple “recipes” to watch for specific events/conditions, then perform an action. They have over 175 “channels” to choose from, each of which has one or more triggers (events) and actions. I have IFTTT linked to both my Google and Twitter accounts, which allowed me to create a recipe which watches Twitter for the #sqlnewblogger hashtag, and writes any tweets that match it to a spreadsheet on my Google Drive account (I’ll make the spreadsheet public for now, why not?).
The next step is to export the spreadsheet to CSV. I don’t have this automated, and may not be able to (I may have to find another workaround). Once it’s a CSV, I can go to PowerShell to parse my data. I want the end result to be an HTML table showing each post’s author (with a link to their Twitter stream) and a link to the post (using the title of the post itself).
Once I import the CSV file into an object in my PowerShell script, I need to do some filtering. I don’t want to be collecting all the retweets (posts starting with RT
), and I should probably exclude any post that doesn’t contain a URL (looking for the string HTTP
).
To extract the post URLs, I ran a regular expression against each tweet. Twitter uses their own URL shortener (of course), which makes this pretty easy - I know the hostname is t.co
, and after the slash is an alphanumeric string. The regex to match this is fairly simple: [http|https]+://t.co/[a-zA-Z0-9]+
Then, for each URL found in the tweet, I use Invoke-WebRequest
to fetch the page. This cmdlet automatically follows any HTTP redirects (I was afraid I’d have to do this myself), so the object returned is the real destination page. Invoke-WebRequest
also returns the parsed HTML of the page (assuming you use the right HTTP method), so I can extract the title easily instead of having to parse the content myself. It’ll also give me the “final” URL (the destination reached after all the redirects). Easy!
My full script:
|
|
And now, my digest of the first week of the SQL New Blogger Challenge. This is not a complete listing because I didn’t think to set up the IFTTT recipe until after things started. I also gave people the benefit of the doubt on the timing (accounting for timezones, etc.) and included a few posted in the early hours of April 8th. For week 2, it will be more complete.
Limitations
There are a couple limitations and gotchas with this process:
- The IFTTT recipe only runs every 15 minutes (all IFTTT triggers run on 15 minute intervals) and only fetches 15 tweets each time it runs (again, IFTTT’s configuration). So if there’s a flood of tweets, they won’t all be captured.
- I don’t really check the final destination of a link. For example, one of the first tweets captured contained a link to another tweet, which then linked to a blog post. Could I detect this & resolve the true final destination? Probably. But it’d involve a lot more logic, and it’s getting late.
- I also don’t pick up every duplicate link/post. Again, I can probably get by this with some extra logic, but I don’t think it’s necessary right now.
- It doesn’t post automatically to my blog, or anywhere else. I have to manually paste the HTML into my blog post(s).
- I had to manually remove one link as it didn’t actually point to a post written for the challenge; it was tweeted with the hashtag, so my IFTTT recipe caught it.
- I started collecting these tweets mid-day April 7th. If you posted before that, I’ve likely missed your post. You will be picked up for Week Two!