Setting Up Sidekiq With Cron Jobs

For our final projects at the Flatiron School, we broke up into groups and developed web applications that related in some way to the school. Groups built an application to help potential students apply to Flatiron, a CRM program to help potential employers connect with students, and a new and improved version of Piazza (the Blackboard-like class website we use). You can see all these projects and more at our Science Fair on Thursday, August 13.

My group worked on Teamline, a digital storytelling app that aggregates, filters, and displays student-generated data sources. This included implementing multi-provider authorization using Omniauth; collecting and parsing user data from the GitHub, Twitter, and Feedzirra APIs; creating cron jobs to regularly pull data using the Whenever gem; and setting up asynchronous background processing to perform those jobs using Sidekiq and Redis.

That last one is what I want to talk about here. My classmate Matt Schmaus asked me for advice about how to implement regular scraping of high-volume data sources, and I suggested that he take the same approach I did (described below).

My challenge was to pull data from Twitter on a regular schedule and do that work in the background. That’s why Twitter shows up throughout this code, and why I chose to combine cron jobs with background processing.

Creating Cron Jobs Using Whenever

First I needed to set up a regular time for my site to go and check whether there were new tweets by students whose feeds we were parsing. To do this, I set up a cron job. Learn more about these in [Railscast #164](http://railscasts.com/episodes/164-cron-in-ruby).

Cron jobs generally look something like this:

1 0 * * * printf "" > /var/log/apache/error_log

0 */2 * * * /home/username/test.pl

In other words: clear the error log at 12:01 AM each day, and run test.pl every two hours (thank you, Wikipedia).
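To make the five time fields concrete, here is a small plain-Ruby sketch that labels each part of a cron line. The `describe_cron` helper and the `CRON_FIELDS` constant are hypothetical names written just for this illustration; they are not part of cron or of any gem.

```ruby
# Hypothetical helper for illustration only: splits a crontab line into its
# five time fields and the command that follows them.
CRON_FIELDS = %w[minute hour day_of_month month day_of_week].freeze

def describe_cron(line)
  parts = line.split(' ', 6) # five time fields, then the rest is the command
  fields = CRON_FIELDS.zip(parts.first(5)).to_h
  fields.merge('command' => parts[5])
end

entry = describe_cron('0 */2 * * * /home/username/test.pl')
entry.each { |name, value| puts "#{name}: #{value}" }
```

Reading the fields left to right (minute, hour, day of month, month, day of week) is usually enough to decode any cron line you come across.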

Luckily, in Ruby we have a gem that can write these for you. I used Whenever, which provides a clear, readable Ruby syntax for defining cron jobs.

To use Whenever:
* 1. Add the gem to your Gemfile: gem 'whenever', :require => false
* 2. Run bundle install
* 3. cd into your app in the terminal and run $ wheneverize . to create a schedule.rb file in your config folder
* 4. Customize your schedule.rb file to perform the cron jobs you want

Here’s the one I used:

every 1.minute do
  rake "import_tweets:tweets"
end

Pretty readable: every minute, run this rake task. (Once the schedule is saved, running $ whenever --update-crontab writes it to your crontab.)

That brings us to my rake task.

Creating a Custom Rake Task That Calls A Sidekiq Worker

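require 'twitter'
require_relative '../../app/workers/tweet_scrape_worker'

namespace :import_tweets do
  desc "Queue a Sidekiq job to scrape each student's tweets"
  task :tweets => :environment do
    # find_each loads students in batches instead of all at once
    Student.find_each do |student|
      # Enqueue only the id; the worker looks the student up itself
      TweetScrapeWorker.perform_async(student.id)
      sleep 1 # space out the jobs to stay under Twitter's rate limits
    end
  end
end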

What I’ve done here is create my own custom rake task, which is actually pretty easy. Beginner Rails developers are familiar with rake tasks such as rake db:migrate, where db is the namespace and migrate is the task. You can set up your own namespaces and tasks the same way. You namespace your task, as I’ve done by writing namespace :import_tweets do, and then you create a task inside that block by writing task :tweets => :environment do. Then, for this specific use case, I’ve gone through each student in our Student class and, for each one, called “perform_async” on the TweetScrapeWorker (more on that in a moment), passing in the student’s id. You’ll see that I’ve also included sleep 1.

So what does this all mean?

  • Starting with the last part of this code, I’ve included sleep 1 to create a pause between queuing each student’s job, so we don’t hit Twitter rate limits when polling the site.
  • namespace :import_tweets and task :tweets are pretty straightforward. They’re just a way for me to call $ rake import_tweets:tweets in my terminal to perform this task on my app.
  • TweetScrapeWorker.perform_async(student.id): this is a bit more complex. It leads me to my next tool, Sidekiq…
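If the namespace and task syntax still feels abstract, here is a minimal sketch you can run outside of Rails, assuming the rake gem is installed. The `DemoTasks` module and the `demo:prepare` and `demo:greet` tasks are made-up names for illustration; they are not part of my app.

```ruby
require 'rake'

# Demo namespace and task names below are made up for illustration.
module DemoTasks
  extend Rake::DSL # gives this module the `namespace` and `task` methods

  LOG = []

  namespace :demo do
    task :prepare do
      LOG << 'prepare'
    end

    # `:greet => :prepare` means "run :prepare first", the same way
    # `:tweets => :environment` loads the Rails environment first.
    task :greet => :prepare do
      LOG << 'greet'
    end
  end
end

# Equivalent to running `rake demo:greet` from the terminal.
Rake::Task['demo:greet'].invoke
puts DemoTasks::LOG.inspect
```

The hash-rocket in the task definition is just a dependency declaration: the task on the right runs before the task on the left.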

Sidekiq

Sidekiq is efficient background processing for Ruby. Learn more about it in Railscast #366. There are several steps involved in setting up Sidekiq, and I would highly recommend watching the Railscast over and over again until you understand the concept and execution. One note for beginners: don’t get too lost (right now) in the last part of this episode, where Ryan Bates talks about multithreading, concurrency, and thread safety. This is important and interesting stuff, but perhaps not your top priority when you’re just starting out. Also note that I was able to set this up without using Celluloid, which provides a Ruby-esque, object-oriented way of creating concurrency in Ruby.

So, going back to my code, the first thing you see inside the rake task is that I’ve called a method named “perform_async” on a “TweetScrapeWorker”. In Sidekiq, you define workers to do different jobs for you, and then send those jobs to a queue, so that a worker can pick each one up in the background whenever it is free. After including the ‘sidekiq’ gem and the dependencies described in the Railscast, you’ll also want to create a folder in your app directory called “workers”, where you can save your worker classes. I’ll go through my worker class in a moment. You call “perform_async” (a Sidekiq convention) to send your jobs to the Sidekiq queue.
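To picture what “sending a job to a queue” means, here is a toy model in plain Ruby. It is only a sketch of the idea, not how Sidekiq is implemented: `ToyWorker`, `TOY_QUEUE`, and `PrintWorker` are hypothetical names, and an in-memory `Queue` stands in for Redis.

```ruby
require 'json'

# Toy stand-in for the Redis-backed queue; NOT how Sidekiq is implemented.
TOY_QUEUE = Queue.new

module ToyWorker
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Store only the class name and JSON-friendly arguments, then return.
    def perform_async(*args)
      TOY_QUEUE << JSON.generate('class' => name, 'args' => args)
    end
  end
end

class PrintWorker
  include ToyWorker

  def perform(student_id)
    "scraped tweets for student #{student_id}"
  end
end

PrintWorker.perform_async(7) # returns right away; nothing has run yet

# Later, a separate worker loop pops the job and runs it.
job = JSON.parse(TOY_QUEUE.pop)
result = Object.const_get(job['class']).new.perform(*job['args'])
puts result
```

The point to notice is the split: the caller only describes the job, and the actual work happens later when something pops that description off the queue and calls perform.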

You’ll also note that I’ve passed in “student.id”. This relates to the server Sidekiq uses to hold the queue of background jobs, which is Redis. Redis is an in-memory key-value database server. Sidekiq serializes each job’s arguments to JSON before storing them in Redis, which means you should pass in simple data like the id of a student object rather than the whole object. To install Redis, use brew install redis and then start it up in your terminal using this command: $ redis-server /usr/local/etc/redis.conf.
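Here is a quick way to see why the id is the safer thing to pass. This sketch only uses Ruby’s standard json library to mimic the round trip that job arguments make; `DemoStudent` is a hypothetical stand-in for my ActiveRecord Student model.

```ruby
require 'json'

# DemoStudent is a hypothetical stand-in for the ActiveRecord Student model.
DemoStudent = Struct.new(:id, :twitter_handle)
student = DemoStudent.new(42, 'example_handle')

# An integer id survives the trip through JSON unchanged.
id_after = JSON.parse(JSON.generate([student.id])).first

# A whole object does not: it comes back as a plain string, not a student.
object_after = JSON.parse(JSON.generate([student])).first

puts id_after.inspect
puts object_after.class
```

With just the id, the worker can call Student.find and get a fresh copy of the record at the moment the job actually runs.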

That sums up the code I’ve written in my rake task, so now let’s go to what’s being called in my worker class.

Worker Class

class TweetScrapeWorker
  include Sidekiq::Worker
  
  def perform(student_id)
    student = Student.find(student_id)
    TweetScraper.new(student).scrape_feed
  end
end

I’ve created a CamelCased class whose name ends with “Worker” to follow Sidekiq worker class conventions. Then I’ve included the Sidekiq::Worker module, which mixes the Sidekiq functionality into this class. As part of this convention, I’ve defined a perform method (which Sidekiq looks for) that takes the student id. Inside it, I look up the student, create a new instance of the TweetScraper model I’ve set up in my models directory, passing in that student, and then call the scrape_feed method on that instance. All you really need to understand here as a beginner is which conventions you get from Sidekiq.

My Tweet Scraper Model

class TweetScraper
  attr_accessor :student

  def initialize(student)
    @student = student
  end

  def scrape_feed
    # Ask for up to 200 of the student's most recent tweets
    Twitter.user_timeline(student.twitter_handle, :count => 200).each do |tweet|
      # Skip tweets we have already stored
      next if Tweet.exists?(:tweet_id => tweet.id)

      tweet_post = Tweet.new
      tweet_post.student_id = student.id
      tweet_post.tweet_id = tweet.id
      tweet_post.tweet_content = tweet.text
      tweet_post.tweet_published_at = tweet.created_at
      tweet_post.profile_image_url = tweet.profile_image_url
      tweet_post.save!
    end
  end
end

This is the easy part. I make calls to the Twitter API (which I’ve authenticated with using Omniauth elsewhere) using the Twitter gem to get tweets and metadata about them. One thing you might note if you are new to scraping is how I’ve set it up so that if a tweet has been scraped before, my program recognizes its unique id and does not save it again. With high-volume scraping, this is pretty important. And don’t forget to call tweet_post.save!
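The “only save it if we haven’t seen this id” idea can be tried out without a database. In this sketch, `DemoTweetStore` is a hypothetical in-memory stand-in for my tweets table, and the two hashes in `timeline` are made-up sample data.

```ruby
# Hypothetical in-memory stand-in for the tweets table, keyed by tweet id.
class DemoTweetStore
  def initialize
    @rows = {}
  end

  # Saves the tweet only if its id is new; returns true when a row was added.
  def import(tweet)
    return false if @rows.key?(tweet[:id])
    @rows[tweet[:id]] = tweet
    true
  end

  def count
    @rows.size
  end
end

store = DemoTweetStore.new
timeline = [{ :id => 1, :text => 'first' }, { :id => 2, :text => 'second' }]

first_run  = timeline.count { |tweet| store.import(tweet) }
second_run = timeline.count { |tweet| store.import(tweet) } # same tweets again
puts "added #{first_run}, then #{second_run}; stored #{store.count}"
```

Because the check is keyed on the tweet’s own id, running the import every minute over the same timeline does not pile up duplicates.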

Last thing
Remember that when running your app in development, you need to run the Redis server and Sidekiq alongside rails server. You can also check on your Sidekiq activity, and whether everything is working, at the ‘/sidekiq’ route once you require 'sidekiq/web' and mount it in your routes.rb file like so: mount Sidekiq::Web, at: '/sidekiq'.

The End


And in this way, you can create cron jobs easily in Ruby using Whenever, and perform them regularly in the background using Sidekiq with Redis. Happy scraping, Matt (and others)! Please post in the comments if you have any questions or if there’s anything I can clear up.
