I recently needed to write some raw SQL in the strength training app I'm building (Strengthify), as Active Record's query interface wasn't robust enough for my use case: building a modest analytics interface.

Using the AR query interface is great for simple select queries, and it makes grouping, ordering, and specifying filter conditions easy and concise. It's a good abstraction over SQL syntax, but it starts getting in the way when you need more complex queries with multiple joins and lesser-used SQL features like generate_series, window functions, array_agg, and so on.

In building Strengthify, one of the main metrics I'm using to measure training progress is Load Volume. Load Volume is a good approximation of the intensity of a strength workout, and appears to be an accurate measurement of a person's strength (PubMed Article). For each exercise, it's the product of the number of sets, the weight lifted per rep, and the number of reps per set, summed across all exercises in the workout. For example:

4 sets x 130lbs Bench Press x 15 reps/set = 7,800 load-volume
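
To make that concrete in SQL (Strengthify's actual schema isn't shown here, so assume a hypothetical sets table with one row per set performed), the per-workout figure is a single aggregate:

-- Hypothetical sets table: one row per set, recording the weight used
-- and the reps performed. Summing weight * reps across a workout's sets
-- gives its load-volume.
SELECT workout_id, sum(weight * reps) as load_volume
FROM sets
GROUP BY workout_id;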

Now, you might be thinking: load-volume only approximates the mechanical work done during a workout. It takes no account of the time taken to complete the workout, so it says nothing about intensity or the quality of a session. To account for intensity, we can divide load-volume by rep-volume (the total number of reps in a workout), a value roughly proportional to a workout's duration. Using the example above:

7,800 load-volume / 60 rep-volume = 130 (avg. weight lifted per rep)
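
Sticking with the same hypothetical sets table, that's just one aggregate divided by another (the NULLIF guards against dividing by zero if a workout has no reps recorded):

-- Average weight lifted per rep: load-volume divided by rep-volume.
SELECT workout_id,
       sum(weight * reps)::numeric / NULLIF(sum(reps), 0) as avg_weight_per_rep
FROM sets
GROUP BY workout_id;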

This metric is fine if rest periods are kept constant, but changing rest periods will throw it off. Anyway, that's perhaps a topic for another article.

Getting to the point of this post, I needed a way to retrieve a user's load-volume for each day over the past X days. The challenge was that users don't work out every single day of the week, so I'd have ugly empty spaces on the days where the user didn't train:

The first stab at my analytics graph

This is where Postgres's generate_series came to the rescue. Say you want to select each day for the past three weeks. Using generate_series and a sub-query in the FROM clause, this is possible:

SELECT series.days FROM ( SELECT generate_series(-21, 0) + current_date as days ) series;

    days    
------------
 2013-05-11
 2013-05-12
 2013-05-13
 2013-05-14
 2013-05-15
 2013-05-16
 2013-05-17
 2013-05-18
 2013-05-19
 2013-05-20
 2013-05-21
 2013-05-22
 2013-05-23
 2013-05-24
 2013-05-25
 2013-05-26
 2013-05-27
 2013-05-28
 2013-05-29
 2013-05-30
 2013-05-31
 2013-06-01
(22 rows)
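
As an aside, on Postgres 8.4 and later generate_series can also step over timestamps directly, which skips the integer-plus-date addition trick; the ::date cast just trims the time component:

SELECT generate_series(current_date - 21, current_date, interval '1 day')::date as days;

Both forms produce the same 22 rows, so it's purely a matter of taste.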

Now we can get the total load-volume for each day by performing a left join of the workouts table to the series above, using the SUM() aggregate function on the load_volume column, and grouping by the series:

SELECT series.days, sum(COALESCE(load_volume, 0)) as total_load_volume
FROM ( SELECT generate_series(-21, 0) + current_date as days ) series
LEFT OUTER JOIN workouts
  ON series.days = workouts.created_at::date
  AND workouts.user_id = 9
  AND workouts.category IN ('Full Body','Lower Body','Upper Body','Legs','Hips',
                            'Shoulders','Arms','Chest','Back')
  AND workouts.created_at BETWEEN '2013-05-10 04:00:00.000000' AND '2013-06-01 04:00:00.000000'
GROUP BY 1 ORDER BY 1;

    days    | total_load_volume 
------------+-------------------
 2013-05-11 |                 0
 2013-05-12 |                 0
 2013-05-13 |                 0
 2013-05-14 |                 0
 2013-05-15 |                 0
 2013-05-16 |                 0
 2013-05-17 |               600
 2013-05-18 |                 0
 2013-05-19 |                 0
 2013-05-20 |                 0
 2013-05-21 |                 0
 2013-05-22 |              1950
 2013-05-23 |                 0
 2013-05-24 |                 0
 2013-05-25 |                 0
 2013-05-26 |                 0
 2013-05-27 |                 0
 2013-05-28 |                 0
 2013-05-29 |                 0
 2013-05-30 |               800
 2013-05-31 |                 0
 2013-06-01 |                 0
(22 rows)
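
A small aside on the COALESCE: sum() skips NULL inputs, but it returns NULL rather than 0 when every input is NULL, which is exactly what happens on the rest days where the left join matched nothing. Wrapping the aggregate itself, COALESCE(sum(load_volume), 0), works just as well; here's a minimal demonstration:

SELECT sum(x) as null_sum, COALESCE(sum(x), 0) as zero_sum
FROM (VALUES (NULL::integer)) t(x);

 null_sum | zero_sum 
----------+----------
          |        0
(1 row)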

We can see there were three workouts done in the past three weeks (this is fake data, so don't be quick to judge!). But what if a user had done more than one workout on a specific day? That wouldn't be evident, because in using SUM() we lost the individual workout ids and load-volume values. I'll modify the data so that two workouts were completed on '2013-05-30'. Using array_agg on both the id and load_volume columns of the workouts table, we can summon the lost data from the abyss, and the two workouts become clearly evident:

SELECT
  series.day,
  array_agg(id) as ids,
  array_agg(load_volume) as load_volumes,
  sum(COALESCE(load_volume, 0)) as total_load_volume
FROM ( SELECT generate_series(-22, 0) + current_date as day ) series
LEFT OUTER JOIN workouts
  ON series.day = workouts.created_at::date
  AND workouts.user_id = 9
  AND workouts.category IN ('Full Body','Lower Body','Upper Body','Legs','Hips',
                            'Shoulders','Arms','Chest','Back')
  AND workouts.created_at BETWEEN '2013-05-10 04:00:00.000000' AND '2013-06-01 04:00:00.000000'
GROUP BY 1 ORDER BY 1;

    day     |   ids   | load_volumes | total_load_volume 
------------+---------+--------------+-------------------
 2013-05-10 | {NULL}  | {NULL}       |                 0
 2013-05-11 | {NULL}  | {NULL}       |                 0
 2013-05-12 | {NULL}  | {NULL}       |                 0
 2013-05-13 | {NULL}  | {NULL}       |                 0
 2013-05-14 | {NULL}  | {NULL}       |                 0
 2013-05-15 | {NULL}  | {NULL}       |                 0
 2013-05-16 | {NULL}  | {NULL}       |                 0
 2013-05-17 | {60}    | {600}        |               600
 2013-05-18 | {NULL}  | {NULL}       |                 0
 2013-05-19 | {NULL}  | {NULL}       |                 0
 2013-05-20 | {NULL}  | {NULL}       |                 0
 2013-05-21 | {NULL}  | {NULL}       |                 0
 2013-05-22 | {59}    | {1950}       |              1950
 2013-05-23 | {NULL}  | {NULL}       |                 0
 2013-05-24 | {NULL}  | {NULL}       |                 0
 2013-05-25 | {NULL}  | {NULL}       |                 0
 2013-05-26 | {NULL}  | {NULL}       |                 0
 2013-05-27 | {NULL}  | {NULL}       |                 0
 2013-05-28 | {NULL}  | {NULL}       |                 0
 2013-05-29 | {NULL}  | {NULL}       |                 0
 2013-05-30 | {61,58} | {300,800}    |              1100
 2013-05-31 | {NULL}  | {NULL}       |                 0
 2013-06-01 | {NULL}  | {NULL}       |                 0
(23 rows)
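
Those {NULL} arrays are a bit noisy. On Postgres 9.3 and later, array_remove can strip them out (its comparisons use IS NOT DISTINCT FROM semantics, so it can remove NULLs); swapping array_agg(id) for array_remove(array_agg(id), NULL) in the query above turns each {NULL} into an empty {}:

SELECT array_remove(ARRAY[61, NULL, 58], NULL) as ids;

   ids   
---------
 {61,58}
(1 row)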

Finally, below is what it actually looks like using Rails' find_by_sql method, which allows you to write raw SQL. The hardcoded values from the queries above become ? placeholders, bound in order from the parameters passed alongside the SQL string, which allows all of these values (the series length, the categories, etc.) to change dynamically based on the user's preferences.

Workout.find_by_sql(["
 SELECT 
 series.day,
 array_agg(id) as ids, 
 array_agg(load_volume) as load_volumes, 
 sum(COALESCE(load_volume, 0)) as total_load_volume 
 FROM (
 SELECT generate_series(-22, 0) + current_date as day
 ) series 
 LEFT OUTER JOIN 
 workouts on series.day = workouts.created_at::date AND 
 workouts.user_id = 9 AND 
 workouts.category IN ('Full Body','Lower Body','Upper Body','Legs','Hips',
'Shoulders','Arms','Chest','Back') AND 
 (workouts.created_at BETWEEN '2013-05-10 04:00:00.000000' AND '2013-06-01 04:00:00.000000') 
 GROUP BY 1 ORDER BY 1;
", time_period, series_length, @user.id, cat_param, time_window, Time.now.midnight, time_period
 ])

And the result, which I think looks much clearer:

The 'final' graph

Pretty cool, right? Got any comments or suggestions? Is this the way you would do it, or would you opt for sticking with AR's query interface? Let me know what you think on Hacker News.