Victor Farazdagi

Computer science and applied mathematics

15 Nov 2013

Weighted Interval Scheduling

Sample Problem

Some time ago, when working on a student and course management system for my client, I needed to find out how many teaching periods there are within a given date range.

Colleges and universities have different types of teaching periods, e.g. semesters, trimesters, quarters, and summer terms of various lengths. So, given a date range, what is the maximum study duration of non-overlapping terms within that range? That is, if our aim were to spend as much time at school as possible (suppose we suddenly went crazy), which teaching periods should we consider enrolling in?

Here is a sample set of available teaching periods:

[Table: Teaching Periods]

Sounds like a simple quest! Take your time to figure it out: what algorithm, or perhaps heuristic, would you employ?

Summer school in January??

I am sure that those who spotted that Summer School starts in January have already figured out what's going on. If not, the explanation is trivial: I work for an Australian education provider, and there Summer lasts from December to February.

If you are new to the problem, chances are that you would try some sort of heuristic in search of a greedy algorithm. For example, you might opt for selecting whichever interval starts the earliest (Summer School in our case).

And if it weren't for the weights (i.e. teaching periods have different durations and thus are weighted), I bet you would definitely find such an algorithm. But it takes a more involved process in the case of weighted intervals.

The good news: this problem has an elegant and very efficient solution which is not hard to comprehend and implement.

Proper Problem Statement

While we definitely know what we are looking for, let's start by redefining our problem more precisely. Informally, in Weighted Interval Scheduling each interval has a certain value (weight), and we want to select a set of intervals with the maximum total weight.

So, suppose we have a set of intervals $1, 2, \dots, n$; let's denote this set as $I$.

An arbitrary interval $i \in I$ has a start time $s_i$ and a finish time $f_i$. Additionally, each $i$ has a weight, denoted $w_i$.

What we want to find is a subset $O$ ($O \subseteq I$) of non-overlapping (aka mutually compatible) intervals, with the maximum possible sum of the intervals' weights:

$$O = \underset{O \subseteq I;\ \forall i, j \in O:\ f_i \le s_j \lor f_j \le s_i}{\operatorname{argmax}} \sum_{i \in O} w_i$$

Don't panic; the formula is pretty much self-explanatory:

  • Locate a subset $O \subseteq I$.
  • Any two intervals in $O$, say $i$ and $j$, must be non-overlapping. That's why we require that either $f_j \le s_i$ or $f_i \le s_j$ is true (see the tiny helper below).
  • Finally, we are searching for the maximum sum of interval weights $\sum_{i \in O} w_i$, hence the argmax.
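To make that condition concrete, here is a tiny helper sketch of my own (it works on plain start/finish values, before we introduce the Interval class below):

# two intervals are mutually compatible iff one of them finishes
# before (or exactly when) the other one starts
def compatible(s_i, f_i, s_j, f_j):
    return f_i <= s_j or f_j <= s_i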

Since we have a properly defined problem as well as a precisely described expected outcome, it is a good time to start designing our algorithm.

Greedy algorithm

First, let's think of the Weighted Interval Scheduling problem where each interval has an equal weight, i.e. $w_1 = w_2 = \dots = w_n$.

Effectively, what we will be looking for is a subset $O$ with as many intervals in it as possible. Indeed, if intervals have equal weight, then the more intervals we are able to select, the higher our total weight.

It is pretty clear that (unweighted) Interval Scheduling is nothing but a special case of the more general Weighted Interval Scheduling.

If you look at our list of teaching periods, you will probably notice that you can select four intervals at most:

  1. Summer School
  2. Trimester 1
  3. Trimester 2
  4. Either Trimester 3 or Semester 2

The greedy strategy that happens to produce an optimal solution $O$ for unweighted intervals is pretty simple: accept the interval $i$ which has the minimum finishing time $f_i$.

That's it! You just grab the interval $i$ with the earliest $f_i$, discard any intervals that overlap with $i$, and continue until there are no more intervals to pick from. Try it on our primitive example; you will immediately grasp what is going on.

Let me further illustrate this with code:

import time
from datetime import datetime

class Interval(object):
    '''Date interval'''

    def __init__(self, title, start, finish):
        self.title = title
        self.start = int(time.mktime(datetime.strptime(start, "%d %b, %y").timetuple()))
        self.finish = int(time.mktime(datetime.strptime(finish, "%d %b, %y").timetuple()))
        # weight of an interval is its duration (we maximize total study time)
        self.weight = self.finish - self.start

    def __repr__(self):
        return str((self.title, self.start, self.finish))

def schedule_unweighted_intervals(I):
    '''Use greedy algorithm to schedule unweighted intervals
       sorting is O(n log n), selecting is O(n)
       whole operation is dominated by O(n log n)
    '''

    I.sort(key=lambda x: x.finish)  # f_1 <= f_2 <= .. <= f_n

    O = []
    finish = 0
    for i in I:
        if finish <= i.start:  # i starts after the last accepted interval ends
            finish = i.finish
            O.append(i)

    return O
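To see the greedy scheduler in action, here is a quick sanity check. The dates below are made up for illustration (the original teaching-period table isn't reproduced here), but they follow the Australian calendar from our example:

# hypothetical teaching periods, roughly matching our sample set
I = [
    Interval("Summer School", "5 Jan, 15", "20 Feb, 15"),
    Interval("Trimester 1", "2 Mar, 15", "29 May, 15"),
    Interval("Semester 1", "2 Mar, 15", "19 Jun, 15"),
    Interval("Trimester 2", "8 Jun, 15", "21 Aug, 15"),
    Interval("Semester 2", "31 Aug, 15", "11 Dec, 15"),
    Interval("Trimester 3", "7 Sep, 15", "27 Nov, 15"),
]

for i in schedule_unweighted_intervals(I):
    print(i.title)
# -> Summer School, Trimester 1, Trimester 2, Trimester 3
# four intervals, and greedy indeed grabs Trimester 3, not Semester 2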

If you need more code to look at, here is the complete implementation of Interval Scheduling in Python.

Now that we have wrapped up constant weight interval scheduling, what happens to our amazing greedy strategy if intervals have different weights? It will fail miserably.

Just consider the situation when we need to decide which one to pick as the last interval: Trimester 3 or Semester 2? If we follow the greedy method of selecting by the smallest finishing time $f_i$, then we will have to pick Trimester 3 (instead of the longer Semester 2).

So, while unweighted intervals are easy to schedule, we need to spend more time on weighted ones!

Recursive solution

Before we proceed I have to make one assumption: all intervals are sorted by finish time, i.e. $f_1 \le f_2 \le \dots \le f_n$. That's not too much to ask, as you can always sort your data in, say, $O(n \log n)$ time, so it will not add much to our final asymptotic complexity.

Now, consider two intervals $i$ and $j$. Let's agree that $i < j$ means $f_i \le s_j$, i.e. interval $i$ comes before interval $j$. To make things more straightforward, we will define a function $p(j)$: the rightmost interval $i$ such that $i < j$, i.e. intervals $i$ and $j$ are mutually compatible.

Basically, $p(j)$ will help us talk about the first interval to the left of $j$ which does not overlap with it. I know that "the rightmost non-overlapping interval to the left of $j$" might sound complex at first, but I am sure it will settle once you meditate on it a bit.

When locating that rightmost interval we can use binary search, which has an $O(\log n)$ running time. Since we need to find $p(1), p(2), \dots, p(n)$, the overall complexity is bounded by $O(n \log n)$.

When it comes to Python, the bisect module (yes, internally it is implemented as a binary search) provides a very succinct solution:

import bisect

def calculate_previous_intervals(I):
    '''For every interval j, calculate the rightmost mutually compatible interval i,
       where i < j. I is a sorted list of Interval objects (sorted by finish time).
    '''
    # extract start and finish times
    start = [i.start for i in I]
    finish = [i.finish for i in I]

    p = []
    for j in xrange(len(I)):
        # rightmost interval with f_i <= s_j (-1 if none exists)
        i = bisect.bisect_right(finish, start[j]) - 1
        p.append(i)

    return p

I also posted on GitHub the complete implementation of Weighted Interval Scheduling in Python.
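Continuing with the hypothetical dates from the earlier sketch, we can peek at what $p$ looks like (the values below hold only under those made-up dates):

# intervals must be sorted by finish time before computing p
I.sort(key=lambda x: x.finish)
p = calculate_previous_intervals(I)
# with the made-up dates above: p == [-1, 0, 0, 1, 3, 3]
# e.g. p[4] == 3: the rightmost interval compatible with Trimester 3 is Trimester 2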

Attention! We have come to the crucial point of our discussion, where we will finally get an insight into solving the Weighted Interval Scheduling problem.

Let's think about the optimal solution $O$. This solution is what we are after, but we do not know what it looks like. However, even without knowing much about the actual solution, we can infer that the last interval $n$ is either part of the optimal solution or not:

$$n \in O \lor n \notin O$$

Indeed, it is perfectly safe to assume such a dichotomy. And since it is a dichotomy, we can look into both possibilities and see if we can make any further inferences.

Case 1: $n \in O$

If $n$ is a part of the optimal solution $O$, then we can be sure that no interval between $p(n)$ and $n$ can be part of $O$ (because all intervals within that range overlap with $n$, by the definition of $p(j)$). So, if $n \in O$, we can discard intervals $p(n)+1, p(n)+2, \dots, n-1$.

Another important conclusion is that if $n \in O$, then the optimal solution $O$ must also include an optimal solution to the problem on the set $\{1 \dots p(n)\}$. Indeed, if $O$ is an optimal solution and $n \in O$, then $O$ can be divided into $n$ and an optimal solution to the set of intervals $\{1 \dots p(n)\}$ (we denote such a smaller solution as $O_{p(n)}$).

Case 2: $n \notin O$

When $n \notin O$, our optimal solution is equal to the optimal solution to a smaller problem consisting of the intervals $\{1 \dots n-1\}$. Indeed, if $n$ is not a part of the optimal solution, then we look for such a solution within the smaller set of intervals $\{1 \dots n-1\}$.

Finding the optimal solution $O_j$ in the set $\{1 \dots j\}$

Having reviewed both cases, it is obvious that in order to find the optimal solution $O$ (for $n$ elements) we have to look into smaller and smaller problems of the form $\{1 \dots j\}$.

Let's denote such an optimal solution over the intervals $\{1 \dots j\}$ as $O_j$, and its actual value (maximum sum of weights) as $OPT(j)$. So, for any given $j$ we have the very same dichotomy: either $j \in O_j$ or $j \notin O_j$; therefore we have the crucial piece of our design:

$$OPT(j) = \max(w_j + OPT(p(j)),\ OPT(j-1))$$

That is, the value of the optimal solution $O_j$ is either $w_j + OPT(p(j))$ or $OPT(j-1)$, whichever is bigger.

And from this follows another related inference: an arbitrary interval $j$ is a part of the optimal solution $O_j$ iff:

$$w_j + OPT(p(j)) \ge OPT(j-1)$$

At this point we have everything necessary to solve the original problem. Indeed, we have all we need to compute optimal solutions for any intervals $1 \dots n$, and by the definition of our problem statement we are after the $O_n$ solution:

def compute_opt(j):
    if j == -1:  # base case: no intervals left (0-indexed; p[j] == -1 when none is compatible)
        return 0

    # w[j] is the weight of interval j (e.g. w = [i.weight for i in I])
    return max(w[j] + compute_opt(p[j]), compute_opt(j - 1))

Since we recursively compute optimal solutions to smaller and smaller subsets, we recalculate the very same values over and over again, which results in exponential execution time, i.e. unacceptable. The obvious solution is memoization to avoid recalculation, which results in an amazing $O(n)$ runtime.
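For completeness, here is one way to memoize it; the cache dict is my own addition, while w and p are the same lists as in the snippet above:

# memoized variant: every OPT value is computed at most once -> O(n)
cache = {}
def compute_opt_memoized(j):
    if j == -1:  # base case: no intervals left
        return 0
    if j not in cache:
        cache[j] = max(w[j] + compute_opt_memoized(p[j]),
                       compute_opt_memoized(j - 1))
    return cache[j]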

Now say we have calculated all our optimal solution values: $OPT[1], OPT[2], \dots, OPT[n]$. What if we need (and we generally do) to gather the actual intervals? Easy:

# given OPT and p, find actual solution intervals in O(n)
O = []
def compute_solution(j):
    if j >= 0:  # will halt on j == -1 (i.e. OPT[-1])
        if I[j].weight + OPT[p[j]] >= OPT[j - 1]:  # interval j is in O_j
            O.append(I[j])
            compute_solution(p[j])
        else:
            compute_solution(j - 1)

compute_solution(len(I) - 1)
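As a side note, the same traceback can be done with a plain loop, which avoids deep recursion on large inputs; a sketch under the same assumptions:

def compute_solution_iter(I, OPT, p):
    '''Iterative traceback: same logic as compute_solution, without recursion'''
    O = []
    j = len(I) - 1
    while j >= 0:
        if I[j].weight + OPT[p[j]] >= OPT[j - 1]:  # interval j is in O_j
            O.append(I[j])
            j = p[j]
        else:
            j -= 1
    return O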

As you see, we have used the idea that interval $j$ is in the optimal solution $O_j$ iff:

$$w_j + OPT(p(j)) \ge OPT(j-1)$$

At this point we have solved our original problem by designing an adequate algorithm (more details on how good our algorithm is can be found in the Algorithm Analysis section).

Dynamic programming

The pith of our algorithm is obtaining the $OPT(1) \dots OPT(n)$ list. Once we have it, everything else is trivial to unfold.

So let's scrutinize that $OPT$ list and see whether we really need a recursive solution to calculate it.

Given the formula

$$OPT(j) = \max(w_j + OPT(p(j)),\ OPT(j-1))$$

we know that any given value of $OPT$ depends on previous values: either $OPT(p(j))$ or $OPT(j-1)$.

So, what if we start from the first interval and traverse to the $n$th: can we compute the $OPT(j)$ values iteratively? We can! For any given $j$ to have access to previous $OPT$ values, all we need to do is persist already calculated items as we traverse the set $I$:

import collections

# compute OPTs iteratively in O(n), here we use DP
OPT = collections.defaultdict(int)
OPT[-1] = 0  # the empty problem has zero total weight
for j in xrange(len(I)):  # start from 0 so that OPT[0] = I[0].weight
    OPT[j] = max(I[j].weight + OPT[p[j]], OPT[j - 1])

And since we are iteratively building up subproblems to finally arrive at the one we were originally expected to solve, $OPT(n)$, we are having fun with dynamic programming (or DP for short).

In its essence DP involves dynamically building up sub-solutions, one after another, to finally solve the case in question.

I will write more about DP, not only because the technique is essential knowledge for any decent programmer, but also because solutions involving dynamic algorithms represent the real elegance and beauty of Computer Science.
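Before we move on to the analysis, here is a hypothetical end-to-end routine that simply wires up the helpers sketched above:

def schedule_weighted_intervals(I):
    '''Schedule weighted intervals via DP, O(n log n) overall'''
    I.sort(key=lambda x: x.finish)         # sort by finish time, O(n log n)
    p = calculate_previous_intervals(I)    # O(n log n)

    OPT = collections.defaultdict(int)     # OPT[-1] defaults to 0
    for j in xrange(len(I)):               # build up OPT values, O(n)
        OPT[j] = max(I[j].weight + OPT[p[j]], OPT[j - 1])

    O = compute_solution_iter(I, OPT, p)   # trace back the intervals, O(n)
    O.reverse()                            # restore chronological order
    return O

# with the hypothetical dates above this picks Summer School, Trimester 1,
# Trimester 2 and Semester 2: the longer Semester 2 now wins over Trimester 3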

Algorithm Analysis

It is time to briefly analyse our algorithms and their worst-case complexities. Should you have any issues with Big-Oh analysis, the best resource on the topic is Skiena's "Algorithm Design Manual".

Here is a list of steps we have to take and their respective time complexities:

1. Sort the interval set $I$ so that $f_1 \le f_2 \le \dots \le f_n$

If we use something like Quicksort or Mergesort, we get away with $O(n \log n)$ (on average in the case of Quicksort, and in the worst case for Mergesort).

2. Calculate $p(1), p(2), \dots, p(n)$

We iterate over $n$ different items, and for each item we need to find the first non-overlapping interval to the left of it. Suppose we use binary search to locate that non-overlapping neighbor interval (which has $O(\log n)$ complexity).

Then, since we have two growing functions $f(n) = n$ and $g(n) = \log n$, the total time complexity will be their product:

$$O(f(n)) \cdot O(g(n)) = O(f(n) \cdot g(n)) = O(n \log n)$$

3. Calculate $OPT(1), OPT(2), \dots, OPT(n)$

Both the memoized and the iterative versions of the function yield $O(n)$.

4. Calculate the solution (given the $OPT$ list)

Since we already have the $OPT$ values calculated, recursively traversing them is an $O(n)$ operation.

5. Overall performance

When adding Big-Oh complexities, the dominant term defines the whole sum. In our case it is:

$$O(n \log n)_{\text{sort intervals}} + O(n \log n)_{\text{calculate } p[1..n]} + O(n)_{\text{calculate } OPT[1..n]} + O(n)_{\text{find solution intervals}} = O(n \log n)$$

Not bad, I believe!

Summary

That's it for today, so let me conclude this article with a quick summary:

  • we briefly visited the special case of unweighted intervals
  • an algorithm based on greedy selection proved not good enough to solve the more general case of Weighted Interval Scheduling
  • we then developed a recursive function to calculate $OPT[1..n]$
  • and finally we experienced an example of DP in the iterative version of the $OPT$ calculation
  • once we have an array of $OPT$ values, it is trivial to produce the list of actual intervals that are elements of $O_n$

Have a nice day!

NB: I created a repository with complete implementations of both the weighted and unweighted problem variants. Feel free to fork and send me a PR should you locate any issues with the code!

References