2008-04-25

Problem: find the median value of 2 sorted lists

I came across an intereseting algorithmic problem on a web site lately. Apparently it the kind of problems you are supposed to solve on a black board at Google interview. The formal statement is simple: How to efficiently find the median value of two sorted lists.

It's easy to have a feeling on which way to take. But to actually find an algorithm and implement it in a efficient and bug free way it's another story.

The solution I found is a relatively simple dichotomic search alorithm with a logaritmic complexity in the worst case.

Here is an exemple of how it works. Imagine we have these two sorted lists:

l1 = [1, 4, 5, 7, 7, 8, 9, 9] 
l2 = [0, 1, 3, 6, 6, 6, 7, 8]

We calculate the median of both lists:

l1 median = 7
l2 median = 6

l1 median is greater than l2 median so we can safely remove the numbers on the right side of the median of l1 because the final median is somewhere between the median of both. We also have to remove the same amount of numbers at the left side of l2 to keep the stabilty of the final list. If one of the list is shorter the amount of removed numbers will the minimum between both median index. So we get:

l1 = [1, 4, 5, 7, 7]
l2 = [6, 6, 6, 7, 8]

We continue the process until have a list with 2 numbers (or less).

l1 median = 5
l2 median = 6

We remove 2 from left side of l1, we remove 2 from right side of l2.

l1 = [5, 7, 7]
l2 = [6, 6, 6]

l1 median = 7
l2 median = 6

We remove 1 from left side of l2, we remove 1 from right side of l1.

l1 = [5, 7]
l2 = [6, 6]

Remember that we have 2 list of the same size. In another case one could be an order of magnitude bigger than the other. Finaly we have to deal with some special case:

  1. The 2 remaining numbers have to be inserted on the left side of the median of the other list.
  2. The 2 remaining numbers have to be inserted on the right side of the median of the other list.
  3. One have to be inserted on the left side of the median and the other on the right side.

For each case we have to check if the number could change the median of the resulting list. If we decide to insert l1 into l2, there will be no implication on the initial median value of l2. But if we decide to insert l2 into l1, the median of l1 will be completly changed. In both case we optain this list [5, 6, 6, 7] and calcuate the median of it : 6.

Inserting the two remaining value into the other list is not the fastest thing possible but it's simple and understandable.

Here is my python implementation:

download median.py

from random import randint

def median(a, start=None, end=None):
    """find the median value of a sorted list or in a subset of it"""
    if start is None:
        start = 0
    if end is None:
        end = len(a)
    l = end-start
    if l < 1:
        raise "median value of a empty list? It make no sense"
    index = start + l/2
    if l % 2 == 0:
        return index, (a[index] + a[(index-1)]) / 2.0
    else:
        return index, float(a[index])

def median_2_lists(l1, l2):
    """find the median value of 2 sorted list"""
    iter = 0
    solution = None
    left1 = 0
    left2 = 0
    right1 = len(l1)
    right2 = len(l2)
    
    print l1[left1:right1], l2[left2:right2]
    
    while True:
        iter = iter+1
        index1, value1 = median(l1, left1, right1)
        index2, value2 = median(l2, left2, right2)
        
        minus1 = 1 if (right1-left1) % 2 == 0 else 0
        minus2 = 1 if (right2-left2) % 2 == 0 else 0
        remove_offset = max(1, min((right1-left1) / 
            2 - minus1, (right2-left2) / 2 - minus2))
        
        # to avoid special blocking case like this these
        # l1 = [-5,5]; l2 = [1,2] 
        # l1 = [5, 5]; l2 = [-11, -8, 20, 21]
        # I decided to stop the algorithm and insert the 
        # 2 last items into the big list
        if right1-left1 <= 2 or right2-left2 <= 2:
            break
    
        if value1 < value2:
            left1 += remove_offset
            print "removed of left side of l1 : %d" % remove_offset
            right2 -= remove_offset
            print "removed of right side of l2 : %d" % remove_offset
        elif value1 > value2:
            left2 += remove_offset
            print "removed of left side of l2 : %d" % remove_offset
            right1 -= remove_offset
            print "removed of right side of l1 : %d" % remove_offset
        else:
            print "solution found"
            solution = value1
            break
        
        print l1[left1:right1], l2[left2:right2]
    
    if right1-left1 > right2-left2:
        big = l1
        short = l2
        left = left1
        right = right1
        i = left2
        end = right2
    else:
        big = l2
        short = l1
        left = left2
        right = right2
        i = left1
        end = right1
    
    if solution is None:
        assert 0 < end-i < 3
        # we need to insert the remaining 1 or 2 number(s) of the short list
        # into the big list at the right place. It's done in a dichotomic way
        while i < end:
            l, r = left, right
            while True:
                index, value = median(big, l, r)
                if value == short[i] or index == l or index == r:
                    if short[i] > value:
                        # it's bad to modify the big list
                        big.insert(index+1, short[i])
                    else:
                        big.insert(index, short[i])
                    i += 1
                    right += 1
                    break
                else:
                    if short[i] > value:
                        l = index
                    if short[i] < value:
                        r = index
        solution = median(big, left, right)[1]
        
    return solution, iter

def test_median_2_lists():
    print "Generating random lists"
    l1 = []
    l2 = []
    i = 0
    n = randint(5,9)
    while i<n:
        l2.append(randint(0,9))
        i += 1
    i = 0
    n = randint(5,9)
    while i<n:
        l1.append(randint(0,9))
        i += 1
    
    l1.sort()
    l2.sort()
    
    merge = list(l1)
    merge.extend(l2)
    merge.sort()
    
    control_value = median(merge)[1]
    
    print "Start algorithm"
    solution, iter = median_2_lists(l1, l2)
    print "%s iteration(s) done" % iter
    print "control value : %f" % control_value
    print "solution value : % f" % solution
    assert control_value == solution
    print "Successful!"

test_median_2_lists()