Problem: find the median value of 2 sorted lists
I came across an intereseting algorithmic problem on a web site lately. Apparently it the kind of problems you are supposed to solve on a black board at Google interview. The formal statement is simple: How to efficiently find the median value of two sorted lists.
It's easy to have a feeling on which way to take. But to actually find an algorithm and implement it in a efficient and bug free way it's another story.
The solution I found is a relatively simple dichotomic search alorithm with a logaritmic complexity in the worst case.
Here is an exemple of how it works. Imagine we have these two sorted lists:
l1 = [1, 4, 5, 7, 7, 8, 9, 9] l2 = [0, 1, 3, 6, 6, 6, 7, 8]
We calculate the median of both lists:
l1 median = 7 l2 median = 6
l1 median is greater than l2 median so we can safely remove the numbers on the right side of the median of l1 because the final median is somewhere between the median of both. We also have to remove the same amount of numbers at the left side of l2 to keep the stabilty of the final list. If one of the list is shorter the amount of removed numbers will the minimum between both median index. So we get:
l1 = [1, 4, 5, 7, 7] l2 = [6, 6, 6, 7, 8]
We continue the process until have a list with 2 numbers (or less).
l1 median = 5 l2 median = 6
We remove 2 from left side of l1, we remove 2 from right side of l2.
l1 = [5, 7, 7] l2 = [6, 6, 6] l1 median = 7 l2 median = 6
We remove 1 from left side of l2, we remove 1 from right side of l1.
l1 = [5, 7] l2 = [6, 6]
Remember that we have 2 list of the same size. In another case one could be an order of magnitude bigger than the other. Finaly we have to deal with some special case:
- The 2 remaining numbers have to be inserted on the left side of the median of the other list.
- The 2 remaining numbers have to be inserted on the right side of the median of the other list.
- One have to be inserted on the left side of the median and the other on the right side.
For each case we have to check if the number could change the median of the resulting list. If we decide to insert l1 into l2, there will be no implication on the initial median value of l2. But if we decide to insert l2 into l1, the median of l1 will be completly changed. In both case we optain this list [5, 6, 6, 7] and calcuate the median of it : 6.
Inserting the two remaining value into the other list is not the fastest thing possible but it's simple and understandable.
Here is my python implementation:
from random import randint
def median(a, start=None, end=None):
"""find the median value of a sorted list or in a subset of it"""
if start is None:
start = 0
if end is None:
end = len(a)
l = end-start
if l < 1:
raise "median value of a empty list? It make no sense"
index = start + l/2
if l % 2 == 0:
return index, (a[index] + a[(index-1)]) / 2.0
else:
return index, float(a[index])
def median_2_lists(l1, l2):
"""find the median value of 2 sorted list"""
iter = 0
solution = None
left1 = 0
left2 = 0
right1 = len(l1)
right2 = len(l2)
print l1[left1:right1], l2[left2:right2]
while True:
iter = iter+1
index1, value1 = median(l1, left1, right1)
index2, value2 = median(l2, left2, right2)
minus1 = 1 if (right1-left1) % 2 == 0 else 0
minus2 = 1 if (right2-left2) % 2 == 0 else 0
remove_offset = max(1, min((right1-left1) /
2 - minus1, (right2-left2) / 2 - minus2))
# to avoid special blocking case like this these
# l1 = [-5,5]; l2 = [1,2]
# l1 = [5, 5]; l2 = [-11, -8, 20, 21]
# I decided to stop the algorithm and insert the
# 2 last items into the big list
if right1-left1 <= 2 or right2-left2 <= 2:
break
if value1 < value2:
left1 += remove_offset
print "removed of left side of l1 : %d" % remove_offset
right2 -= remove_offset
print "removed of right side of l2 : %d" % remove_offset
elif value1 > value2:
left2 += remove_offset
print "removed of left side of l2 : %d" % remove_offset
right1 -= remove_offset
print "removed of right side of l1 : %d" % remove_offset
else:
print "solution found"
solution = value1
break
print l1[left1:right1], l2[left2:right2]
if right1-left1 > right2-left2:
big = l1
short = l2
left = left1
right = right1
i = left2
end = right2
else:
big = l2
short = l1
left = left2
right = right2
i = left1
end = right1
if solution is None:
assert 0 < end-i < 3
# we need to insert the remaining 1 or 2 number(s) of the short list
# into the big list at the right place. It's done in a dichotomic way
while i < end:
l, r = left, right
while True:
index, value = median(big, l, r)
if value == short[i] or index == l or index == r:
if short[i] > value:
# it's bad to modify the big list
big.insert(index+1, short[i])
else:
big.insert(index, short[i])
i += 1
right += 1
break
else:
if short[i] > value:
l = index
if short[i] < value:
r = index
solution = median(big, left, right)[1]
return solution, iter
def test_median_2_lists():
print "Generating random lists"
l1 = []
l2 = []
i = 0
n = randint(5,9)
while i<n:
l2.append(randint(0,9))
i += 1
i = 0
n = randint(5,9)
while i<n:
l1.append(randint(0,9))
i += 1
l1.sort()
l2.sort()
merge = list(l1)
merge.extend(l2)
merge.sort()
control_value = median(merge)[1]
print "Start algorithm"
solution, iter = median_2_lists(l1, l2)
print "%s iteration(s) done" % iter
print "control value : %f" % control_value
print "solution value : % f" % solution
assert control_value == solution
print "Successful!"
test_median_2_lists()