I've spent the last few weeks reading papers on data stream
processing. (Data stream processing is a really facinating topic.
The basic idea is that you have constantly arriving new data and you
want to continuously execute queries, e.g., you want to know the
average of the last N
elements. The difficulty is that you don't
have enough space to store the whole stream or even the last N
elements, or you don't have the time required to scan the last N
.
As such, you need an algorithm that not only incrementally updates the
state to include new elements as they arrive, but also somehow expires
old elements as they eclipse the event horizon! See "STREAM: The
Stanford Data Stream Management System" by Arasu et
al. for a good overview of the problem.)
This blog post is not about stream processing, however. This blog post is about an algorithm, which I happened to come across it while researching that data stream processing. The algorithm is able to find the element that appears more than 50% of the time in an array, if there is one. That's not so amazing by itself. The amazing part is that it has an O(N) run time and an O(1) memory requirement. Further, the algorithm is amazingly short, yet not at all obvious to me. The algorithm is described in a paper entitled "Finding Repeated Elements" by Misra and Gries.
The algorithm consists of two steps. First, it identifies a candidate value. The candidate value either occurs >50% of the time or there is no element in the array that occurs >50% of the time. The second step verifies whether the candidate element does indeed occur >50% of the time. Here's the algorithm, implemented in C:
#define undefined (-1)
int
find_element_that_occurs_more_than_half_the_time (int array[], int n)
{
/* Step 1: Find the element that occurs >50% of the time, if
there is one. */
int c = 0;
int candidate = undefined;
for (int i = 0; i < n; i ++)
{
if (array[i] == candidate)
/* Match. */
c += 2;
else if (i == c)
/* New candidate. */
{
c += 2;
candidate = array[i];
}
}
/* Step 2: Verify that the candidate actually occurs >50% of the
time. */
for (i = 0, c = 0; i < n; i ++)
if (array[i] == candidate)
c ++;
if (c * 2 > n)
return candidate;
else
return undefined;
}
Does it work? Valdidating a candidate is easy. My problem was
understanding that selecting the candidate really works. I found it
best to work through some examples. Let's consider one example, the
array: [ 0, 1, 2, 2, 0, 0, 0 ]
, and what happens at each
iteration of the the first loop:
Initial state: i = 0, c = 0, candidate = undefined
array[i = 0] = 0 => New candidate => c = 2, candidate = 0
array[i = 1] = 1 => Do nothing => c = 2, candidate = 0
array[i = 2] = 2 => New candidate => c = 4, candidate = 2
array[i = 3] = 2 => Match => c = 6, candidate = 2
array[i = 4] = 0 => Do nothing => c = 6, candidate = 2
array[i = 5] = 0 => Do nothing => c = 6, candidate = 2
array[i = 6] = 0 => New candidate => c = 8, candidate = 0
Yup, it works.
The main insight required to understand the algorithm is that when c
== i
, there is no element that has appeared more than 50% of the time
in the first i
elements so the only possibility is array[i]
.