more-than-50-percent

I've spent the last few weeks reading papers on data stream processing. (Data stream processing is a really facinating topic. The basic idea is that you have constantly arriving new data and you want to continuously execute queries, e.g., you want to know the average of the last N elements. The difficulty is that you don't have enough space to store the whole stream or even the last N elements, or you don't have the time required to scan the last N. As such, you need an algorithm that not only incrementally updates the state to include new elements as they arrive, but also somehow expires old elements as they eclipse the event horizon! See "STREAM: The Stanford Data Stream Management System" by Arasu et al. for a good overview of the problem.)

This blog post is not about stream processing, however. This blog post is about an algorithm, which I happened to come across it while researching that data stream processing. The algorithm is able to find the element that appears more than 50% of the time in an array, if there is one. That's not so amazing by itself. The amazing part is that it has an O(N) run time and an O(1) memory requirement. Further, the algorithm is amazingly short, yet not at all obvious to me. The algorithm is described in a paper entitled "Finding Repeated Elements" by Misra and Gries.

The algorithm consists of two steps. First, it identifies a candidate value. The candidate value either occurs >50% of the time or there is no element in the array that occurs >50% of the time. The second step verifies whether the candidate element does indeed occur >50% of the time. Here's the algorithm, implemented in C:

  #define undefined (-1)
  int
  find_element_that_occurs_more_than_half_the_time (int array[], int n)
  {
    /* Step 1: Find the element that occurs >50% of the time, if
       there is one.  */
    int c = 0;
    int candidate = undefined;
    for (int i = 0; i < n; i ++)
      {
        if (array[i] == candidate)
          /* Match.  */
          c += 2;
        else if (i == c)
          /* New candidate.  */
          {
            c += 2;
            candidate = array[i];
          }
      }

    /* Step 2: Verify that the candidate actually occurs >50% of the
       time.  */
    for (i = 0, c = 0; i < n; i ++)
      if (array[i] == candidate)
        c ++;

    if (c * 2 > n)
      return candidate;
    else
      return undefined;
  }

Does it work? Valdidating a candidate is easy. My problem was understanding that selecting the candidate really works. I found it best to work through some examples. Let's consider one example, the array: [ 0, 1, 2, 2, 0, 0, 0 ], and what happens at each iteration of the the first loop:

  Initial state: i = 0, c = 0, candidate = undefined
  array[i = 0] = 0 => New candidate => c = 2, candidate = 0
  array[i = 1] = 1 => Do nothing => c = 2, candidate = 0
  array[i = 2] = 2 => New candidate => c = 4, candidate = 2
  array[i = 3] = 2 => Match => c = 6, candidate = 2
  array[i = 4] = 0 => Do nothing => c = 6, candidate = 2
  array[i = 5] = 0 => Do nothing => c = 6, candidate = 2
  array[i = 6] = 0 => New candidate => c = 8, candidate = 0

Yup, it works.

The main insight required to understand the algorithm is that when c == i, there is no element that has appeared more than 50% of the time in the first i elements so the only possibility is array[i].