Ericpony's Blog: Designing algorithms with Spark aggregate

Thursday, February 12, 2015

Designing algorithms with Spark aggregate – Part I

In this post, I shall give a simple framework to design algorithms in terms of Spark's aggregate function. This framework is just a proof of concept and should not be applied blindly in practice. However, it does induce efficient algorithms in some special cases, as will be shown in the examples.

In general, our goal is to compute a property of any given instance from a domain. Let $G$ denote the domain of all interested instances, and $C$ the domain of the desired property. For example, if $G$ denotes graphs and the property to compute is MST, then $C$ will be the set of trees. Furthermore, we assume that there exists an associative and commutative "composition" operator $\oplus : G \rightarrow G \rightarrow G$, such that an instance of interests can be decomposed into (and composed back from) sub-instances in $G$. We shall denote the desired property of instance $g$ as $f(g)$. For example, if $g$ is a graph, then $f(g)$ can be some graph property such as diameter, MST and vertex cover. Note that the the mapping $f : G \rightarrow C$ can be non-deterministic if the property is not unique w.r.t. the given instance.

To use the aggregate function, we specify zero, seg and comb as follows:
$\mathrm{zero} \in (G, C)$ is a pair comprised of the zero values from $G$ and $C$. If there are no such zero values, we can set $\mathrm{zero} = (g, f(g))$ for an arbitrary sub-instance of $g$.
$\mathrm{seq} : (G, C) \rightarrow G \rightarrow (G, C)$ takes a partially aggregated result $(g_1, f(g_1))$ and a sub-instance $g_2$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
$\mathrm{comb} : (G, C) \rightarrow (G, C) \rightarrow (G, C)$ takes two partially aggregated results $(g_1, f(g_1))$ and $(g_2, f(g_2))$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
The rest of the steps are standard: we first decompose the input instance into a sequence of sub-instances. The sequence is then converted to an RDD through parallelize. Finally, we invoke RDD.aggregate(zero)(seq,comb) to compute the desired property of the instance.

Remarks

1. While functions seq and comb look very similar in spec, they can use different methods to compute $f(g_3)$. Due to the partial results provided, seq is suitable to use incremental methods and comb can use (the merge part of) divide-and-conquer methods to compute the desired property. See the convex hull example below.
2. The assumption that operator $\oplus$ is associative and commutative over $G$ can be relaxed to be so only with respect to the given property $f$. For instance, suppose that $G$ is a set of finite strings and $\oplus$ is the string concatenation operator. While $\oplus$ is not commutative per se, it is commutative w.r.t. the string length property. That is, $\mathrm{strlen}(g_1 \oplus g_2) = \mathrm{strlen}(g_2 \oplus g_1)$ for all $g_1, g_2 \in G$. The framework is thus applicable to compute the length of string. See the majority algorithm below for a more non-trivial example.
3. The framework is practical only if $|g_3|$ can be significantly less than $|g_1 \oplus g_2|$, for otherwise we will have to pass almost the whole instance from node to node. In the case that $f(g_1 \oplus g_2)$ can be computed from $f(g_1)$ and $f(g_2)$, there is no need to output $g_3$, i.e., $|g_3|=0$. This special case allows us to operate on simpler domains (e.g., the type of seq becomes $C \rightarrow G \rightarrow C$) and to obtain more efficient aggregate algorithms due to less communication overheads.

Examples

The above naive framework can yield efficient algorithms if $f(g_1 \oplus g_2)$ can be computed solely based on $f(g_1)$ and $f(g_2)$. Below we give three examples that satisfy this condition. The first is a variant of the reverse-delete algorithm that finds an MST of a graph. The second is an outline of a convex hull algorithm. The last is Boyer and Moore's algorithm that finds the majority element in a list.

Finding MST. Let $G$ be the set of all weighted undirected graphs, and $C$ be the set of trees. Given a graph $g$ from $G$, we assign a unique index to each of its edges. Let $w(e)$ and $idx(e)$ denote the weight and index of edge $e$, respectively. Given two edges $e_1$ and $e_2$ of $g$, we say $e_1 < e_2$ iff either $w(e_1) < w(e_2)$, or $w(e_1) = w(e_2)$ and $idx(e_1)$ < $idx(e_2)$. Now we define an operator msf over $G$, such that msf computes a minimum spanning forest of $g_1 \cup g_2$ from $g_1$ and $g_2$ as follows:

def msf (g1, g2) = {
  var g = g1 ∪ g2
  while (g contains a cycle c)
    remove the largest edge from c
  return g
}

Let RDD(g) denote the RDD obtained from decomposing and parallelizing graph $g$. Then RDD(g).aggregate(Nil)(msf,msf) computes a minimum spanning forest of $g$, which would be a MST if $g$ is connected. Note that the result is deterministic if the assignment of indices is deterministic.

Finding convex hull. Here $G$ is the collection of finite sets in Euclidean plane and $C$ is the set of convex polygons. The convex hull of a set of points is the smallest convex polygon containing all the points. If the points are given as a sequence, we can use aggregate to find their convex hull by computing smaller convex hulls incrementally in seq and then merging them in comb to obtain the final convex hull. Both of seq and comb can be implemented to take linear time (see the links for pseudo codes).

Finding majority. Let $C$ be a set of comparable elements. Fix some element $\bot \in C$ and let $G = (C - \{\bot\})^*$ be a set of finite sequences. The majority problem is to determine in a given sequence from $G$ whether there is an element with more occurrences than all the others, and if so, to determine this element. Click here to see a linear-time algorithm that solves the problem using aggregate. Note that the aggregate invocation only finds a candidate for majority; we need a further check to assert that the candidate is indeed the majority element.

// Input:  a list of elements and an element 
// Output: a boolean value indicating whether elem is the majority of list
def check_majority (list: G, elem: C): Boolean = {
  def zero = (elem, 0)
  def op ((a: C, c: Int), b: Int) = {
    if(a != b)
      return (a, c - 1)
    else
      return (a, c + 1)
  }
  val res = list.fold(zero)(op)
  return res._2 > 0
}
// Input:  a list of comparable elements
// Output: the majority element of the list, or ⊥ if there is no such element
def find_majority (list: G): C = {
  def zero = (⊥, 0)  
  def seq ((a: C, c: Int), b: Int) = { 
    if(c == 0) 
      return (b, 1)
    if(a == b)
      return (a, c + 1)
    else
      return (a, c - 1)
  }  
  def comb ((a1: C, c1: Int), (a2: C, c2: Int)) = {
    if(a1 == a2) 
      return (a1, c1 + c2)
    if(c1 > c2)
      return (a1, c1 - c2)
    else
      return (a2, c2 - c1)
  }
  val res = list.aggregate(zero)(seq, comb)

  if(check_majority(list, res._1))
    return res._1
  else
    return ⊥
}

Ericpony's Blog

Thursday, February 12, 2015

Designing algorithms with Spark aggregate – Part I

Remarks

Examples

No comments:

Post a Comment