The frequent items problem is to identify the items that appear most often in a stream. For example, we may want to find the most popular products on Amazon, the most watched videos on YouTube, or the most searched queries on Google. We process each item as it appears in the stream and, at any moment, our goal is to return the most frequent items without having to scan through a database.
The obvious algorithm for the frequent items problem is to store each item and the number of times we’ve seen it. The issue is that we would then need space linear in the number of unique items. If the items are pairs of products, for example, then the space scales quadratically. If the items are triplets of videos, then the space scales cubically. Clearly, we need a more efficient algorithm. It turns out that we can’t solve the problem exactly with less space but we can solve the problem approximately.
Consider a stream of \(n\) items \(x_1, \ldots, x_n\). We’ll call the set of all items in the stream \(\mathcal{U}\). Let \(k\) be a positive integer and \(\epsilon > 0\) be a small constant. The frequent items estimation problem is to return
every item that appears at least \(\frac{n}{k}\) times and
only items that appear at least \((1-\epsilon)\frac{n}{k}\) times.
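For instance, with a stream of \(n = 10{,}000\) items, \(k = 100\), and \(\epsilon = 0.1\), we must return every item that appears at least \(\frac{n}{k} = 100\) times and we may only return items that appear at least \((1-\epsilon)\frac{n}{k} = 90\) times; an item that appears, say, 95 times may or may not be returned.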
We’ll see how to use a randomized hashing algorithm to solve the problem. The algorithm addresses the slightly different problem of estimating the frequency of any item \(v\) in the stream. Formally, let \(f(v) = \sum_{i=1}^n \mathbb{1}[x_i=v]\) be the number of times item \(v\) appears in the stream. Here, we use the notation \(\mathbb{1}[x_i=v]\) to denote the indicator random variable that the \(i\)th item in the stream is \(v\). An indicator random variable is 1 if the event happens and 0 otherwise.
Our goal is to return an estimate \(\hat{f}(v)\) so that \(f(v) \leq \hat{f}(v) \leq f(v) + \frac{\epsilon}{k} n\) with high probability. If we have these estimates, observe that we can solve the frequent items estimation problem simply by returning all items for which \(\hat{f}(v) \geq \frac{n}{k}\).
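To see why, the lower bound \(f(v) \leq \hat{f}(v)\) guarantees that every item with \(f(v) \geq \frac{n}{k}\) is returned, while the upper bound guarantees that every returned item satisfies \[ f(v) \geq \hat{f}(v) - \frac{\epsilon}{k} n \geq \frac{n}{k} - \frac{\epsilon}{k} n = (1-\epsilon)\frac{n}{k}, \] which are exactly the two conditions of the frequent items estimation problem.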
The key ingredient of the algorithm is a hash function.
Let the hash function \(h\) be a random function from the set of all items \(\mathcal{U}\) to the integers \(\{1,\ldots, m\}\). The hash function is constructed using a seed of random numbers and then the function is fixed. Given an item \(x\), the hash function always returns the same hashed output \(h(x)\).
Definition: A hash function \(h: \mathcal{U} \to \{1, \ldots, m\}\) is uniform random if
\(\Pr(h(x)=i) = \frac{1}{m}\) for all items \(x\) and \(i \in \{1, \ldots, m\}\) and
\(h(x)\) and \(h(y)\) are independent random variables for all distinct \(x,y \in \mathcal{U}\).
Notice that the independence condition implies \(\Pr(h(x)=h(y)) = \frac{1}{m}\) for distinct items \(x\) and \(y\).
In general, it is not possible to efficiently implement uniform random hash functions. But, for our application, we only need a universal hash function, which can be implemented efficiently. Let \(p\) be a prime number between \(|\mathcal{U}|\) and \(2|\mathcal{U}|\). Choose \(a\) uniformly at random from \(\{1,\ldots,p-1\}\) and \(b\) uniformly at random from \(\{0,\ldots,p-1\}\), then define the hash function \[h(x) = ((a x + b) \bmod p) \bmod m.\] Check out these lecture notes to learn why \(\Pr(h(x) = h(y)) \leq \frac{1}{m}\) for this hash function. (As we’ll soon see, this is the condition we need for the algorithm.)
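As a concrete illustration, here is a minimal Python sketch of this universal hash family. The helper name make_universal_hash and the choice of a fixed large prime are assumptions for this example, and items are assumed to be representable as non-negative integers.

```python
import random

# A large fixed prime, assumed to be at least |U| (the text picks a prime
# between |U| and 2|U|; any prime that is at least |U| gives the same bound).
P = (1 << 61) - 1

def make_universal_hash(m, rng=random.Random()):
    """Return a random hash function h(x) = ((a*x + b) mod P) mod m."""
    a = rng.randrange(1, P)   # a is chosen uniformly from 1, ..., P-1
    b = rng.randrange(0, P)   # b is chosen uniformly from 0, ..., P-1
    def h(x):
        # Items are assumed to be non-negative integers smaller than P.
        return ((a * int(x) + b) % P) % m
    return h
```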
With hash functions in hand, we’re now ready to describe the algorithm. We first choose a random hash function \(h\) and initialize an \(m\)-length array \(A\) with \(0\) in every cell. Items arrive in a stream and we hash each one to one of the \(m\) cells in the array, then add 1 to the number in that cell. Formally, given item \(x_i\), we set \(A[h(x_i)] = A[h(x_i)]+1\).
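Here is a minimal sketch of the single-array update, reusing the hypothetical make_universal_hash helper from above:

```python
m = 1024                    # number of cells (we'll choose m more carefully below)
A = [0] * m                 # the array, initialized to 0 in every cell
h = make_universal_hash(m)  # one random hash function, fixed for the whole stream

def process(x):
    """Process one stream item: add 1 to the cell that x hashes to."""
    A[h(x)] += 1
```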
In the figure, Amazon products appear one after the other in a stream. We hash the fish tank to the second cell in the array, the basketball to the first cell, and so on. Crucially, when we see the fish tank again, we hash it to the same cell.
After processing the stream, our estimate for the frequency of item \(v\) is \(\hat{f}(v) = A[h(v)]\). Notice that \(\hat{f}(v) \geq f(v)\) since every time we saw item \(v\), we added 1 to the corresponding cell in the array. The estimate can exceed the true frequency because we overcount whenever other items hash to the same cell.
Formally, our estimate for the frequency is \[ \hat{f}(v) = f(v) + \sum_{y \in \mathcal{U} \setminus v} f(y) \mathbb{1}[h(y)=h(v)]. \] Our estimate for the frequency contains the true frequency plus an error term for all the other items that are hashed to the same cell.
Let’s take the expectation of the error term: \[ \mathbb{E}\left[\sum_{y \in \mathcal{U} \setminus v} f(y) \mathbb{1}[h(y)=h(v)]\right] = \sum_{y \in \mathcal{U} \setminus v} f(y) \mathbb{E}[\mathbb{1}[h(y)=h(v)]] \] \[ \leq \sum_{y \in \mathcal{U} \setminus v} f(y) \frac{1}{m} = \frac{1}{m} \sum_{y \in \mathcal{U} \setminus v} f(y) \leq \frac{n}{m}. \] We used linearity of expectation in the first equality. In the first inequality, we used that the expectation of an indicator random variable is the probability of its event, together with the collision bound \(\Pr(h(y)=h(v)) \leq \frac{1}{m}\) of our hash function. In the last inequality, we used that the frequencies sum to at most \(n\).
We have a non-negative random variable and a bound on its expectation, so let’s apply Markov’s inequality to the error term:
\[ \Pr\left( \sum_{y \in \mathcal{U} \setminus v} f(y) \mathbb{1}[h(y)=h(v)] \geq \frac{2n}{m} \right) \leq \frac{n/m}{2n/m} = \frac{1}{2}. \] So we’ve shown that our estimate \(A[h(v)]\) satisfies
\[ f(v) \leq A[h(v)] \leq f(v) + \frac{2n}{m} \] with probability at least \(\frac{1}{2}\) for any \(v\). If we set \(m = \frac{2k}{\epsilon}\), we can solve the problem with error \(\frac{\epsilon n}{k}\). But our success probability is a little low; how can we improve it?
A common approach in many randomized algorithms is to boost the success probability by repeating the core subroutine. In our case, we’ll maintain \(t\) independent hash functions and arrays.
As depicted in the figure, each item gets hashed into one cell in every array using the respective hash function. So the update on item \(x_i\) for every array \(j \in \{1, \ldots, t\}\) is \(A_j[h_j(x_i)] = A_j[h_j(x_i)] + 1\).
Then, when we’re computing our estimate for the frequency of an item, we look at each cell it appears in and take the minimum. Formally, \(\hat{f}(v) = \min_{j} A_j[h_j(v)]\). We take the minimum because each array only has one-sided error; its estimate for the frequency is never smaller than the true frequency. If our estimates instead had two-sided error, we would prefer to take the mean or median.
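Putting the pieces together, a hedged sketch of the full data structure, with \(t\) arrays, \(t\) hash functions, and the minimum rule, might look like the following (again reusing the make_universal_hash helper from earlier; all names are illustrative):

```python
import random

class CountMinSketch:
    """t arrays of m counters, each with its own random hash function."""

    def __init__(self, m, t, rng=random.Random()):
        self.arrays = [[0] * m for _ in range(t)]
        self.hashes = [make_universal_hash(m, rng) for _ in range(t)]

    def update(self, x):
        # Add 1 to x's cell in every array: A_j[h_j(x)] += 1.
        for A, h in zip(self.arrays, self.hashes):
            A[h(x)] += 1

    def estimate(self, v):
        # Every array overestimates f(v), so the minimum is the tightest estimate.
        return min(A[h(v)] for A, h in zip(self.arrays, self.hashes))
```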
For every array \(j\) and item \(v\), when we set \(m=\frac{2k}{\epsilon}\), we know that \(f(v) \leq A_j[h_j(v)] \leq f(v) + \frac{\epsilon n}{k}\) with probability at least \(\frac{1}{2}\). Let’s compute the probability that our final estimate has this much error.
Call the event that the frequency estimate has error more than \(\frac{\epsilon n}{k}\) a failure. Our final estimate fails only if every single one of the \(t\) arrays independently fails, so \[ \Pr \left( \hat{f}(v) > f(v) + \frac{\epsilon n}{k} \right) \leq \left( \frac{1}{2} \right)^t. \] If we set \(t=\log_2(1/\delta)\) for some small constant \(\delta > 0\), then the failure probability is at most \(\delta\). The success probability, the complement of the failure probability, is at least \(1-\delta\).
Putting it all together, we just proved that the count-min sketch lets us estimate the frequency of a given item in the stream up to error \(\frac{\epsilon n}{k}\) with probability at least \(1-\delta\) in \(O(mt) = O(\log(1/\delta) \frac{k}{\epsilon})\) space. The first factor in the space complexity comes from the number of arrays \(t\) and the second factor comes from the space \(m\) per array.
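For a rough sense of scale, suppose \(k = 100\), \(\epsilon = 0.1\), and \(\delta = 10^{-6}\). Then each array has \(m = \frac{2k}{\epsilon} = 2000\) cells, we keep \(t = \log_2(1/\delta) \approx 20\) arrays, and the entire sketch is roughly \(40{,}000\) counters, regardless of how long the stream is or how many distinct items it contains.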
However, this guarantee is only for a single item \(v\). We really want a guarantee for all items.
Another simple and powerful tool in randomized algorithm design is the union bound. The union bound lets us bound the probability that at least one of many random events occurs, without needing to understand how the events interact.
Lemma: For any random events \(E_1, E_2, \ldots, E_k\),
\[ \Pr(E_1 \cup E_2 \cup \ldots \cup E_k) \leq \Pr(E_1) + \Pr(E_2) + \ldots + \Pr(E_k). \]
Proof: We’ll give a “proof by picture”. Each circle represents the outcome space corresponding to event \(E_i\). The total outcome space covered by the overlapping events is at most the outcome space they would cover if they did not overlap.
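For readers who prefer algebra to pictures, the two-event case follows from inclusion–exclusion, \[ \Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2) \leq \Pr(E_1) + \Pr(E_2), \] and induction on the number of events gives the general statement. Notice that the bound makes no assumption about whether the events are independent.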
Let’s apply the union bound to the total failure probability of our estimates. The algorithm fails if \(\hat{f}(v_i) > f(v_i) + \frac{\epsilon n }{k}\) for any item \(v_i\). By the union bound, \[ \Pr \left( \textrm{fail for } v_1 \cup \textrm{fail for } v_2 \cup \ldots \cup \textrm{fail for } v_{|\mathcal{U}|} \right) \] \[ \leq \Pr(\textrm{fail for } v_1) + \Pr(\textrm{fail for } v_2) + \ldots + \Pr(\textrm{fail for } v_{|\mathcal{U}|}) \] \[ \leq \delta + \delta + \ldots + \delta = |\mathcal{U}| \delta \leq n \delta. \] Let’s set \(\delta=\frac{1}{10n}\). With probability at least \(9/10\), the count-min sketch lets us estimate the frequency of all items in the stream up to error \(\frac{\epsilon n}{k}\). Recall this is accurate enough to solve the frequent items estimation problem if we just return all items \(v\) with estimated frequency at least \(\frac{n}{k}\).
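To close the loop, here is a hedged end-to-end sketch that picks the parameters from the analysis and returns the candidate frequent items. The helper frequent_items is illustrative only; in particular, tracking every distinct item in a Python set uses space linear in the number of unique items, whereas a true streaming implementation would instead maintain a small heap of items whose current estimate is at least \(\frac{n}{k}\).

```python
import math

def frequent_items(stream, n, k, eps):
    """Return all items whose estimated frequency is at least n/k."""
    m = math.ceil(2 * k / eps)         # cells per array: m = 2k/eps
    t = math.ceil(math.log2(10 * n))   # arrays: t = log2(1/delta) with delta = 1/(10n)
    cms = CountMinSketch(m, t)
    candidates = set()                 # illustrative only; this set is not sublinear in space
    for x in stream:
        cms.update(x)
        candidates.add(x)
    return [v for v in candidates if cms.estimate(v) >= n / k]
```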