Frequency Estimation in Streaming Data?

Streaming Data is data that is generated continuously by thousands or millions of data sources, such as:

  • Internet traffic data across the web or within data centers of large companies
  • Search queries entered by users around the world on search engines
  • Information from social networks, log files from mobile or web appliacations, etc

The goal of Frequency Estimation is to count the number of times an item appears in the stream.

  • Which internet links have the most traffic?
  • Which search phrases are trending now?
  • What are the trending topics on the social network?

However, in big data applications, the stream is too large (and may be infinite) and cannot be stored.
This challenge has motivated the development of streaming algorithms.


LearnedSketch

We propose a new class of learning-based algorithms that

  • learn relevant "structures" in the input data to improve its frequency estimates of future items (possibly unseen)
  • combine the benefits of machine learning with the formal guarantees from classical algorithms
  • provably have lower estimation errors than their non-learning counterparts
  • demonstrate their performance gains on two real-world datasets

We use neural networks as our ML models to learn structures in the data automatically.


Comparison to classical algorithms and baselines

Testing on future data stream not seen by the ML model:

  • Compared to their non-learning counterparts (Count Min, Count Sketch), our algorithms (Learned CM - NNet and Learned CS - NNet) reduce estimation errors by 18% to 71%.
  • Compared to memorizing the popular items with a lookup table, our model generalizes better to unseen items.
  • Our scheme can potentially achieve even better results with an ideal oracle (Learned CM - Ideal).

LCM
LCS

What are the structures learned by the neural networks?

We visualize the embedding space learned by our neural network on internet traffic data:

  • The model separates connections with heavy and light traffic in the embedding space (left figure).
  • The model learns to group connections with similar destination IP prefixes closer in the embedding space (right figure).
  • Learning this "structure" allows the model to generalize to packets unseen in the training set.

LCM
LCS

Paper

Learning-Based Frequency Estimation Algorithms
Chen-Yu Hsu, Piotr Indyk, Dina Katabi, Ali Vakilian
International Conference on Learning Representations (ICLR) 2019
[PAPER]

Press

LearnedSketch was covered by: Forbes, The Next Web, VentureBeat, MIT news, and other media outlets.

MIT 6.890

Also check out the new class offered in Spring 2019 at MIT on Learning-Augmented Algorithms!
[Course Website]

Funded by the National Science Foundation (TRIPODS program)