Our initial approach to home timelines, which used the existing, fully-normalized data structure that we've already built, is technically viable but will perform very poorly at scale. If I follow F users and want a page of size P for my home timeline, Cassandra will need to do the following:
- Query F partitions for P rows, each
- Perform an ordered merge of FxP rows in order to retrieve only the most recent P
The most distressing part of this is the fact that both operations grow in complexity proportionally with the number of people I follow. Let's start by trying to fix this.
The basic goal of the home timeline is to show me the most recent status updates that matter to me. Instead of doing all the work to find out what status updates matter to me, based on whom I follow, at read time, let's shift some of the work to write time.
I'll create a table that stores references...