Heap is a web and iOS analytics tool that automatically captures every user interaction, eliminating the need to define events upfront and allowing for flexible, retroactive analysis.
When we had the idea for Heap, it wasn’t clear whether its underlying tech would be financially tenable.
Plenty of existing tools captured every user interaction, but none offered much beyond rigid, pre-generated views of the underlying data. And plenty of tools allowed for flexible analysis (funnels, segmentation, cohorts), but only by operating on pre-defined events that represent a small subset of overall usage.
To our knowledge, no one had built: 1) ad-hoc analysis, 2) across a userbase’s entire activity stream. This was intimidating. Before we started coding, we needed to estimate an upper-bound on our AWS costs with order-of-magnitude accuracy. Basically: “Is there a sustainable business model behind this idea?”
To figure this out, we started with the smallest unit of information: a user interaction.
Estimating Data Throughput
Every user interaction triggers a DOM event. We can model each DOM event as a JSON object:
{
  referrer: '...',
  url: '...',
  type: 'click',
  target: 'div#gallery div.next',
  timestamp: 845,
  ...
}
With all the properties Heap captures, a raw event occupies ~1 kB of space.
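As a rough sanity check on that figure, you can serialize a representative event and measure its payload. This is an illustrative sketch, not Heap's actual instrumentation: the fields mirror the example above, and the string length approximates bytes for ASCII-heavy JSON.

// Serialize a representative captured event and measure its size.
var sampleEvent = {
  referrer: document.referrer,
  type: 'click',
  target: 'div#gallery div.next',
  timestamp: Date.now()
  // ...plus whatever other properties get attached to each event
};
var payload = JSON.stringify(sampleEvent);
console.log('Approximate event size: ' + payload.length + ' bytes');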
Our initial vision for Heap was to offer users unadulterated, retroactive access to the DOM event firehose. If you could bind an event handler to it, we wanted to capture it. To estimate the rate of DOM event generation, we wrote a simple script:
var start = Date.now(), eventCount = 0;

for (var k in window) {
  // Find all DOM events we can bind a listener to
  if (k.indexOf('on') === 0) {
    window.addEventListener(k.slice(2), function (e) { eventCount++; });
  }
}

setInterval(function () {
  var elapsed = (Date.now() - start) / 1000;
  console.log('Average events per second: ' + eventCount / elapsed);
}, 1000);
Try it out yourself. With steady interaction, you’ll generate ~30 DOM events per second. Frenetic activity nets ~60 events per second. That’s a lot of data, and it presents an immediate bottleneck: client-side CPU and network overhead.
Luckily, this activity mostly consists of low-signal data: mousemove, mouseover, keypress, etc. Customers don’t care about these events, nor can they meaningfully quantify them. By restricting our domain to high-signal events – click, submit, change, pushState events, page views – we can reduce our throughput by almost two orders of magnitude with negligible impact on data fidelity.
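Here’s a sketch of what that narrower capture might look like: instead of binding to every on* property, we listen only for a handful of high-signal DOM events. The recordEvent helper is a hypothetical stand-in, and pushState/page-view tracking is only noted in a comment; this isn’t Heap’s actual implementation.

// Illustrative only: capture a few high-signal events instead of the full firehose.
// recordEvent stands in for whatever serializes and ships the event.
function recordEvent(name, e) {
  console.log('captured ' + name + ' on ', e.target);
}

var highSignal = ['click', 'submit', 'change', 'popstate'];
highSignal.forEach(function (name) {
  // Use the capture phase so we still see events whose propagation gets stopped.
  window.addEventListener(name, function (e) { recordEvent(name, e); }, true);
});

// Page views and pushState navigations aren't DOM events; they'd need separate
// hooks (e.g. wrapping history.pushState).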
With this subset of events, we found via manual testing that sessions rarely generate more than 1 event per second, which we can use as a comfortable upper-bound. And how long does an average session last? In 2011, Google Analytics published aggregate usage benchmarks, and its latest figures put the average session at about 5 minutes and 23 seconds (323 seconds).
Note that the estimate above is the most brittle step of our analysis. It fails to account for the vast spectrum in activity across different classes of apps (playing a game of Cookie Clicker is more input-intensive than reading an article on The Economist). But we’re not striving for perfect accuracy. We just need to calculate an upper-bound on cost that’s within the correct order of magnitude.
By multiplying the values above, we find that a typical web session generates 323 kB of raw, uncompressed data.
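Spelled out as a quick back-of-envelope calculation using the estimates above:

// Back-of-envelope: data volume generated by a typical web session.
var eventsPerSecond = 1;            // upper-bound rate for high-signal events
var sessionSeconds  = 5 * 60 + 23;  // ~5m23s average session (Google Analytics, 2011)
var kbPerEvent      = 1;            // ~1 kB per raw, uncompressed event
console.log(eventsPerSecond * sessionSeconds * kbPerEvent + ' kB per session'); // 323 kB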
Architectural Assumptions and AWS
We have a sense of the total data generated by a session, but we don’t know the underlying composition. How much of this data lives in RAM? On SSD? On spinning disks?