I was looking at some performance metrics around file upload the other day and noticed some really large outliers. Some operations were reportedly taking over 20 hours! After digging through the data, I noticed that there were several data points at 20 hours, all coming from the same user.
This was suspicious to me, as if all operations had halted on the user's computer and restarted at around the same time. One possible explanation was that the computer fell asleep: the CPU halted, and upon waking it resumed and finished the operation. We were using performance.now() to measure how long operations took, and I expected it to measure CPU time (since Date.now() measures Unix time).
Turns out that is not the case! According to the docs, performance.now() measures "time elapsed since Performance.timeOrigin, which is the time when navigation has started in window contexts." There's a specific callout in the documentation that the performance.now() specification requires that it also tick during sleep.
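To make the timeOrigin relationship concrete, here's a minimal sketch (my own, not from the docs) that assumes a Node 16+ or browser environment where performance is a global. Since performance.now() is measured relative to performance.timeOrigin, adding the two should land close to Date.now(), the wall-clock Unix time:

```javascript
// performance.now() is relative to performance.timeOrigin, so the sum
// should approximate the wall clock.
const elapsed = performance.now()               // fractional ms since timeOrigin
const wallClock = performance.timeOrigin + elapsed

// The two clocks should agree to within a small margin (system clock
// adjustments can introduce slight drift between them).
const drift = Math.abs(wallClock - Date.now())
console.log(drift < 1000)
```

This is also why a 20-hour sleep shows up in the measurement: the monotonic clock keeps ticking relative to timeOrigin even while the CPU is suspended.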
💡 You might be wondering: why use performance.now() instead of Date.now()? The difference is precision. performance.now() returns a floating-point number with up to microsecond precision, whereas Date.now() measures at one-millisecond resolution.
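A quick illustration of the resolution difference (my own sketch, not the author's code):

```javascript
// Date.now() returns an integer count of milliseconds since the Unix epoch;
// performance.now() returns a fractional millisecond count since timeOrigin.
const coarse = Date.now()        // whole milliseconds, e.g. 1718000000123
const fine = performance.now()   // fractional milliseconds, e.g. 4523.3

console.log(Number.isInteger(coarse))  // Date.now() is always whole ms
console.log(typeof fine)               // a plain number, possibly fractional
```

One caveat: browsers deliberately coarsen performance.now() (often to 100µs or worse) to mitigate timing attacks, so "up to microsecond precision" is an upper bound, not a guarantee.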
What do I do now? My goal is to get a high-signal metric on how long these operations take while the CPU is running at full speed. My first guess was to eliminate samples that were "way too high," but it was difficult to pick a threshold that didn't also eliminate numbers representative of the real user experience. While reading about performance.now() online, I ran into a blog post by Conrad Irwin (co-founder of Superhuman) which mentions that you can ignore metrics where, between the beginning and end of the measurement, the tab or window becomes backgrounded (the user switches tabs, or the computer goes to sleep). Note that browsers also lower the priority of operations in a non-foreground tab, so operations there appear to run slower. The code snippet from the blog post:
let lastVisibilityChange = 0
window.addEventListener('visibilitychange', () => {
  lastVisibilityChange = performance.now()
})

// don't log any metrics started before the last visibility change,
// or if the page is currently hidden
if (metric.start < lastVisibilityChange || document.hidden) return
I decided that for our implementation, I would continue emitting all the metrics, but tag any metric that had occurred while the CPU wasn't running at full speed, so we could see the difference and filter on that tag. This indeed fixed the outlier issue, and I was able to get a much higher-signal metric that we could depend on.
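The tag-instead-of-drop approach might be sketched like this (names and structure are hypothetical, not our actual implementation):

```javascript
// Track the last visibility change, as in the snippet above. In a browser
// you would wire this to the real event:
//   window.addEventListener('visibilitychange', () => {
//     lastVisibilityChange = performance.now()
//   })
let lastVisibilityChange = 0

// Hypothetical helper: emit every metric, but mark any whose measurement
// window may have overlapped a sleep or backgrounded period.
function finishMetric(name, start) {
  const end = performance.now()
  return {
    name,
    durationMs: end - start,
    // Tag rather than drop: true when a visibility change happened after the
    // metric started, or the page is currently hidden, i.e. the CPU may not
    // have been running at full speed for the whole measurement.
    degraded:
      start < lastVisibilityChange ||
      (typeof document !== 'undefined' && document.hidden),
  }
}

// Hypothetical usage:
const start = performance.now()
// ... do the work being measured ...
const metric = finishMetric('file-upload', start)
console.log(metric.degraded)
```

Keeping the degraded samples in the pipeline means dashboards can still count how often users hit sleep/background conditions, while the headline latency numbers filter on `degraded: false`.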