As we all know, measuring things is a good way to get concrete information. Now that Firefox CI is fully on Taskcluster, it seemed like a good opportunity to measure and see what we can learn about timing in localization tasks.
The code that generated the following charts and data is viewable in my Github repo, though as of this writing it is rough and was modified manually to generate the related graphs in various configurations. I'm working on making it more general purpose and adding proper testing around the logic.
One of the first things I looked at was per-task duration for the actual nightly-l10n task on beta/release by Gecko version per platform. [Note: Android is missing on these graphs because we do not do Android Single Locale on Beta/Release anymore, so we don’t have metrics to compare] (graphs and takeaways after the jump)
With this graph, it is clear both that we made some general improvements between Gecko 59 and Gecko 60, and that Windows takes significantly longer than OSX/Linux (both of the latter run on Linux hosts).
Nothing in this graph was all that surprising to me, and it meshed with my preconceived understanding of the current state.
Next I wondered if the data would look different if I broke the tasks down to "per locale" timing, because the number of locales run in a single task, while roughly uniform for a given release, can vary from release to release, especially as we add new locales or change chunking values, etc.
This was interesting in that it shows the per-locale time amounts to just under 10 minutes per locale on Linux, and at least double that on Windows. It also appears there is a slight regression for Windows on beta, but with the variability in earlier Windows tasks it's hard to conclude much (though we did reduce some of that variability in the 61/62 releases).
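The per-locale breakdown above is conceptually just each task's wall-clock duration divided by the number of locales it ran. A minimal pandas sketch of that step might look like the following; note the column names (`duration`, `locale_count`) are my assumptions here, not necessarily the actual schema used in the repo.

```python
# Hypothetical sketch: divide each task's duration by its locale count,
# since chunking means locale counts vary from task to task.
import pandas as pd

def per_locale_durations(tasks: pd.DataFrame) -> pd.Series:
    """Wall-clock seconds per locale for each task row."""
    return tasks["duration"] / tasks["locale_count"]

# Example: one task that ran 5 locales, one that ran 2.
tasks = pd.DataFrame({
    "duration": [2700.0, 1200.0],   # seconds
    "locale_count": [5, 2],
})
print(per_locale_durations(tasks).tolist())  # [540.0, 600.0]
```

This is why the per-locale metric is more comparable across releases than raw task time: it normalizes away chunking changes.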
Moving over to the Nightly side of things for a closer look, I realized that plot.ly's public charting wouldn't let me do the nightly graphs for all platforms at once (too many data points!), so I resorted to a bit of back and forth locally, but eventually made a full-nightly graph of all Windows tasks, split by week, per task. [This one I was eventually able to publish to the plotly public api]
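Splitting by week is a straightforward way to cut the point count down to something plot.ly's public tier will accept. A hedged sketch of that bucketing in pandas follows; the `started` timestamp column and its format are assumptions on my part:

```python
# Hypothetical sketch: bucket task durations by the week they started,
# averaging within each bucket to reduce the number of plotted points.
import pandas as pd

def weekly_mean_duration(tasks: pd.DataFrame) -> pd.Series:
    """Mean task duration per calendar week of start time."""
    tasks = tasks.copy()
    tasks["started"] = pd.to_datetime(tasks["started"])
    return tasks.set_index("started")["duration"].resample("W").mean()

tasks = pd.DataFrame({
    "started": ["2018-06-04", "2018-06-05", "2018-06-12"],
    "duration": [600.0, 660.0, 900.0],
})
print(weekly_mean_duration(tasks).tolist())  # [630.0, 900.0]
```

One point per platform per week is coarse enough for the public API while still making step changes, like the ones discussed next, easy to spot.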
With this one you’ll notice the obvious task regression near the end of 2017 and the smaller one near the end of the Gecko 62 cycle…
I spent a bit of time looking into what happened at the end of 2017, but it turns out this wasn't really a regression so much as a stabilization. Prior to that, we had run 2 locales per task to work around some signing timing issues we hit early in the tc-migration work, and then decided it was ok to bump to ~5 locales per task. This, of course, increases our per-task time significantly; however, as you can see on my next graph, it doesn't actually change the real metric much… (per-locale graph)
Unlike the first regression (end of 2017), the "just before Gecko 62" regression is still present in this graph. In offline mode (sorry, no plotly link for these) I dug deeper to see what could be causing it… (the following graph has all platforms as my own mini sanity check)
Once I did some data-mining on my raw data, I found the changesets and taskIDs involved and traced it down to an actual regression in hg clone/update times, and filed Bug 1474159. That resulted in the OCC (Open Cloud Config) Pull Request that is currently awaiting landing and deployment.
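The data-mining step here amounted to grouping the raw timings by changeset in push order and watching for the point where durations jump. A minimal sketch of that idea is below; the column names and the helper itself are hypothetical, not the actual code from the repo:

```python
# Hypothetical sketch: median task duration per changeset, in push order,
# so a step change stands out at the offending changeset.
import pandas as pd

def median_by_changeset(tasks: pd.DataFrame) -> pd.Series:
    """Median duration per changeset; sort=False keeps push order."""
    return tasks.groupby("changeset", sort=False)["duration"].median()

tasks = pd.DataFrame({
    "changeset": ["aaa111", "aaa111", "bbb222", "bbb222"],
    "duration": [600.0, 620.0, 900.0, 910.0],
})
print(median_by_changeset(tasks).tolist())  # [610.0, 905.0]
```

Using the median rather than the mean keeps one unusually slow task (a retried worker, say) from masking where the real step change happened.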
If I had to take one thing away from this mini exercise, it's that there is certainly value in doing these types of graphs (and more), and that we should invest in more metrics like this going forward and keep them up, where possible.
On the personal front, I've picked up a few new tools for my own toolbelt here, such as Plotly and the Python package pandas. I'll see how we can put these tools to better use in the coming months.