Jonasfj.dk/Blog
A blog by Jonas Finnemann Jensen


April 1, 2015
Playing with Talos in the Cloud
Filed under: Computer,English,Linux,Mozilla by jonasfj at 9:30 am

As part of my goals this quarter I’ve been experimenting with running Talos in the cloud (Linux only). There are many valid reasons why we’re not already doing this. Conventional wisdom dictates that virtualized resources running on hardware shared between multiple users are unlikely to have a consistent performance profile; hence, regression detection becomes unreliable.

Another reason for not running performance tests in the cloud is that a cloud server is very different from a consumer laptop, and changes in performance characteristics may not reflect the end-user experience.

But when all the reasons for not running performance testing in the cloud have been listed (and I’m sure my list above wasn’t exhaustive), there certainly are some benefits to using the cloud; on-demand scalability and cost immediately spring to mind. So investigating the possibility of running Talos in the cloud is interesting; if nothing more, it could be used for fast smoke tests.

Comparing Consistency of Instance Types

The first thing to evaluate is the consistency of results depending on instance-type, cloud provider and configuration. For the purpose of these experiments I have chosen the following instance-types and cloud providers:

  • AWS EC2 (m3.medium, m3.xlarge, m3.2xlarge, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.2xlarge, r3.large, r3.xlarge, g2.2xlarge)
  • Azure (A1, A2, A3, A4, D1, D2, D3, D4)
  • Digital Ocean (1g-1cpu, 2g-2cpu, 4g-2cpu, 8g-4cpu)

For AWS I tested instances in both us-east-1 and us-west-1 to see if there was any difference in results. In each case I have been using two revisions: c448634fb6c9, which doesn’t have any regressions, and fe5c25b8b675, which has clear regressions in the test suites cart and tart. In each case I also ran the tests with both xvfb and xorg configured with dummy video and input drivers.

To ease deployment and ensure that I was using the exact same binaries across all instances, I packaged Talos as a docker image. This also ensured that I could reset the test environment after each Talos invocation (a sketch of this setup follows the list below). Talos was invoked to run as many of the test suites as I could get working, but for the purpose of this evaluation I’m only considering results from the following suites:

  • tp5o,
  • tart,
  • cart,
  • tsvgr_opacity,
  • tsvgx,
  • tscrollx,
  • tp5o_scroll, and
  • tresize
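
For reference, here is a minimal Python sketch of how each Talos invocation can be isolated in a throw-away docker container so that every run starts from a clean environment. The image name and entrypoint arguments are made up for illustration; the actual image and options I used differ.

    # Minimal sketch: run one Talos suite in a throw-away container.
    # The image name and entrypoint arguments are hypothetical.
    import subprocess

    def run_talos_suite(image, suite, revision, out_dir):
        # --rm discards the container after the run, so every invocation
        # starts from the exact same environment baked into the image.
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{out_dir}:/results",   # collect result files on the host
            image,
            "--suite", suite,
            "--revision", revision,
        ]
        subprocess.run(cmd, check=True)

    # e.g. run_talos_suite("talos-worker", "tart", "fe5c25b8b675", "/tmp/results")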

After running all these test suites for all the configurations of instance type, region and display server enumerated above, we have a lot of data-points of the form results(cfg, rev, case) = (r1, r2, ..., rn), where ri is the measurement from the i’th iteration of the Talos test case case.

To compare all this data with the aim of ranking configurations by the consistency of their results, we compute rank(cfg, rev, case) as the number of configurations cfg' for which the standard deviation of results(cfg', rev, case) is lower than the standard deviation of results(cfg, rev, case). Informally, we sort configurations by lowest standard deviation for a given case and rev, and the index of a configuration in that sorted list is its rank rank(cfg, rev, case) for the given case and rev.

We then finally list configurations by score(cfg), which we compute as the mean of all ranks for the given configuration. Formally we write:

score(cfg) = mean({rank(cfg, rev, case) | for all rev, case})

Credit for this methodology goes to Roberto Vitillo, who also suggested using a trimmed mean, but as it turns out the ordering is pretty much the same.
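
To make the ranking concrete, here is a minimal Python sketch of the computation. The layout of the results dictionary is an assumption; it simply maps (cfg, rev, case) to the list of per-iteration measurements.

    import statistics
    from collections import defaultdict

    def score_configurations(results):
        # results: dict mapping (cfg, rev, case) -> [r1, r2, ..., rn]
        # Group standard deviations by (rev, case) so configurations can be
        # compared against each other for the same revision and test case.
        stddevs = defaultdict(dict)          # (rev, case) -> {cfg: stddev}
        for (cfg, rev, case), measurements in results.items():
            stddevs[(rev, case)][cfg] = statistics.stdev(measurements)

        # rank(cfg, rev, case) = number of configurations with a lower stddev
        ranks = defaultdict(list)            # cfg -> list of ranks
        for by_cfg in stddevs.values():
            for rank, cfg in enumerate(sorted(by_cfg, key=by_cfg.get)):
                ranks[cfg].append(rank)

        # score(cfg) = mean of all ranks (a trimmed mean could be used instead)
        return {cfg: statistics.mean(r) for cfg, r in ranks.items()}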

When listing configurations by the score computed above we get the following ordered list of configurations. Notice that the score is strictly relative and doesn’t really say much; the interesting aspect is the ordering.

Warning: the score and ordering have nothing to do with performance. This strictly considers consistency of performance from a Talos perspective. This is not a comparison of cloud performance!

Provider:       InstanceType:   Region:     Display:  Score:
aws,            c4.large,       us-west-1,  xorg,     11.04
aws,            c4.large,       us-west-1,  xvfb,     11.43
aws,            c4.2xlarge,     us-west-1,  xorg,     12.46
aws,            c4.large,       us-east-1,  xorg,     13.24
aws,            c4.large,       us-east-1,  xvfb,     13.73
aws,            c4.2xlarge,     us-west-1,  xvfb,     13.96
aws,            c4.2xlarge,     us-east-1,  xorg,     14.88
aws,            c4.2xlarge,     us-east-1,  xvfb,     15.27
aws,            c3.large,       us-west-1,  xorg,     17.81
aws,            c3.2xlarge,     us-west-1,  xvfb,     18.11
aws,            c3.large,       us-west-1,  xvfb,     18.26
aws,            c3.2xlarge,     us-east-1,  xvfb,     19.23
aws,            r3.large,       us-west-1,  xvfb,     19.24
aws,            r3.large,       us-west-1,  xorg,     19.82
aws,            m3.2xlarge,     us-west-1,  xvfb,     20.03
aws,            c4.xlarge,      us-east-1,  xorg,     20.04
aws,            c4.xlarge,      us-west-1,  xorg,     20.25
aws,            c3.large,       us-east-1,  xorg,     20.47
aws,            c3.2xlarge,     us-east-1,  xorg,     20.94
aws,            c4.xlarge,      us-west-1,  xvfb,     21.15
aws,            c3.large,       us-east-1,  xvfb,     21.25
aws,            m3.2xlarge,     us-east-1,  xorg,     21.67
aws,            m3.2xlarge,     us-west-1,  xorg,     21.68
aws,            c4.xlarge,      us-east-1,  xvfb,     21.90
aws,            m3.2xlarge,     us-east-1,  xvfb,     21.94
aws,            r3.large,       us-east-1,  xorg,     25.04
aws,            g2.2xlarge,     us-east-1,  xorg,     25.45
aws,            r3.large,       us-east-1,  xvfb,     25.66
aws,            c3.xlarge,      us-west-1,  xvfb,     25.80
aws,            g2.2xlarge,     us-west-1,  xorg,     26.32
aws,            c3.xlarge,      us-west-1,  xorg,     26.64
aws,            g2.2xlarge,     us-east-1,  xvfb,     27.06
aws,            c3.xlarge,      us-east-1,  xvfb,     27.35
aws,            g2.2xlarge,     us-west-1,  xvfb,     28.67
aws,            m3.xlarge,      us-east-1,  xvfb,     28.89
aws,            c3.xlarge,      us-east-1,  xorg,     29.67
aws,            r3.xlarge,      us-west-1,  xorg,     29.84
aws,            m3.xlarge,      us-west-1,  xvfb,     29.85
aws,            m3.xlarge,      us-west-1,  xorg,     29.91
aws,            m3.xlarge,      us-east-1,  xorg,     30.08
aws,            r3.xlarge,      us-west-1,  xvfb,     31.02
aws,            r3.xlarge,      us-east-1,  xorg,     32.25
aws,            r3.xlarge,      us-east-1,  xvfb,     32.85
mozilla-inbound-non-pgo,                              35.86
azure,          D2,                         xvfb,     38.75
azure,          D2,                         xorg,     39.34
aws,            m3.medium,      us-west-1,  xvfb,     45.19
aws,            m3.medium,      us-west-1,  xorg,     45.80
aws,            m3.medium,      us-east-1,  xvfb,     47.64
aws,            m3.medium,      us-east-1,  xorg,     48.41
azure,          D3,                         xvfb,     49.06
azure,          D4,                         xorg,     49.89
azure,          D3,                         xorg,     49.91
azure,          D4,                         xvfb,     51.16
azure,          A3,                         xorg,     51.53
azure,          A3,                         xvfb,     53.39
azure,          D1,                         xorg,     55.13
azure,          A2,                         xvfb,     55.86
azure,          D1,                         xvfb,     56.15
azure,          A2,                         xorg,     56.29
azure,          A1,                         xorg,     58.54
azure,          A4,                         xorg,     59.05
azure,          A4,                         xvfb,     59.24
digital-ocean,  4g-2cpu,                    xorg,     61.93
digital-ocean,  4g-2cpu,                    xvfb,     62.29
digital-ocean,  1g-1cpu,                    xvfb,     63.42
digital-ocean,  2g-2cpu,                    xorg,     64.60
digital-ocean,  1g-1cpu,                    xorg,     64.71
digital-ocean,  2g-2cpu,                    xvfb,     66.14
digital-ocean,  8g-4cpu,                    xvfb,     66.53
digital-ocean,  8g-4cpu,                    xorg,     67.03

You may notice that the list above also contains the configuration mozilla-inbound-non-pgo, which has results from our existing infrastructure. It is interesting to see that instances with high CPU exhibit lower standard deviation. This could be because their average run-time is lower, so the standard deviation is also lower. It could also be because they consist of more high-end hardware, SSD disks, etc. Higher CPU instances could also be producing better results because they always have CPU time available.

However, it’s interesting that both Azure and Digital Ocean instances appear to produce much less consistent results, even their high-performance instances. Surprisingly, the data from mozilla-inbound (our existing infrastructure) doesn’t appear to be very consistent. Granted, that could just be a bad run; we would need to try more revisions to say anything conclusive about that.

Unsurprisingly, it doesn’t really seem to matter what AWS region we use, which is nice because it just makes our lives that much simpler. Nor does the choice between xorg or xvfb seem to have any effect.

Comparing Consistency Between Instances

Having identified the Amazon c4 and c3 instance-types as the most consistent classes, we now proceed to investigate whether results are consistent when they are computed using different instances of the same type. It’s well known that EC2 has bad apples (individual machines that perform badly), but this is a natural thing in any large setting. What we are interested in here is what happens when we compare results from different instances.

To do this we take the two revisions c448634fb6c9 which doesn’t have any regressions and fe5c25b8b675 which does have a regression in cart and tart. We run Talos tests for both revisions on 30 instances of the same type. For this test I’ve limited the instance-types under consideration to c4.large and c3.large.

After running the tests we now have results of the form results(cfg, inst, rev, suite, case) = (r1, r2, ..., rn), where ri is the result from the i’th iteration of the given test case under the given test suite, revision, configuration and instance. In the previous section we didn’t care which suite a test case belonged to. We care about the suite relationship here because we compute the geometric mean of the medians of all test cases per suite. Formally we write:

score(cfg, inst, rev, suite) = geometricMean({median(results(cfg, inst, rev, suite, case)) | for all case})

Credits to Joel Maher for helping figure out how the current infrastructure derives a per-suite performance score for a given revision.
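
In code, the per-suite score looks roughly like the sketch below; the layout of case_results is an assumption and simply maps each test case in the suite to its list of per-iteration measurements for one (cfg, inst, rev) combination.

    import statistics
    from math import exp, log

    def suite_score(case_results):
        # case_results: dict mapping case -> [r1, r2, ..., rn]
        medians = [statistics.median(rs) for rs in case_results.values()]
        # Geometric mean of the per-case medians.
        return exp(sum(log(m) for m in medians) / len(medians))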

We then plot the scores for all instances as two bar-chart series, one for each revision, and get the following plots. I’ve only included 3 here for brevity. Each pair of bars shows the results from one instance on the two revisions; the ordering of instances is not relevant.

From these two plots it’s easy to see that there is a tart regression. Clearly, we can also see that performance characteristics do vary between instances. Even in the case of tart it’s evident, but the regression is still easy to see.

Now, when we consider the chart for tresize, it’s very clear that performance differs between machines, and if a regression here were small, it would be hard to see. Most of the other charts are somewhat similar; I’ve posted a link to all of them below, along with references to the very sketchy scripts and hacks I’ve employed to run these tests.
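
For reference, plots of this kind only take a few lines of matplotlib; the sketch below assumes the per-suite scores for the two revisions have already been computed, one score per instance, in the same instance order.

    import matplotlib.pyplot as plt

    def plot_suite(suite, scores_base, scores_new):
        # scores_base / scores_new: one per-suite score per instance, same order.
        x = range(len(scores_base))
        width = 0.4
        plt.bar([i - width / 2 for i in x], scores_base, width, label="c448634fb6c9")
        plt.bar([i + width / 2 for i in x], scores_new, width, label="fe5c25b8b675")
        plt.xlabel("instance")
        plt.ylabel("suite score")
        plt.title(suite)
        plt.legend()
        plt.show()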

Next Steps

While it’s hard to conclude anything definitive without more data, it seems that the c4 and c3 instance-types offer fairly consistent results. I think the next step is to set up a subset of Talos tests running silently alongside existing tests while comparing results to regressions observed elsewhere.

Hopefully it should be possible to use a small subset of Talos tests to detect some regressions early, rather than having all Talos regressions detected 12 pushes later. Setting this up is not going to be a Q2 goal for me, but I should be able to set it up on TaskCluster in no time. At this point I think it’s mostly a configuration issue, since I already have Talos running under docker.

The hard part is analyzing the resulting data and detecting regressions based on it. I tried comparing results with approaches like Student’s t-test, but there are still noisy tests that have to be filtered out, although preliminary findings were promising. I suspect it might be easiest to employ some naive form of machine learning and hope that magically solves everything. But we might not have enough training data.
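
As an example of the kind of naive comparison I tried, here is a sketch using Welch’s variant of the t-test from scipy. The threshold is arbitrary, and as noted above, this alone still lets noisy tests through.

    from scipy import stats

    def looks_like_regression(baseline, candidate, alpha=0.01):
        # baseline / candidate: per-iteration measurements for one test case
        # on the two revisions. Welch's t-test doesn't assume equal variances.
        t, p = stats.ttest_ind(candidate, baseline, equal_var=False)
        # t > 0 means the candidate mean is higher, i.e. slower for timing tests.
        return p < alpha and t > 0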

4 Comments »

  1. Maybe I’ve missed it, but did you compare the consistency of the cloud instances to that of the current talos infrastructure?

    Different tests have different consistencies/noise-levels, and while finding the most consistent cloud platform is important, it’s also important to compare those to the consistency levels we have now, preferably per test.

    Comment by avih — April 1, 2015 @ 12:48 pm

  2. @avih,
    I don’t have enough data from current talos infrastructure to say anything conclusive.
    But if you look at the list of instance-types ordered by score, you’ll see that I include
    “mozilla-inbound-non-pgo”,
    which is the results from the two revisions as computed by the current talos infrastructure. I believe it ran with the same binaries too (I stole the data from datazilla).

    But yeah, it would be interesting to also run those two revisions 30 times on our current talos infrastructure to see how the tests behave, compared to how they behave on c3.large and c4.large.

    Note: I’ve not seen any evidence to suggest that the per-suite geometric mean of medians over cases is a valid measure of anything. So maybe there is some additional work in finding a proper methodology for a better comparison too.

    Comment by jonasfj — April 1, 2015 @ 8:04 pm

  3. Do we have any way of weeding out bad-apple EC2 spot instances at the start of a test run? We see inconsistencies in our CI infra that could be explained by similar issues and it would be great if we could devise a way to catch and terminate these instances before they take jobs.

    Comment by RyanVM — April 4, 2015 @ 2:49 pm

  4. @RyanVM,
    I see bad-apples as something that happens no matter what infrastructure we use.
    Solving the problem is non-trivial; long-term we might do some generic bad-apple
    detection at the worker level. But it could also be a waste of time if non-talos tests
    are less sensitive to bad-apples or bad-apples are too rare.

    Specifically, for talos tests I think this can be mitigated by chunking test cycles.
    Right now we run many suites with tppagecycles = 25, that is, for each test-page in a
    suite like cart, tart, tp5o and others we open the test-page 25 times.
    Instead of doing 25 cycles in one task (on one machine) we could do 10 cycles in
    3 tasks, which are likely to run on 3 different machines.

    When comparing the results we’ll then notice if the 10 cycles from one task are
    very different from the 20 cycles from the 2 other tasks. Ideally, we would do this comparison
    in a decision task that would be able to schedule more tasks/cycles if we suspect a
    bad apple is to blame. This way we can distinguish bad-apples from changes that introduce
    higher variation/intermittence in a test case.

    As an added bonus it would also be faster to run 3 tasks with 10 cycles in parallel
    than it is to run one task with 25 cycles (as we have no parallelism). If talos is
    packaged efficiently with docker the only overhead is downloading/extracting the
    Firefox binary from S3.

    Comment by jonasfj — April 4, 2015 @ 11:24 pm
