Beyond Triangle Tests

Blind triangle tests in the homebrewing world were probably popularized by Brülosophy. Having conducted and evaluated one of my own, and having taken part in it myself, I would like to share my thoughts in a short essay and point out what such tests mean and where their limits are.

What are triangle tests?

Essentially, triangle tests are a statistical tool for detecting perceptible differences between two samples, in this case beer. To do this, the taster is given three samples, two of which are the same, "A", and the third is different, "B". Translated to beer, in an experiment where the variable is, for example, the maltster, the taster receives two beers brewed with Weyermann Pilsner malt and one beer brewed with Ireks Pilsner malt. The taster must now select the sample that is (sensorily) different from the other two. In short: Can you taste a difference?

Statistics

When is a triangle test statistically significant? In other words, when can we say that there is a sensory difference between the two samples/beers? To answer this, we first have to define the p-value. It tells us how likely it is that a result at least as good as the one observed would occur by pure chance, that is, if there were no perceptible difference between the two beers (the null hypothesis) and every tester simply guessed. By convention, the threshold for the p-value (the significance level) is set at 5%. That means we accept a risk of up to 5% that a result is declared "statistically significant" only because enough testers happened to guess correctly.
If we want to make this risk smaller, we need more testers to select the correct beer; if we accept a larger risk, fewer testers need to select it correctly.
Given a significance level, we can use the binomial distribution to find the number of participants who must select the correct sample. With 25 participants, 13 have to select the correct beer. With 50 participants, 23 have to.
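
The binomial calculation can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular published test: it assumes every taster guesses independently with a 1 in 3 chance of being right when there is no difference, and the function names p_value and min_correct are my own.

```python
from fractions import Fraction
from math import comb


def p_value(correct: int, n: int) -> Fraction:
    """Chance of `correct` or more right answers out of n if everyone guesses."""
    # A pure guess is right with probability 1/3, so
    # P(X = i) = C(n, i) * 2^(n - i) / 3^n. Summing the upper tail gives the p-value.
    tail = sum(comb(n, i) * 2 ** (n - i) for i in range(correct, n + 1))
    return Fraction(tail, 3 ** n)


def min_correct(n: int, alpha: Fraction = Fraction(1, 20)) -> int:
    """Smallest number of correct answers whose p-value is at or below alpha."""
    # The p-value only shrinks as `correct` grows, so the first hit is the minimum.
    for k in range(n + 1):
        if p_value(k, n) <= alpha:
            return k
    raise ValueError("too few tasters: no result can reach this significance level")


print(min_correct(25))  # 13
print(min_correct(50))  # 23
print(float(p_value(12, 25)))  # above 0.05, so 12 of 25 is not significant
```

Exact fractions are used instead of floating point so that very large panels do not run into overflow or rounding problems.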

Let us now turn to what this means in practice and what we can learn (or not) from triangle tests.

Consideration 1: Number of participants

With small groups of participants, we need a comparatively high proportion of correctly identified beers to achieve statistical significance while keeping the significance level constant (5%). This is a consequence of guarding against lucky guesses: with few tasters, chance alone produces large swings in the share of correct answers.

Participants      Required correct  Share
10                7                 70%
25                13                52%
50                23                46%
100               42                42%
500               185               37%
5000              1723              34%
n -> infinity     n/3               33.33%

If we look at the proportion of participants who have to be correct as the total number of participants increases, we see that the value is very high for small groups and then approaches 1/3 from above. This is logical, since we expect on average one third of the answers to be "correct" by chance when all three beers taste the same.
Again, for relatively small samples (e.g., 25 participants), this means that considerably more correct results are needed than the one third expected by chance: 13 rather than about 8. Even with 50 participants, 6 more participants than the roughly 17 expected by chance need to be correct.
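
The convergence toward one third can be made explicit with a normal approximation to the binomial distribution. This is only a rough sketch: z is the one-sided 95% quantile of the standard normal distribution (about 1.645), the 1/2 is a continuity correction, and the result still has to be rounded up to a whole number of tasters.

```latex
k_{\min} \approx \frac{n}{3} + z\sqrt{\frac{2n}{9}} + \frac{1}{2}
\quad\Longrightarrow\quad
\frac{k_{\min}}{n} \approx \frac{1}{3} + z\sqrt{\frac{2}{9n}} + \frac{1}{2n},
\qquad z \approx 1.645
```

The extra share on top of 1/3 shrinks with the square root of n, so quadrupling the number of participants only roughly halves it.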

Consideration 2: The sample group

In most "homebrew tests" a semi-random group is tested: some beer drinkers who could be found at the local craft beer bar, plus other homebrewers. Assuming 12/25 people were successful in the test, the test was not statistically significant. Now let's take a few (say 5) of the people who were correct and very certain, and have them take the test 5 times each. If the result were 20/25 correct guesses, the test would be statistically significant. If that is reproducible within this group, it is a valid result.
The argument of "sensory weak and strong people" has been brought up a few times on Brülosophy. I think it is very valid, because I have had some people who were very sure and were right, and most of the ones who were wrong were unsure.
So the outcome may depend on the group selected in each case and on the test conditions (see below).
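
To get a feeling for how much the group matters, here is a small sketch under a deliberately simple assumption of mine: a certain share of tasters can always pick the odd beer ("discriminators"), and everyone else guesses with a 1 in 3 chance. Real tasters are not this clear-cut, and the names tail_prob and power are only illustrative. The function estimates how often a panel drawn at random from such a population would reach the required number of correct answers.

```python
from fractions import Fraction
from math import comb


def tail_prob(k: int, n: int, p: Fraction) -> Fraction:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    terms = (comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return sum(terms, Fraction(0))


def power(n: int, k_required: int, discriminators: Fraction) -> Fraction:
    """Chance that at least k_required of n tasters are correct."""
    # Assumed model: discriminators are always right, the rest guess (1 in 3).
    p_correct = discriminators + (1 - discriminators) * Fraction(1, 3)
    return tail_prob(k_required, n, p_correct)


# 25 tasters, 13 correct answers required (the 5% threshold used in this essay)
for percent in (0, 10, 20, 30, 50):
    share = Fraction(percent, 100)
    print(f"{percent:>3}% discriminators: {float(power(25, 13, share)):.2f}")
```

Under this model, a panel of 25 in which one taster in five can reliably tell the beers apart still fails to reach significance more often than not.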

Consideration 3: Not statistically significant doesn't mean not different

The biggest misconception drawn from a non-significant triangle test is that the beers are not (sensorily) different. The test only implies that in a group of more or less random people, no statistically significant proportion was able to find a difference in a blind test. This does not mean that another person would not be able to taste a difference in a repeatable way, or that a tester might not be able to tell the beers apart if they didn't drink them in a test atmosphere out of three plastic cups.
In my Vienna Lager test I noticed this quite clearly. It was really hard to pass the triangle test repeatedly with small glasses while blindfolded, but there was still a clear taste difference between the beers when I drank them side by side from large glasses.

Consideration 4: Circumstance

Most "homebrew tests" may not be conducted in optimal conditions, which can make it much harder to taste the difference. These include:

Testers have had several beers before, perhaps including a strong IPA.
No water or snacks are available for neutralization.
The beer is poured in smaller glasses (or opaque plastic cups) that have a different sensory feel.
Pouring into smaller glasses changes the flavor and smell of the beer. I have noticed this especially when pouring beers from the tap.
The environment is noisy.

Consideration 5: The sum of small changes

If you take "not statistically significant" to mean "makes no difference" and therefore combine several brewing variables that were each not statistically significant on their own, that is exactly when significant differences can occur. A warm fermentation temperature for a lager was not statistically significant, boiling for 30 minutes instead of 90 minutes was not statistically significant, complete trub or little trub in fermentation was not statistically significant, mash temperature was not statistically significant, etc. Now if you combine all of these variables, the beer is more likely to be different than if you only change one variable.

Conclusion

A triangle test is a lot harder than you imagine if you've never done one before. The results very quickly tempt you to say, "It was shown that there was no difference, so now I'll take such-and-such a shortcut when brewing." In my opinion, those shortcuts add up quickly. Most things make a difference. Even brewing the same beer exactly the same way a second time is not necessarily easy. Especially with my favorite beers from Franconian breweries, I've learned that every detail matters in brewing a perfect beer, and not just any beer.
The most important thing is to test these variables yourself, and decide for yourself if you like the results.

Hopload Blog.
