Validity in User Experience Research

In systematic research we attempt to answer a research question with enough rigor to be confident in our results, and in the decisions we make based on those results. Many people assume that confidence requires a large sample size, thanks to the widely known convention of testing statistical significance against a p-value threshold of 0.05 (or, in stricter cases, 0.01). That threshold is another way of describing the proportion of false alarms, or unrepresentative results, that we’re willing to tolerate. There are four assumptions baked into this approach:
1. We have a fixed, strict tolerance across every type of research
2. We are prioritizing proportions (or quantitative values) as outcomes
3. We care about every false alarm
4. Confidence is determined in a vacuum
In typical UX research, none of these assumptions apply. So we use alternative scientific practices to determine our sample size and, in turn, our confidence. If you’re interested in the details, read on.
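To see why the conventional approach demands large samples, here is a rough sketch (my own illustration, not part of the original argument) of a standard normal-approximation sample-size calculation for comparing two task-success rates at a 0.05 threshold with 80% power; the 50% vs. 60% success rates are assumed for the example.

```python
from statistics import NormalDist

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect the difference
    between two success rates with a two-sided z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2

# Assumed scenario: old design succeeds 50% of the time, new design 60%.
print(round(two_proportion_sample_size(0.50, 0.60)))  # -> roughly 385 participants per group
```

Hundreds of participants per condition is routine for that kind of test, which is exactly the expectation the rest of this post pushes back on.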
Our tolerance isn’t fixed; it needs to be good enough to make a judgement without breaking the bank.
Qualitative research is incredibly expensive. How much would you spend to learn that the new product has exactly a 25% higher success rate for a task? How much less would you spend to learn that it has a 20-30% higher success rate? If you’re willing to tolerate a 10-point swing, you need far fewer participants and get more bang for your buck. As a result, when you hear UX research results, quantitative values are usually reported as approximations. This practice of adjusting your confidence based on context and application is called practical significance.
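As a rough illustration of that trade-off (the 95% confidence level and the margins below are my assumptions, not figures from any specific study), the standard margin-of-error formula for a proportion shows how quickly precision gets expensive:

```python
from statistics import NormalDist

def sample_size_for_margin(margin, p=0.5, confidence=0.95):
    """Participants needed for a measured proportion to land within +/- margin
    of the true value at the given confidence level (normal approximation).
    p = 0.5 is the worst case, giving the most conservative estimate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z ** 2 * p * (1 - p)) / margin ** 2

print(round(sample_size_for_margin(0.01)))  # pin the rate down to within +/-1 point: about 9,604 people
print(round(sample_size_for_margin(0.05)))  # accept a 20-30% style range (+/-5 points): about 384 people
```

Halving the margin of error roughly quadruples the sample, which is why tolerating a wider range buys so much.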
UX research generally doesn’t prioritize quantitative values. Instead, it’s about insights.
In qualitative research we seek to understand the ‘why’ behind the ‘what’. One of the goals of UX research is to learn which triggers cause successes or failures along the user journey. This means that we prioritize insights and meaning (representing those triggers) over quantitative values. Another way of thinking about it: while it’s informative to understand product health as well as success or failure rates, it’s even more important to understand why people are succeeding or failing, so that we can learn from our findings. For example: the top three reasons we’re seeing failures are X, Y, and Z. It turns out that you need fewer participants to start seeing trends in insights. In fact, one of the most common ways to determine your sample size in qualitative research is called ‘saturation’. Saturation is the point at which you see fewer and fewer new insights as your sample size increases; many researchers stop adding participants once the returns on each additional session diminish.
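There is no universal formula for saturation, but here is a minimal sketch of how a team might track it: tag each session with the insight codes it produced and stop recruiting once new sessions stop adding new codes. The session data and the stopping rule are invented for illustration.

```python
def new_insights_per_session(sessions):
    """For each session (a set of insight codes), count how many codes
    were not seen in any earlier session."""
    seen, new_counts = set(), []
    for codes in sessions:
        fresh = codes - seen
        new_counts.append(len(fresh))
        seen |= fresh
    return new_counts

def reached_saturation(new_counts, window=3, max_new=1):
    """Heuristic stopping rule: the last `window` sessions each added
    no more than `max_new` new insights."""
    tail = new_counts[-window:]
    return len(tail) == window and all(n <= max_new for n in tail)

# Hypothetical coded sessions from a usability study.
sessions = [
    {"nav_confusion", "jargon", "slow_export"},
    {"nav_confusion", "filter_hidden"},
    {"jargon", "slow_export", "nav_confusion"},
    {"filter_hidden", "nav_confusion"},
    {"jargon"},
]
counts = new_insights_per_session(sessions)
print(counts)                      # [3, 1, 0, 0, 0]
print(reached_saturation(counts))  # True -> diminishing returns, stop recruiting
```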
When we prioritize the most common insights, we don’t need to care about every false alarm.
Recall that the p-value threshold represents the proportion of false alarms we’re willing to tolerate. In UX research we often prioritize common findings over uncommon findings. For example, what are the most common errors? What are the most common reasons people make those errors? If we were building flight software or medical tools, where lives are at stake, we’d need to identify and understand all errors, whether common or uncommon. In working with data solutions, we have the liberty of prioritizing the most common issues and insights. After all, data is informative, not directive. It turns out that the p-value doesn’t work well when you’re only looking for the most common findings. Rather than asking how many false alarms we’re willing to tolerate, we should ask: for the issues we do find, how confident are we that they are valid? Since we’re looking for the most common issues, and we have some foundation for why we’re seeing certain outcomes, a better model of our confidence is the binomial formula. How many times do you need to flip a coin to be 85% sure you’ll see tails at least once? Just three. Let’s say your user experience has an issue that will impact 30% or more of your users. If you test that experience with 5 users, there is at least an 83% chance (1 - 0.7^5 ≈ 0.83) that you will observe that issue in the study. In contrast, if a problem only affects 2% of your users, then the same 5-user study has less than a 10% chance (1 - 0.98^5 ≈ 0.096) of observing it even once.
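Here is the arithmetic behind those numbers as a minimal sketch; the 30% and 2% incidence rates are the ones used in the examples above.

```python
def chance_of_observing(issue_rate, n_participants):
    """Probability that at least one of n participants hits an issue affecting
    `issue_rate` of the user base: 1 minus the binomial chance of zero observations."""
    return 1 - (1 - issue_rate) ** n_participants

print(round(chance_of_observing(0.50, 3), 3))  # fair coin, 3 flips: 0.875 -> "85% sure" of seeing tails
print(round(chance_of_observing(0.30, 5), 3))  # issue affecting 30% of users, 5 participants: 0.832
print(round(chance_of_observing(0.02, 5), 3))  # issue affecting 2% of users, 5 participants: 0.096
```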
Our confidence isn’t determined in a vacuum. UX research is summative and triangulated.  
Let’s say you’re conducting a coin toss study to test whether the coin is weighted. You can run your study with a p-value threshold of 0.05 to be confident in your results. But what if you could test the coin in other physical ways? What if you could observe the manufacturing process? What if you had bought 100 coins from the same manufacturer and 99 of them were weighted? Now that the research is summative and triangulated, how many times should you toss the coin to be confident in your results? That’s UX research.
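The post doesn’t prescribe a formula for combining these sources of evidence, but one way to make the coin analogy concrete is Bayesian updating (my framing, not a method claimed here): strong prior evidence about the manufacturer means only a handful of tosses is needed. The 99-of-100 prior, the toss results, and the 80% heads rate for a weighted coin are all illustrative assumptions.

```python
def posterior_weighted(prior_weighted, heads, tails, weighted_heads_rate=0.8):
    """Posterior probability the coin is weighted, given toss results.
    Assumes a weighted coin lands heads `weighted_heads_rate` of the time
    and a fair coin lands heads half the time (illustrative model)."""
    like_weighted = (weighted_heads_rate ** heads) * ((1 - weighted_heads_rate) ** tails)
    like_fair = 0.5 ** (heads + tails)
    numerator = like_weighted * prior_weighted
    return numerator / (numerator + like_fair * (1 - prior_weighted))

# In a vacuum (50/50 prior), 4 heads in 5 tosses is far from conclusive.
print(round(posterior_weighted(0.50, heads=4, tails=1), 3))  # ~0.724
# Triangulated: 99 of 100 coins from this manufacturer were weighted.
print(round(posterior_weighted(0.99, heads=4, tails=1), 3))  # ~0.996
```

With corroborating evidence already in hand, the same five tosses move you from “probably” to “almost certainly,” which is the point of triangulation.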
To summarize: to conduct valid UX research, we determine our sample size through saturation, supported by the binomial formula; we report values as approximations; and we triangulate across methods.
Questions? Comments? You can reach me at Tira.Schwartz@Socrata.com.
-Tira Schwartz, Principal UX Researcher