Speaker 1: Welcome to this lecture about MANOVA. MANOVA is based on the same math as LDA, so you need to watch that video first to fully understand the details behind MANOVA. In this lecture, we'll have a look at the basics of MANOVA, and at the end we'll discuss the assumptions for MANOVA and the basic math behind its calculations.

Before we look at MANOVA, let's first discuss ANOVA. Remember that ANOVA is a statistical method that can be used to compare the means of two or more independent groups. Usually, we use ANOVA only when we'd like to compare three or more groups, and use the t-test when we'd like to compare just two groups. In this example, we'd like to compare the mean tumor size of independent patients with prostate cancer on three different treatments: A, B, and C. The null hypothesis states that there is no difference in mean tumor size between the three treatments.

MANOVA, or multivariate analysis of variance, is the multivariate version of ANOVA. MANOVA tests whether there is a difference between two or more independent groups based on two or more dependent variables. In this case, we have two dependent variables, which represent the tumor size and the prostate-specific antigen (PSA) concentration in the blood. Note the difference between ANOVA and MANOVA: ANOVA is based on only one dependent variable, whereas MANOVA is based on at least two. The null hypothesis of the MANOVA in this example states that the mean tumor size is equal for all treatment groups, and that the PSA concentration is also equal across the treatments. According to the null hypothesis, the three treatments should therefore have the same effect on the two variables.

We'll use the following fictitious data to compute the MANOVA. This is the exact same data set as we used in the video about LDA. The data represents measurements of some clinical variables on 12 patients. This column shows whether a patient has a viral or a bacterial infection, this column shows the concentration of C-reactive protein (CRP) in blood from the time when the patients entered the hospital, and this column shows the body temperature of the same patients at the same time point. We can illustrate the data like this, where the darker bars represent the mean CRP concentration of the patients and the other bars represent the mean body temperature.

The null hypothesis for this example states that the patients with the bacterial infection have the same mean CRP concentration and body temperature as the ones with the viral infection. We will now test the hypothesis that the mean CRP and the mean body temperature are equal between the two groups. If we run a MANOVA on the data, we get a p-value of 0.001, which is less than the conventional significance level of 0.05. We can therefore reject the null hypothesis and conclude that there is a significant difference in mean body temperature and CRP concentration between the two groups.
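As a rough sketch of how such a test might be run in practice, here is a minimal Python example using statsmodels. Note that the data values below are invented placeholders with the same structure as the lecture's data set (12 patients, two groups, CRP and body temperature), not the actual values from the slides:

```python
# Minimal MANOVA sketch with statsmodels; the data values are invented
# placeholders, not the lecture's actual data set.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "infection": ["viral"] * 6 + ["bacterial"] * 6,
    "crp":  [18.0, 20.5, 21.0, 19.0, 20.0, 18.5,      # C-reactive protein
             33.0, 36.5, 35.0, 34.0, 32.5, 36.0],
    "temp": [37.1, 37.4, 37.5, 37.2, 37.3, 37.2,      # body temperature
             38.4, 38.9, 38.7, 38.6, 38.3, 38.8],
})

# Two dependent variables (crp, temp) against one grouping factor.
manova = MANOVA.from_formula("crp + temp ~ infection", data=df)
print(manova.mv_test())
```

The mv_test() output reports several multivariate test statistics (discussed later in this lecture) together with their F-approximations and p-values.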
When we reject the null hypothesis of the MANOVA, we might want to continue with post-hoc tests, for example two separate ANOVAs or two separate t-tests. We could run one ANOVA to check whether there is a difference in mean CRP concentration between the two groups, and a second ANOVA to test whether there is a difference in mean body temperature. However, when we study the two variables separately, it is possible that we can no longer find a significant difference between the groups. This might happen if the groups are well separated when we study the two variables simultaneously, but show no clear separation when we study the variables one at a time. In addition, MANOVA also has greater power than separate univariate tests.

MANOVA relies on the following assumptions. The first assumption is that the groups being compared are independent.

The second assumption is so-called multivariate normality. To understand multivariate normality, let's have a look at normality in the univariate case, where we study each variable separately. Here we see that the variable CRP has a normal distribution, and that the body temperature is also normally distributed. Now imagine that we plot the distributions of the two variables against each other. Assuming multivariate normality, we would expect a bell-shaped surface in a three-dimensional plot like this. There are tests for multivariate normality, such as the generalized Shapiro-Wilk test.

The third assumption is homogeneity of the covariance matrices. This is similar to the assumption for ANOVA, where we assume that the variances across the groups are equal. Since we have more than one dependent variable in MANOVA, the assumption is instead that the covariance matrices of the different groups should be equal. This is the computed covariance matrix for the viral group, and this is the covariance matrix for the bacterial group. We see that the two covariance matrices are fairly similar, which indicates that we fulfill the assumption of homogeneity. We can formally test whether the two covariance matrices are equal by using Box's M test, where the null hypothesis states that the covariance matrices are equal across the groups. If we run this test on our data, we get a p-value that tells us we should not reject the null hypothesis, which means that we fulfill the assumption of homogeneity.

The next assumption is that there should be no multicollinearity, meaning that the dependent variables should not be too strongly correlated. According to this scatter plot, there is no indication of a strong linear relationship between the two dependent variables, CRP and body temperature. If the data looked something like this instead, with a very strong linear relationship between the two variables, we would have problems with multicollinearity.

The final assumption is that there should be a linear relationship between the dependent variables within each group. We can see that the two variables show a fairly linear pattern in both groups; at least there is no indication of any nonlinear patterns. Thus, there should not be a nonlinear pattern like this, for example.
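As a small sketch of how two of these checks might look in code, reusing the hypothetical df from the snippet above (a formal Box's M test is not included here; third-party packages such as pingouin provide one):

```python
# Eyeball two MANOVA assumptions on the hypothetical df defined earlier:
# similar per-group covariance matrices (homogeneity), and no extreme
# correlation between the dependent variables (no multicollinearity).
import numpy as np

for name, sub in df.groupby("infection"):
    # 2x2 sample covariance matrix of crp and temp within this group
    print(name)
    print(np.cov(sub["crp"], sub["temp"]))

# Correlation between the two dependent variables
r = np.corrcoef(df["crp"], df["temp"])[0, 1]
print("correlation crp vs temp:", round(r, 2))
```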
We'll now have a look at the basic math behind MANOVA, so that you can understand the output from statistical software tools. MANOVA is based on similar math to LDA, but whereas LDA involves calculating covariance matrices, MANOVA instead uses so-called sums of squares and cross products (SSCP) matrices. In the lecture about LDA, we used the covariance matrix, which was computed based on all the data points: this element represents the variance of the variable CRP, this is the variance of the body temperature, and this is the covariance of the two variables. Remember that the variance is calculated by the following formula, where the numerator is the sum of the squared differences between the observations and the mean, usually called the sum of squares. The covariance is calculated by a similar equation, and its numerator is sometimes called the cross product.

If we now calculate the SSCP matrix for the total variation, we get the following matrix. This element represents the sum of squares of the CRP variable. By dividing this value by n − 1, which is 11 in our example since we have 12 individuals, we get the same value as in the covariance matrix: 48 divided by 11 is approximately 4.4. These two elements represent the sums of cross products, which are calculated using only the numerator of the covariance equation. Dividing these values by n − 1 gives the covariance between CRP and body temperature.

The SSCP matrix can also be calculated by the formula DᵀD, where D is the data matrix with centered values and Dᵀ is its transpose. We can then calculate the SSCP matrix for the total variation like this: these are the centered CRP values, where the mean of the CRP values has been subtracted from the original CRP values, and these are the centered body temperature values. The product of these two matrices results in a 2 by 2 matrix, the SSCP matrix of the total variation.

We will now see how to calculate the within-groups SSCP matrix. We first center the values within each group by subtracting the group's mean CRP and body temperature from the original values. These are the centered values for the viral group; for example, this value has been calculated by subtracting the mean CRP of the viral group, which is 19.4, from the CRP value of the first person. We then multiply these two matrices, which gives us the SSCP matrix for the viral group. Next, we do the same calculations for the bacterial group; note that the centered values are now based on the means of the bacterial group. Finally, we add these two matrices to get the within-groups SSCP matrix.

Once we have the total and within-groups SSCP matrices, we can calculate the between-groups SSCP matrix by subtracting the within-groups matrix from the total matrix. This gives us the following between-groups SSCP matrix, and we now have all the SSCP matrices.

Next, we calculate the matrix S, which describes the separation between the groups, by multiplying the inverse of the within-groups SSCP matrix W by the between-groups SSCP matrix B, that is, S = W⁻¹B. Remember that we did a similar calculation in linear discriminant analysis, where we multiplied the inverse of the pooled within-group covariance matrix by the between-group covariance matrix.

We then calculate the eigenvalues of S. In contrast to LDA, for MANOVA we are only interested in the eigenvalues, not the eigenvectors. These are our two eigenvalues. Once we have computed the eigenvalues, we can choose between four different methods for calculating the test statistic. Pillai's trace is the sum of λ / (1 + λ) over the eigenvalues; using our eigenvalues, the test statistic comes out to 0.78.
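The whole chain of matrix calculations just described can be reproduced in a few lines of NumPy, continuing the hypothetical df from the first snippet (so the numbers will not match the lecture's slides). Recall that the SSCP matrix keeps just the numerators of the variance and covariance formulas, Σ(xᵢ − x̄)² and Σ(xᵢ − x̄)(yᵢ − ȳ):

```python
# SSCP matrices and eigenvalues, following the steps above
# (a sketch on the hypothetical df, not the lecture's exact numbers).
import numpy as np

X = df[["crp", "temp"]].to_numpy(dtype=float)

# Total SSCP: center by the grand means, then T = D^T D
D = X - X.mean(axis=0)
T = D.T @ D

# Within-groups SSCP: center each group by its own means, sum the D^T D's
W = np.zeros((2, 2))
for _, sub in df.groupby("infection"):
    G = sub[["crp", "temp"]].to_numpy(dtype=float)
    Dg = G - G.mean(axis=0)
    W += Dg.T @ Dg

B = T - W                       # between-groups SSCP
S = np.linalg.inv(W) @ B        # separation matrix, as in LDA
eigvals = np.linalg.eigvals(S).real  # only eigenvalues are needed here

print("eigenvalues:", eigvals)
print("Pillai's trace:", np.sum(eigvals / (1 + eigvals)))
```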
Hotelling's trace, also called the Hotelling-Lawley trace, is simply the sum of the eigenvalues, which is 3.5 in our example. Wilks' lambda is the product of 1 / (1 + λ) over the eigenvalues. And Roy's largest root is simply the largest eigenvalue, which here is the first eigenvalue. Pillai's trace is generally considered to be the most robust against violations of the assumptions behind MANOVA, and is therefore a common test to select. Once a test statistic has been selected, one can compute the F-statistic and the p-value; however, those calculations are not shown here. Using a statistical software tool, the F-statistic and p-value were calculated based on our example data.
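For completeness, all four test statistics follow directly from the eigenvalues; a minimal sketch, continuing the previous snippet:

```python
# All four MANOVA test statistics from the eigenvalues of S
# (eigvals was computed in the previous snippet).
import numpy as np

pillai    = np.sum(eigvals / (1 + eigvals))   # Pillai's trace
hotelling = np.sum(eigvals)                   # Hotelling-Lawley trace
wilks     = np.prod(1 / (1 + eigvals))        # Wilks' lambda
roy       = np.max(eigvals)                   # Roy's largest root
print(pillai, hotelling, wilks, roy)
```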
This was the end of this lecture about MANOVA. In the next lecture, we'll discuss Hotelling's T-squared test, which can be used as an alternative to MANOVA when we have only two groups. Hotelling's T-squared test produces the same results as a MANOVA for two groups, but it is a lot easier to calculate by hand.