Comparing two multivariate distributions (two matrices)

rayms · Nov 11, 2011

I urgently need some help with a problem for my MS thesis. I have two datasets with the same variables but different numbers of observations, i.e. the same number of columns but not the same number of rows. The variables are identical for both sets. I want to compare the multivariate distributions of the two datasets. I have done some Google research on the matter, and all I could find are tests for normality of multivariate samples. Although that information is also useful, I am more interested in comparing my two datasets whatever their distributions may be. In what way should I compare them? What are the parameters of comparison? For a multivariate normal distribution, I have read in an old paper (1983) by Hannu Oja that the eigenvalues of the covariance matrix are a measure of the spread or scatter of the data, or at least that is how I understood it. Please confirm whether I am right or wrong. This is as far as my search for answers could go. I want to have parameters of comparison even if my datasets are not normal. Since I am dealing with two matrices, I also welcome suggestions from the mathematical point of view. Thank you very much in advance for any and all suggestions.
Rayms

bpet · Nov 11, 2011

I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).

rayms · Nov 12, 2011

bpet said:

I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).

Thanks for the reply, bpet. I was already assuming nobody cares and nobody reads my thread. But I exaggerate. The problem is that I cannot reduce the dimension of my data any further; it is already reduced from the original. I guess I have to be more specific about what I am using the data for. The datasets will be used to build regression models. One of my hypotheses is that the validity or predictive accuracy of the models depends on the internal structure of the datasets used. This internal structure can be described by their distributions and other mathematical properties. In other words, I am trying to look at differences in the data structure of the two sets and relate these differences to the resulting models' performance.

bpet · Nov 12, 2011

Ok, probably the easiest way is first to compare the marginals and then the joint distribution.

The marginals (i.e. individual variables) can be compared using the usual univariate nonparametric two-sample tests (Kolmogorov-Smirnov, Anderson-Darling, Cramér-von Mises, Mann-Whitney etc.).
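A minimal sketch of the marginal comparison, assuming Python with NumPy and SciPy and two arrays with observations in rows. The function name `compare_marginals`, the Bonferroni correction, and the simulated data are illustrative assumptions, not something from this thread; only the KS and Mann-Whitney tests are shown.

```python
import numpy as np
from scipy import stats

def compare_marginals(X, Y, alpha=0.05):
    """Run KS and Mann-Whitney two-sample tests on each column of X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[1] != Y.shape[1]:
        raise ValueError("X and Y must have the same number of columns")
    n_vars = X.shape[1]
    # Bonferroni threshold for the KS test across all variables (an assumption;
    # other multiple-testing corrections are possible).
    threshold = alpha / n_vars
    results = []
    for j in range(n_vars):
        ks = stats.ks_2samp(X[:, j], Y[:, j])
        mw = stats.mannwhitneyu(X[:, j], Y[:, j], alternative="two-sided")
        results.append({
            "variable": j,
            "ks_stat": float(ks.statistic),
            "ks_p": float(ks.pvalue),
            "mw_p": float(mw.pvalue),
            "ks_differs": bool(ks.pvalue < threshold),
        })
    return results

# Simulated example: same 3 variables, different numbers of rows,
# with the third variable shifted in the second data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(150, 3))
Y[:, 2] += 1.0
results = compare_marginals(X, Y)
for row in results:
    print(row)
```

The two tests look at different things: KS is sensitive to any difference between the two empirical distribution functions, while Mann-Whitney mainly picks up shifts in location.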

If no significant differences are found in the marginals, and if the marginals are continuous, you could remove the scale from the data by converting each variable's rank order to percentiles. Then try some graphical tools (such as parallel coordinates, Andrews plots, scatter-plot matrices etc.). There also exist several multivariate distribution-free two-sample tests, but I don't know a lot about that area.
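The rank-to-percentile step could be sketched as follows, assuming Python with NumPy. The function name `to_percentiles` and the choice of dividing by n + 1 (so values stay strictly between 0 and 1) are assumptions for illustration.

```python
import numpy as np

def to_percentiles(X):
    """Replace each column by its ranks scaled into the open interval (0, 1)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # argsort of argsort gives 0-based ranks within each column. Ties are
    # broken by position, which is harmless for continuous data.
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n + 1)

X = np.array([[10.0, 0.3],
              [20.0, 0.1],
              [15.0, 0.2]])
U = to_percentiles(X)
print(U)
# [[0.25 0.75]
#  [0.75 0.25]
#  [0.5  0.5 ]]
```

Applying this to each data set separately makes every marginal roughly uniform, so whatever differences remain in the plots come from the dependence between variables rather than from their scales.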

Also, the above assumes that your data sets consist of IID observations; for time series data, for example, other methods might be more suitable.
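As one example of a multivariate distribution-free two-sample test, here is a rough sketch of a permutation test based on the energy distance, assuming Python with NumPy. This is a swapped-in technique chosen for illustration, not one named in this thread, and the function names are hypothetical. The permutation step relies on the observations being exchangeable under the null hypothesis, which is where the IID assumption matters.

```python
import numpy as np

def _mean_dist(A, B):
    """Mean Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def energy_statistic(X, Y):
    """Sample energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * _mean_dist(X, Y) - _mean_dist(X, X) - _mean_dist(Y, Y)

def energy_permutation_test(X, Y, n_perm=500, seed=0):
    """Permutation p-value for H0: X and Y come from the same distribution."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    observed = energy_statistic(X, Y)
    pooled = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        # Reassign rows to the two groups at random, keeping group sizes.
        idx = rng.permutation(len(pooled))
        if energy_statistic(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

# Simulated example: different row counts, second set shifted in every variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
Y = rng.normal(size=(40, 3)) + 1.5
stat, p_value = energy_permutation_test(X, Y)
print(stat, p_value)
```

The pairwise-distance arrays grow with the product of the sample sizes, so for large data sets the distances would need to be computed in chunks or on a subsample.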

HTH

blue_raver22 · Nov 19, 2011

I understand your urgency in finding a solution for your thesis problem. Comparing two multivariate distributions is a complex task and requires careful consideration of various factors.

Firstly, it is important to determine the purpose of your comparison. Are you looking to assess the similarity or difference between the two datasets? Are you interested in identifying potential outliers or patterns within the data? Once you have a clear objective, you can then choose an appropriate statistical method for comparison.

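One approach could be a multivariate analysis of variance (MANOVA), which compares the mean vectors of two groups across several variables at once. Note that classical MANOVA assumes multivariate normality and equal covariance matrices, and it only detects differences in means, so for non-normal data a rank-based or permutation variant is the safer choice. Another option is a non-parametric test such as the Kolmogorov-Smirnov test, which compares the empirical cumulative distribution functions of two samples. In its standard form it is univariate, so it applies to each variable separately rather than to the joint distribution.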

In terms of parameters of comparison, you could consider measures such as mean, median, standard deviation, and range for each variable in the two datasets. Additionally, you could also look at measures of association, such as correlation coefficients, to assess the relationship between the variables in the two datasets.
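These per-variable summaries and the correlation comparison might look like the following in Python with NumPy. The helper names `summarize` and `max_corr_gap` and the toy data are made up for illustration.

```python
import numpy as np

def summarize(X):
    """Per-variable summaries plus the correlation matrix of one data set."""
    X = np.asarray(X, dtype=float)
    return {
        "mean": X.mean(axis=0),
        "median": np.median(X, axis=0),
        "std": X.std(axis=0, ddof=1),            # sample standard deviation
        "range": X.max(axis=0) - X.min(axis=0),
        "corr": np.corrcoef(X, rowvar=False),    # variables are in columns
    }

def max_corr_gap(X, Y):
    """Largest absolute difference between the two correlation matrices."""
    return float(np.abs(summarize(X)["corr"] - summarize(Y)["corr"]).max())

# Toy data: the two variables move together in X and in opposite directions in Y.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y = np.array([[1.0, 6.0], [2.0, 4.0], [3.0, 2.0], [4.0, 0.0]])
sx = summarize(X)
gap = max_corr_gap(X, Y)
print(sx["mean"], sx["std"], gap)
```

Summaries like these describe each data set but are not a test by themselves; a difference in them still needs to be judged against sampling variability.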

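Regarding your question about the eigenvalues of the covariance matrix, it is correct that they measure spread: each eigenvalue is the variance of the data along one principal axis, their sum is the total variance, and their product (the determinant) is the generalized variance. However, this approach is limited because the covariance matrix captures only second-order structure. The eigenvalues alone say nothing about location, about the orientation of the principal axes (the eigenvectors), or about skewness, heavy tails, and nonlinear dependence.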
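A small sketch of the eigenvalue-based summaries, assuming Python with NumPy. The function name `scatter_summary` and the toy data are illustrative assumptions.

```python
import numpy as np

def scatter_summary(X):
    """Eigenvalues of the sample covariance matrix and two scalar summaries."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)              # variables are in columns
    # eigvalsh is meant for symmetric matrices and returns ascending order.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return {
        "eigenvalues": eigvals,
        "total_variance": float(eigvals.sum()),        # trace of cov
        "generalized_variance": float(eigvals.prod()), # determinant of cov
    }

# Toy data: two uncorrelated variables with different spreads.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
summary = scatter_summary(X)
print(summary)
```

Computing the same summary for both data sets gives directly comparable numbers, since the number of eigenvalues depends only on the number of variables and not on the number of rows.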

In addition to statistical methods, you could also explore visualization techniques such as scatter plots, box plots, and parallel coordinate plots to compare the distributions of the two datasets.

Overall, it is important to carefully consider your research question and choose an appropriate method for comparison. I hope this information helps and wish you the best of luck with your thesis.
