Sampling theory and random sample

  • Thread starter fog37
  • #1
fog37
TL;DR Summary
sampling theory and Inference
In inferential statistics, we have a large population, collect data from it to get a random sample of size ##n##, and infer the population parameters from that single sample.

I read that the random sample can be interpreted as the collection of ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable, and the height of each individual in the random sample is a realization of that r.v. However, a more correct interpretation of a random sample is the following: each element of the random sample, for example each of the 5 heights ##[6, 5.4, 6.1, 5.5, 6.4]##, is the realization of a different random variable. So the random sample is the realization of a random vector, a sequence of i.i.d. random variables ##[X_1, X_2, X_3, X_4, X_5]## with a joint probability distribution ##f(x_1, x_2, x_3, x_4, x_5)##. Why is this the correct interpretation of the random sample and not the first one with a single r.v.? Are the two interpretations somehow equivalent to each other? How?
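
For reference, here is what the i.i.d. assumption says about the joint distribution mentioned above. This is only a sketch: ##f_{X_i}## and ##f_X## are symbols introduced here for the marginal density of ##X_i## and for the common population density, and they do not appear elsewhere in the thread.

```latex
f(x_1, x_2, x_3, x_4, x_5)
  = \prod_{i=1}^{5} f_{X_i}(x_i)   % independence
  = \prod_{i=1}^{5} f_X(x_i)       % identical distribution
```

The first equality uses independence and the second uses the fact that every ##X_i## has the same distribution as the single population variable ##X##.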

When we perform regression analysis on some random sample of data, are we dealing with a pair of random variables, ##X## and ##Y##, i.e. a 2D random vector ##Z=(X,Y)##? Or with two random vectors, ##X=[X_1, X_2, X_3, X_4, X_5]## and ##Y= [Y_1, Y_2, Y_3, Y_4, Y_5]##, where each value of ##x## is the realization of a different random variable ##X_i## and each value of ##y## is the realization of a different random variable ##Y_i##?

Thank you as always for any comment and correction.
 
  • #2
fog37 said:
I read that the random sample can be interpreted as the collection of the ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable and the height of each individual in the random sample is a realization of the r.v. However, a more correct
Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every draw is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that more than one distribution is involved in the sample.

CORRECTION: I missed the IID part of the description of the second interpretation. I see no practical difference between the two interpretations.
 
  • #3
FactChecker said:
Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every draw is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that more than one distribution is involved in the sample.
Well, I have found this interpretation in several places. For example:
[Attached image: 1704940604794.png]

The population is an infinite set of values drawn from a random variable ##X##. Sampling from a population is the same as repeatedly drawing new values from ##X##. A random sample of size ##n## is a collection of ##n## individual draws from ##X##.

The point seems to be that ##n## independent draws from a random variable ##X## are equivalent to one draw of ##n## i.i.d. random variables ##X_1, X_2, \dots, X_n##. Is that really the case? Can you help me appreciate why the two scenarios are equivalent?
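
One way to probe this numerically is a small simulation. The sketch below is illustrative only: the normal height model (`MU`, `SIGMA`), the seeds, and the function names are assumptions, not anything from the sources quoted here. It builds samples of size 5 in both ways, by calling one source ##n## times and by calling ##n## separate but identically distributed sources once each, and compares the resulting sample means.

```python
import random
import statistics

# Hypothetical model for heights: X ~ Normal(MU, SIGMA). All values are assumptions.
MU, SIGMA, N, REPS = 5.9, 0.4, 5, 20000

def one_variable_n_draws(rng):
    """Interpretation 1: draw from a single X, n times in a row."""
    return [rng.gauss(MU, SIGMA) for _ in range(N)]

def n_variables_one_draw(rngs):
    """Interpretation 2: draw once from each of X_1..X_n (separate i.i.d. sources)."""
    return [rng.gauss(MU, SIGMA) for rng in rngs]

single = random.Random(1)
separate = [random.Random(100 + i) for i in range(N)]

# Repeat each sampling scheme many times and record the sample mean each time.
means_1 = [statistics.fmean(one_variable_n_draws(single)) for _ in range(REPS)]
means_2 = [statistics.fmean(n_variables_one_draw(separate)) for _ in range(REPS)]

print("scheme 1: mean, variance of sample mean:",
      statistics.fmean(means_1), statistics.pvariance(means_1))
print("scheme 2: mean, variance of sample mean:",
      statistics.fmean(means_2), statistics.pvariance(means_2))
print("theory:", MU, SIGMA ** 2 / N)
```

Under these assumptions both schemes should give sample means centered near `MU` with variance near `SIGMA**2 / N`, which is what one would expect if the two descriptions define the same joint distribution.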
 
  • #4
Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.
 
  • #5
FactChecker said:
Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.
Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the 2nd interpretation allows the sample statistics to also be random variables, as they are...

So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality but described in two different ways... Is one more technically correct than the other? As mentioned, when we talk about regression analysis, it seems better to regard the random sample of data, i.e. each pair of ##x## and ##y## values, as realizations of two random variables ##X## and ##Y## instead of two sequences of random variables, one for the ##x## values and one for the ##y## values...

For example, in the case of tossing a die multiple times, is the outcome of each toss the realization of a single random variable, or are the outcomes realizations of different random variables?
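
A small exact calculation may help frame the die question. It is only a sketch: it compares the sum of two separate tosses, modelled as ##X_1 + X_2## with independent ##X_1## and ##X_2##, against doubling a single toss, ##X + X = 2X##. The helper `variance` and the variable names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

def variance(values):
    """Exact variance of a uniform distribution over the listed outcomes."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((Fraction(v) - mean) ** 2 for v in values) / n

# All 36 equally likely (first toss, second toss) pairs: X_1 + X_2.
two_tosses = [a + b for a, b in product(faces, repeat=2)]
# One toss counted twice: X + X = 2X.
one_toss_doubled = [2 * a for a in faces]

var_sum = variance(two_tosses)
var_double = variance(one_toss_doubled)
print(var_sum, var_double)
```

This prints `35/6 35/3`. Both quantities have mean 7 but different variances, so once a statistic combines several tosses it matters that each toss gets its own random variable, even though all of them share one distribution.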

Thank you!

 
  • #6
fog37 said:
Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the 2nd interpretation allows the sample statistics to also be random variables, as they are...
I agree. It is a distinction that I have probably been careless about in the past. There is a difference between a sample, which is an already collected set of data, and the random variables that gave you that sample. I think it is standard to use lower case (##x_i##) for the data and upper case (##X_i##) for the random variables.
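
To make the upper-case versus lower-case convention concrete, here is a sketch using the five heights from post #1. The symbols ##\mu## and ##\sigma^2## for the population mean and variance are assumptions introduced only for this illustration.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\quad\text{(a random variable, with } E[\bar{X}] = \mu,\ \operatorname{Var}(\bar{X}) = \sigma^2/n \text{ for i.i.d. } X_i\text{)}

\bar{x} = \frac{1}{5}\,(6 + 5.4 + 6.1 + 5.5 + 6.4) = 5.88
\quad\text{(one observed value of } \bar{X}\text{)}
```

The first line only makes sense if the sample is described by random variables ##X_1, \dots, X_n##; the second line is what remains once the data have been collected.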
fog37 said:
So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality but described in two different ways... Is one more technically correct than the other?
IMO, one situation where the distinction is significant is when you collect data in stages, so that some data has been collected while other data has not yet been collected and is still a random variable. You might see this in stopping problems. Suppose that you were doing an experiment where collecting data was expensive or difficult and you needed to decide whether to collect more data. Also, I think that the distinction would be significant in many Bayesian methods with prior and posterior distributions, and in bootstrap methods.
I have no real experience with these types of problems and will have to leave this discussion to others.
 

What is sampling theory?

Sampling theory is the branch of statistics that provides guidelines and techniques for collecting samples from a larger population in a way that accurately represents the whole. It aims to infer properties of a population from a smaller, manageable subset of data. The theory helps in determining the sample size, the method of sample selection, and the analysis techniques needed to reduce bias and error in statistical conclusions.

What is a random sample?

A random sample is a subset of individuals chosen from a larger set (the population) by a chance mechanism. In simple random sampling, each individual has an equal probability of being chosen at every stage of the sampling process. Random selection guards against selection bias, so the sample is representative of the population on average, which makes the statistical results more reliable.

Why is random sampling important in research?

Random sampling is crucial in research because it minimizes sampling bias, ensuring that each member of the population has an equal chance of being included in the sample. This representativeness helps researchers draw more accurate and generalizable conclusions about the population from which the sample is drawn. It enhances the validity and reliability of the research outcomes, making the findings applicable to a broader context.

How do you determine the appropriate sample size for a study?

Determining the appropriate sample size for a study involves considering several factors including the expected effect size, the power of the test (usually set at 80% or higher), the significance level (commonly set at 0.05), and the population variance. Statistical formulas and software are often used to calculate the sample size needed to achieve reliable and valid results. Researchers must balance practical constraints such as time, cost, and resources with these statistical requirements.
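
As an illustration of how these inputs combine, here is a minimal sketch of one common textbook calculation: the normal-approximation sample size per group for a two-sided comparison of two means. The function name `n_per_group` and the example values of `delta` (smallest difference to detect) and `sigma` (population standard deviation) are assumptions chosen for the example; other study designs need other formulas.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with common standard deviation sigma."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detect a difference of half a standard deviation at alpha = 0.05, power = 0.80.
print(n_per_group(delta=0.5, sigma=1.0))  # 63
```

Halving `delta` roughly quadruples the required sample size, which is why the expected effect size usually dominates the calculation.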

What are some common methods of random sampling?

Common methods of random sampling include simple random sampling, where each member of the population has an equal chance of being selected; stratified random sampling, where the population is divided into subgroups (strata) and a random sample is taken from each stratum; and cluster sampling, where the population is divided into clusters and only a random selection of these clusters is chosen for sampling. Each method has its advantages and is chosen based on the research objectives and the structure of the population.
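
The three methods can be sketched in a few lines. Everything in this example is hypothetical: a toy population of 12 units labelled by group `"A"`, `"B"`, or `"C"`, where the group label plays the role of the stratum or the cluster, and the function names are chosen only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 units, each tagged with a group label.
population = [(group, i) for group in "ABC" for i in range(4)]

def simple_random(pop, n):
    """Every unit has the same chance of selection."""
    return rng.sample(pop, n)

def stratified(pop, n_per_stratum):
    """Split the population by group, then sample within every group."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[0], []).append(unit)
    return [u for members in strata.values()
            for u in rng.sample(members, n_per_stratum)]

def cluster(pop, n_clusters):
    """Pick whole groups at random and keep all units in the chosen groups."""
    labels = sorted({u[0] for u in pop})
    chosen = set(rng.sample(labels, n_clusters))
    return [u for u in pop if u[0] in chosen]

print(simple_random(population, 5))
print(stratified(population, 2))
print(cluster(population, 2))
```

Note the structural difference: stratified sampling draws from every group, while cluster sampling keeps entire groups and skips the rest.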
