The Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. Its formal definition is: $$\rho_{X,Y}=\frac{cov(X,Y)}{\sqrt{Var(X)\times Var(Y)}}$$ In a sample, the correlation coefficient is called \(r_{xy}\) (\(r\) for simplicity) and can be computed as: $$r_{xy}=\frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i-\bar x}{s_x}\right)\left( \frac{y_i-\bar y}{s_y}\right)$$ where \(s_x\) and \(s_y\) are the sample standard deviations.
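As a minimal sketch, the sample coefficient can be computed directly as the sum of products of standardized values divided by \(n-1\). The function name `pearson_r` and the small data vectors are illustrative assumptions, not part of the application.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation via products of standardized values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations, using the divisor n - 1
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # Sum of products of standardized values, divided by n - 1
    total = sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

# Hypothetical data, chosen only to exercise the function
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
print(round(r, 4))
```

Any variable correlated with itself gives \(r=1\), which is a quick sanity check for an implementation like this one.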

In this exercise, you can indicate a desired value of \(\rho\), and the program draws a random sample of two variables with this correlation. The red line indicates the true relationship: a line with slope \(\beta_1=\rho \frac{\sigma_Y}{\sigma_X}\) and intercept \(\beta_0=\mu_Y-\beta_1 \mu_X\).
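The text does not say how the program draws its sample. One common approach, assumed here, is to build a bivariate normal pair from two independent standard normal draws. The sketch below also computes the slope and intercept of the true line from the formulas above. The names `simulate_pair` and `true_line`, and all parameter values, are hypothetical.

```python
import math
import random

def simulate_pair(n, rho, mu_x=0.0, mu_y=0.0, sigma_x=1.0, sigma_y=1.0, rng=None):
    """Draw n (x, y) pairs from a bivariate normal with correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    rng = rng or random.Random()
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mixing z1 and z2 this way gives two unit-variance variables
        # whose population correlation is exactly rho
        xs.append(mu_x + sigma_x * z1)
        ys.append(mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return xs, ys

def true_line(rho, mu_x, mu_y, sigma_x, sigma_y):
    """Intercept and slope of the population regression line of Y on X."""
    beta_1 = rho * sigma_y / sigma_x
    beta_0 = mu_y - beta_1 * mu_x
    return beta_0, beta_1

# Hypothetical settings for illustration
xs, ys = simulate_pair(100, 0.8, mu_x=10.0, mu_y=20.0,
                       sigma_x=2.0, sigma_y=4.0, rng=random.Random(0))
beta_0, beta_1 = true_line(0.8, 10.0, 20.0, 2.0, 4.0)
print(beta_0, beta_1)
```

The line fitted to any single sample will generally differ from this true line; the difference reflects sampling variability, not an error in the generator.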

Resulting model

You can indicate a value of the population correlation \(\rho\) for two variables. The program generates the specified number of samples of the given size and computes the resulting correlation (\(r\)) for each sample. The distribution of \(r\) shows the range of values that can arise from random sampling fluctuation alone.

In this simulation we show the results of several samples of a given size. As the sample size increases, the distribution of r narrows around \(\rho\). For small samples, the value of r fluctuates over a wide range.
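The narrowing of the distribution of r can be reproduced with a short simulation. This is a sketch under assumptions: samples are drawn from a bivariate normal distribution with unit variances, and the values \(\rho=0.6\), 500 samples, and the three sample sizes are arbitrary illustrative choices.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_r(n, rho, n_samples, rng):
    """Return the sample correlation r for each of n_samples samples of size n."""
    rs = []
    for _ in range(n_samples):
        xs, ys = [], []
        for _ in range(n):
            z1 = rng.gauss(0.0, 1.0)
            z2 = rng.gauss(0.0, 1.0)
            xs.append(z1)
            ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        rs.append(pearson_r(xs, ys))
    return rs

rng = random.Random(1)  # fixed seed so the run is reproducible
spread = {}
for n in (10, 50, 250):
    rs = simulate_r(n, 0.6, 500, rng)
    spread[n] = statistics.stdev(rs)
    # Mean and standard deviation of r across the simulated samples
    print(n, round(statistics.mean(rs), 3), round(spread[n], 3))
```

The standard deviation of r across samples should shrink as the sample size grows, which is the numerical counterpart of the narrowing histogram.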

In this exercise, we use the dataset fat that comes with the package UsingR. You can select a pair of variables and observe their correlation. A linear model is fitted in each case and the result of the fit is shown. The heat map shows the correlations among all the variables as a guide for selecting which variables to use. The use of several variables as predictors can be explored in the Multivariate tab.

Using the data set fat in the package UsingR, you can explore the use of several predictors in a linear regression model. As the variable to predict, we suggest body.fat. As initial predictors, we suggest age and BMI.

In this application, we explore the use of linear models with one continuous predictive variable (Age) and one factor (Sex). The problem to solve is to decide whether we need a model with interaction (non-parallel regression lines for each sex), a model with effects of Age and Sex but no interaction (parallel lines), or a model without the variable Sex (the same line for both sexes). You can specify the theoretical situation by indicating the true model for each sex and the standard deviation of the error term, \(\sigma\). The application generates random samples, and you can compare the models fitted to the generated data with the true models.

Some questions worth exploring:

(1) What is the effect of increasing the variability?

(2) Can you always reach a conclusion that matches the true generating models?

(3) How does the sample size affect the analyses?

If the model with more components has a low p-value, we can conclude that this model reduces the residual variability with respect to the less complex model. If the p-value is high, the added complexity may not be needed.
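The text does not name the test behind these p-values. A usual choice for nested linear models, assumed here, is the F test based on residual sums of squares. The sketch below fits the three candidate models by least squares on simulated data whose true model has parallel lines, and computes the F statistic for each comparison. It stops at the F statistic to stay within the standard library; the p-value would be the upper tail of an F distribution with the corresponding degrees of freedom. All names and parameter values are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design matrix."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def f_statistic(rss_small, p_small, rss_big, p_big, n):
    """F statistic comparing a smaller model nested inside a bigger one."""
    return ((rss_small - rss_big) / (p_big - p_small)) / (rss_big / (n - p_big))

rng = random.Random(7)
n = 200
age = [rng.uniform(20, 70) for _ in range(n)]
sex = [i % 2 for i in range(n)]  # 0/1 coding of the factor
# Assumed true model: parallel lines (Sex effect, no interaction), sigma = 2
y = [10 + 0.5 * a + 5.0 * s + rng.gauss(0, 2.0) for a, s in zip(age, sex)]

m_age = [[1.0, a] for a in age]                           # same line for both sexes
m_add = [[1.0, a, s] for a, s in zip(age, sex)]           # parallel lines
m_int = [[1.0, a, s, a * s] for a, s in zip(age, sex)]    # non-parallel lines

rss_age, rss_add, rss_int = rss(m_age, y), rss(m_add, y), rss(m_int, y)
f_sex = f_statistic(rss_age, 2, rss_add, 3, n)
f_interaction = f_statistic(rss_add, 3, rss_int, 4, n)
print(round(f_sex, 2), round(f_interaction, 2))
```

With these settings the F statistic for adding Sex should be large, while the one for adding the interaction behaves like a draw from its null distribution, so it can occasionally be large by chance. This is one reason a single sample does not always lead to the true generating model.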