The Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables
are linearly related. Its formal definition is:
$$\rho_{X,Y}=\frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\times \mathrm{Var}(Y)}}$$
In a sample, the correlation coefficient is called \(r_{xy}\) (\(r\) for simplicity) and can be computed as:
$$r_{xy}=\frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i-\bar x}{s_x}\right)\left( \frac{y_i-\bar y}{s_y}\right)$$
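As an illustration, here is a minimal Python sketch of this formula (the app itself runs in R; the data values below are made up):

```python
import numpy as np

def sample_correlation(x, y):
    # r = 1/(n-1) * sum(((x_i - xbar)/s_x) * ((y_i - ybar)/s_y)),
    # using sample standard deviations (ddof=1).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(sample_correlation(x, y))  # agrees with np.corrcoef(x, y)[0, 1]
```

Because the \(n-1\) factors in the standard deviations and in the sum cancel, this gives the same value as the built-in correlation function.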
In this exercise, you can indicate a desired value of \(\rho\) and the program obtains a random sample
corresponding to two variables with this correlation. The red line indicates the true relationship, a line
with slope \(\beta_1=\rho \frac{\sigma_Y}{\sigma_X}\) and intercept \(\beta_0=\mu_Y-\beta_1 \mu_X\).
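The sampling step and the true line can be sketched in Python (a rough stand-in for what the app does in R; the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Population parameters (arbitrary illustrative values)
mu_X, mu_Y = 10.0, 50.0
sigma_X, sigma_Y = 2.0, 5.0
rho = 0.8

# True regression line: beta1 = rho * sigma_Y / sigma_X,
# beta0 = mu_Y - beta1 * mu_X
beta1 = rho * sigma_Y / sigma_X
beta0 = mu_Y - beta1 * mu_X

# Draw one random sample from a bivariate normal with this rho
cov = [[sigma_X**2, rho * sigma_X * sigma_Y],
       [rho * sigma_X * sigma_Y, sigma_Y**2]]
sample = rng.multivariate_normal([mu_X, mu_Y], cov, size=200)
r_hat = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

print(beta0, beta1)  # true intercept and slope
print(r_hat)         # sample correlation, close to rho for n = 200
```

Plotting the sample together with the line \(y=\beta_0+\beta_1 x\) reproduces the red line shown by the app.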
Resulting model
Simulation of linear regression
True model
Sample
Regression plot
Regression analysis
Distribution of r values in samples of correlated variables
You can indicate a value of the population correlation \(\rho\)
for two variables. The program generates the specified number of
samples of the given size and computes the resulting correlation \(r\)
for each sample. The distribution of \(r\) shows the range of values
that can arise from random sampling fluctuation alone.
Results
In this simulation we show the result of several samples of a given size. As the size increases, the distribution
of \(r\) narrows around \(\rho\). For small samples, this value fluctuates over a wide range.
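A sketch of this simulation in Python (the app runs in R; \(\rho = 0.6\) and the sample sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_samples = 0.6, 2000
cov = [[1.0, rho], [rho, 1.0]]

def r_distribution(n):
    # Collect the sample correlation r from n_samples samples
    # of size n drawn from a bivariate normal with correlation rho.
    rs = np.empty(n_samples)
    for i in range(n_samples):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        rs[i] = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    return rs

# Standard deviation of r for increasing sample sizes
spread = {n: r_distribution(n).std() for n in (10, 50, 200)}
print(spread)  # the spread of r shrinks as the sample size grows
```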
Analysis of the dataset fat (library UsingR)
In this exercise, we use the dataset fat that comes with the package UsingR.
You can select a pair of variables and observe their correlation. A linear model is
fitted in each case and the result of the corresponding fit is shown.
In the heat map, the correlation between all pairs of variables is shown as a guide for selecting
variables to use. We can explore the use of several variables as predictors in the
Multivariate tab.
Multivariate analysis
Using the dataset fat in the package UsingR, you can explore the use of several predictors
in a linear regression model. As the variable to predict, we suggest body.fat.
As initial predictors, we suggest age and BMI.
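A sketch of such a multi-predictor fit in Python, using synthetic stand-ins for age, BMI, and body.fat (the real data live in the R package UsingR and are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-ins for the suggested predictors and response;
# the coefficients 0.2 and 1.1 are arbitrary, not estimates from fat.
age = rng.uniform(20, 70, n)
bmi = rng.normal(26, 4, n)
body_fat = 0.2 * age + 1.1 * bmi + rng.normal(0, 3, n)

# Least-squares fit of body_fat ~ age + bmi via the design matrix
X = np.column_stack([np.ones(n), age, bmi])
coef, *_ = np.linalg.lstsq(X, body_fat, rcond=None)
print(coef)  # [intercept, age coefficient, bmi coefficient]
```

The recovered coefficients approximate the values used to generate the data, which is the same logic the app applies when fitting `lm(body.fat ~ age + BMI)` to the real dataset.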
Select input for the Dependent Variable
Select inputs for the Independent Variables
Comparing regression models in two groups
In this application, we explore the use of linear models with one
continuous predictive variable (Age) and one factor (Sex). The
problem to solve is to decide whether we need a model with interaction
(non-parallel regression lines for each sex), a model with effects
of Age and Sex but no interaction (parallel lines), or a model without
the variable Sex (the same line for both sexes).
You can specify the theoretical situation by indicating the true
model for both sexes and the error term \(\sigma\). The
application generates random samples, so you can compare the
models fitted to the generated data with the true models.
Some questions worth exploring:
(1) What is the effect of increasing the variability?
(2) Can you always reach a conclusion that matches the true generating models?
(3) How does the sample size affect the analyses?
Models can be compared to assess whether adding terms produces a better
explanation of the observed variability in the data.
If the comparison with the larger model yields a low p-value, we can
conclude that this model reduces the variability with respect to the
simpler model. If the p-value is high, it indicates that the added
complexity may not be needed.
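This kind of nested-model comparison can be sketched in Python with a partial F-test (the app performs the equivalent comparison in R; the simulated data below use parallel true lines, i.e. no interaction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 120

# Simulated data: one continuous predictor (age) and one factor (sex),
# generated with parallel true lines (no interaction term).
age = rng.uniform(20, 80, n)
sex = rng.integers(0, 2, n)          # 0/1 coding of the factor
y = 1.0 + 0.5 * age + 3.0 * sex + rng.normal(0, 2, n)

def rss(X, y):
    # Residual sum of squares of the least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(n)
X_add = np.column_stack([ones, age, sex])             # parallel lines
X_int = np.column_stack([ones, age, sex, age * sex])  # with interaction

rss_add, rss_int = rss(X_add, y), rss(X_int, y)
df_extra, df_resid = 1, n - X_int.shape[1]
F = ((rss_add - rss_int) / df_extra) / (rss_int / df_resid)
p = stats.f.sf(F, df_extra, df_resid)
print(F, p)  # a high p-value suggests the interaction is not needed
```

This mirrors R's `anova(fit_additive, fit_interaction)`: the interaction model always has a smaller residual sum of squares, and the F-test decides whether the reduction is larger than random noise would explain.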
Results
(c) Albert Sorribas, Ester Vilaprino, Montse Rue, Rui Alves.
Biomodels Grup, Departament de Ciencies Mediques Basiques.
Universitat de Lleida, Institut de Recerca Biomedica de Lleida (IRBLleida).