Linear regression

The Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. Its formal definition is: $$\rho_{X,Y}=\frac{Cov(X,Y)}{\sqrt{Var(X)\times Var(Y)}}$$ In a sample, the correlation coefficient is called $r_{xy}$ ($r$ for simplicity) and can be computed as: $$r_{xy}=\frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i-\bar x}{s_x}\right)\left( \frac{y_i-\bar y}{s_y}\right)$$ where $s_x$ and $s_y$ are the sample standard deviations.
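As a minimal sketch of the sample formula, the following Python function computes $r$ as the sum of products of standardized values divided by $n-1$. The function name `pearson_r` and the example data are illustrative assumptions, not part of the application.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation via the standardized-product formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations s_x and s_y (divisor n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # Sum of products of standardized values, divided by n - 1
    total = sum(
        ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
        for xi, yi in zip(x, y)
    )
    return total / (n - 1)


# Hypothetical data: y is an exact linear function of x,
# so r equals 1 up to floating-point rounding.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
r = pearson_r(x, y)
```

Reversing the order of `y` in the example gives $r=-1$, the other extreme of the range.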

In this exercise, you can indicate a desired value of $\rho$, and the program draws a random sample of two variables with this population correlation. The red line shows the true relationship: a line with slope $\beta_1=\rho \frac{\sigma_Y}{\sigma_X}$ and intercept $\beta_0=\mu_Y-\beta_1 \mu_X$.
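One way such a sample could be generated is sketched below in Python. It assumes a bivariate normal population, which the text does not state explicitly, and builds correlated pairs from two independent standard normal draws. All function names and parameter values are illustrative assumptions, not the application's actual code.

```python
import math
import random


def simulate_pairs(n, rho, mu_x, sigma_x, mu_y, sigma_y, seed=None):
    """Draw n (x, y) pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mu_x + sigma_x * z1
        # Mixing z1 and z2 in this way gives Corr(x, y) = rho
        y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((x, y))
    return pairs


def true_line(rho, mu_x, sigma_x, mu_y, sigma_y):
    """Intercept and slope of the true regression line of Y on X."""
    beta_1 = rho * sigma_y / sigma_x
    beta_0 = mu_y - beta_1 * mu_x
    return beta_0, beta_1


def fitted_line(pairs):
    """Least-squares intercept and slope estimated from the sample."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical settings: rho = 0.5, X ~ N(10, 2), Y ~ N(20, 4)
beta_0, beta_1 = true_line(0.5, 10.0, 2.0, 20.0, 4.0)  # (10.0, 1.0)
sample = simulate_pairs(50, 0.5, 10.0, 2.0, 20.0, 4.0, seed=1)
b0, b1 = fitted_line(sample)  # estimates that vary around the true values
```

With a small sample the fitted line can differ noticeably from the true line; the two tend to agree more closely as the sample size grows.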

Correlation $ (\rho) $

Sample size

$\mu_X$

$\sigma_X$

$\mu_Y$

$\sigma_Y$

Resulting model

Range of X axis

Range of Y axis

Simulation of linear regression

True model

Intercept (a):

Slope (b):

Sample

Sample size:

Variability (sigma)

Graphic type:

Points

Points and regression line

Complete with confidence intervals

Regression line and true model

Residuals

Scale y min:

Scale y max:

Regression plot

Regression analysis

Distribution of r values in samples of correlated variables

You can indicate a value of the population correlation $\rho$ for two variables. The program generates the specified number of samples of the given size and computes the resulting correlation ($r$) for each sample. The distribution of $r$ shows the range of values that can arise from random sampling fluctuation alone.

Number of points per sample (Sample size):

Number of samples:

Correlation $ (\rho) $

Results

In this simulation we show the result of several samples of a given size. As the size increases, the distribution of $r$ narrows around $\rho$. For small samples, this value fluctuates over a wide range.
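The narrowing of the distribution of $r$ can be checked with a short simulation. The Python sketch below assumes bivariate normal data with unit variances and compares the spread of $r$ for two sample sizes; the function names, the seed, and the chosen values of $\rho$ and $n$ are illustrative assumptions.

```python
import math
import random
import statistics


def sample_r(n, rho, rng):
    """Draw one sample of size n and return its Pearson correlation."""
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def simulate_r_values(n_samples, n, rho, seed=None):
    """Correlations from n_samples independent samples of size n."""
    rng = random.Random(seed)
    return [sample_r(n, rho, rng) for _ in range(n_samples)]


# Hypothetical settings: rho = 0.5, 500 samples per sample size
r_small = simulate_r_values(500, 10, 0.5, seed=42)
r_large = simulate_r_values(500, 100, 0.5, seed=42)

# The spread of r is expected to be clearly smaller for n = 100
spread_small = statistics.stdev(r_small)
spread_large = statistics.stdev(r_large)
```

A histogram of `r_small` and `r_large` would correspond to the plot shown by the application for each sample size.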

Number of bins:

Analysis of dataset fat (library UsingR)

In this exercise, we use the dataset fat that comes with the package UsingR. You can select a pair of variables and observe their correlation. A linear model is fitted in each case and the result of the fit is shown. The heat map shows the correlations among all the variables as a guide for selecting which variables to use. We can explore the use of several variables as predictors in the Multivariate tab.

Dependent variable (Y):

Predictor variable (X):

Multivariate analysis

Using the dataset fat in the package UsingR, you can explore the use of several predictors in a linear regression model. As the variable to predict, we suggest body.fat. As initial predictors, we suggest age and BMI.

Select inputs for the Dependent Variable

Dependent Variables

Select input for the Independent Variable

Independent Variables

Comparing regression models in two groups

In this application, we explore the use of linear models with one continuous predictor (Age) and one factor (Sex). The problem is to decide whether we need a model with interaction (non-parallel regression lines for each sex), a model with effects of Age and Sex but no interaction (parallel lines), or a model without the variable Sex (the same line for both sexes). You can specify the theoretical situation by indicating the true model for each sex and the error term $\sigma$. The application generates random samples, and you can compare the models fitted to the generated data with the true models.

Some questions worth exploring:
(1) What is the effect of increasing the variability?
(2) Can you always reach a conclusion that matches the true generating models?
(3) How do the sample sizes affect the analyses?

True model for males

Intercept at 55 years

Slope

Sample size

True model for females

Intercept at 55 years

Slope

Sample size

Variability from the model

Value of $\sigma_\epsilon$ (Males)

Value of $\sigma_\epsilon$ (Females)

Fit models
Compare models

Select a model

Only Age

Age and Sex, no interaction

Age, Sex, and interaction

Fitted model:

Interpretation

Models can be compared to assess whether adding terms produces a better explanation of the observed variability in the data.
If the comparison gives a low p-value for the model with more components, we can conclude that this model reduces the residual variability with respect to the less complex model. If the p-value is high, the added complexity may not be needed.
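The comparison between the simplest model (Only Age) and the most complex one (Age, Sex, and interaction) can be sketched in Python as an F statistic for nested models. The sketch relies on the fact that the interaction model is equivalent to fitting a separate line for each sex, so its residual sum of squares is the sum of the two group-wise values. It assumes a common error variance in both groups, and it returns only the F statistic and its degrees of freedom; the p-value would come from the F distribution, which is not available in the standard library. All names and numerical values are illustrative assumptions, not the application's code.

```python
import random


def rss_simple(x, y):
    """Residual sum of squares of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return syy - sxy ** 2 / sxx


def f_age_vs_interaction(age_m, y_m, age_f, y_f):
    """F statistic comparing 'Only Age' with 'Age, Sex, and interaction'."""
    # Full model: separate intercept and slope for each sex (4 parameters)
    rss_full = rss_simple(age_m, y_m) + rss_simple(age_f, y_f)
    # Reduced model: one common line for both sexes (2 parameters)
    rss_reduced = rss_simple(age_m + age_f, y_m + y_f)
    n = len(age_m) + len(age_f)
    df_num = 2        # extra parameters in the full model
    df_den = n - 4    # residual degrees of freedom of the full model
    f_value = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_value, df_num, df_den


def simulate_group(n, intercept_at_55, slope, sigma, rng):
    """Simulate ages and responses from a true line centered at age 55."""
    ages = [rng.uniform(40.0, 70.0) for _ in range(n)]
    ys = [intercept_at_55 + slope * (a - 55.0) + rng.gauss(0.0, sigma)
          for a in ages]
    return ages, ys


# Hypothetical true models: same slope, intercepts 20 units apart
rng = random.Random(2024)
age_m, y_m = simulate_group(50, 120.0, 0.5, 5.0, rng)
age_f, y_f = simulate_group(50, 100.0, 0.5, 5.0, rng)
f_value, df_num, df_den = f_age_vs_interaction(age_m, y_m, age_f, y_f)
```

A large F value relative to the F distribution with `df_num` and `df_den` degrees of freedom corresponds to a low p-value, favoring the more complex model. Comparing the intermediate model (Age and Sex, no interaction) would require a multiple regression fit, which this sketch does not cover.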

Results

(c) Albert Sorribas, Ester Vilaprino, Montse Rue, Rui Alves. Biomodels Grup, Departament de Ciencies Mediques Basiques. Universitat de Lleida, Institut de Recerca Biomedica de Lleida (IRBLleida)
