4.4.2 Linear Relationship

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables.

A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables, then fitting a linear regression model to the data probably will not provide a useful model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other, but that there is some significant association between the two variables.

There are two main functions that can visualize a linear relationship as determined through regression. regplot()performs a simple linear regression model fit and plot. lmplot() combines regplot() and FacetGrid. Compared to the first one,lmplot() is more computationally intensive and is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

1. Regplot

import seaborn as sns
tips = sns.load_dataset("tips")
sns.regplot(x = 'total_bill', y = 'tip',data = tips)

2. Lmplot

sns.lmplot(x="total_bill", y="tip", data=tips)

It’s also possible to fit a linear regression when one of the variables takes discrete values. One option is to add some random noise (“jitter”) to the discrete values to make the distribution of those values more clear. The other is to collapse over the observations in each discrete bin to plot an estimate of central tendency along with a confidence interval.

sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean)

3. Conditioning Linear Regression

How does the relationship between these two variables change as a function of a third variable?

While regplot() always shows a single relationship, lmplot() combines regplot() with FacetGrid to provide an easy interface to show a linear regression on “faceted” plots that allow you to explore interactions with up to three additional categorical variables.

The best practice is to plot both levels on the same axes and to use color to distinguish them.

sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips,
           markers=["o", "x"])

To add another variable, we can draw multiple “facets” which each level of the variable appearing in the rows or columns of the grid.

sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

sns.lmplot(x="total_bill", y="tip", hue="smoker",
           col="time", row="sex", data=tips)

In the figure below, the two axes don’t show the same relationship conditioned on two levels of a third variable; rather, PairGrid() is used to show multiple relationships between different pairings of the variables in a dataset.

We can use height and aspect to control the height and width of each facet.

sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             height=5, aspect=.8, kind="reg")

sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             hue="smoker", height=5, aspect=.8, kind="reg")

Previous4.4.1 Scatter Plot Next4.4.3 Heatmap

Last updated 4 years ago

Was this helpful?