A Beginner’s Guide to Your First Linear Regression

When I say beginner, I don’t just mean you.

all photos courtesy of pixabay.com

One of the greatest obstacles to learning a technical subject is finding a teacher who can relate new information to you in an accessible way. If you have ever taken a calculus-based physics class, or tried to watch a programming tutorial on YouTube, you probably noticed that these brilliant instructors aren’t always the best at this. They may throw out concepts and terms that are (for now!) way over our heads.

This quick lesson is for beginners, by a beginner, and will hopefully serve as a great intro to running linear regressions in Python.

Let’s get started by importing a few libraries:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Don’t feel too overwhelmed if you haven’t seen a few of these libraries — each one serves an important role, and we’ll get into them as we progress.

Next, we’re going to read in a data set using the Pandas library, immediately assigning it to an arbitrary variable; in this case, “df”:

df = pd.read_csv('./datasets/sacramento_real_estate_transactions.csv')

If you would like to code along, which I recommend, this dataset can be found here. Once you have read your dataframe in with Pandas, we need to explore what we are looking at.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   street     985 non-null    object
 1   city       985 non-null    object
 2   zip        985 non-null    int64
 3   state      985 non-null    object
 4   beds       985 non-null    int64
 5   baths      985 non-null    int64
 6   sq__ft     985 non-null    int64
 7   type       985 non-null    object
 8   sale_date  985 non-null    object
 9   price      985 non-null    int64
 10  latitude   985 non-null    float64
 11  longitude  985 non-null    float64
dtypes: float64(2), int64(5), object(5)
memory usage: 92.5+ KB

Examining the Non-Null Count, we quickly see there aren’t any null values. We can also see that, under Dtype, we have a mixture of numeric and non-numeric data — and that is okay. We are going to keep it simple, and make our predictions without using all of the features.
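If you would rather check this in code than read it off the printout, here is a minimal sketch. The `sample` dataframe and its values are invented stand-ins so the snippet runs on its own; with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical stand-in rows (invented values) that reuse a few of the
# column names from the df.info() output.
sample = pd.DataFrame({
    "city": ["SACRAMENTO", "ELK GROVE", "SACRAMENTO"],
    "beds": [2, 3, 4],
    "sq__ft": [836, 1167, 1852],
    "price": [59222, 68212, 120000],
})

# Count missing values per column; all zeros means nothing is null.
null_counts = sample.isnull().sum()
print(null_counts)

# Split the columns into numeric and non-numeric groups.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

The numeric list is a handy shortlist of features we could feed to a regression without any encoding.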

Next, take the time to explore what you think might be relevant using Pandas. For example, check out the stats on square footage:
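One way to pull those stats is the `describe()` method. This is only a sketch: the `sample` rows below are invented so the snippet runs on its own, and with the real data you would call `df['sq__ft'].describe()` instead.

```python
import pandas as pd

# Hypothetical stand-in data (invented values), not the real dataset.
sample = pd.DataFrame({
    "sq__ft": [836, 1167, 796, 852, 1122],
    "price": [59222, 68212, 68880, 69307, 89921],
})

# describe() reports count, mean, std, min, quartiles and max for a column.
stats = sample["sq__ft"].describe()
print(stats)
```

Keep an eye on the min and max in particular, since odd values there often point to data that needs cleaning.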


Or, take a glance at our correlations:
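A rough sketch of a correlation check, again on an invented `sample` dataframe so it runs on its own. Selecting the numeric columns first is my own choice here, so that text columns don't get in the way; on the real data you would swap `sample` for `df`.

```python
import pandas as pd

# Hypothetical stand-in data (invented values), not the real dataset.
sample = pd.DataFrame({
    "city": ["SACRAMENTO", "ELK GROVE", "SACRAMENTO", "RIO LINDA", "ELK GROVE"],
    "beds": [2, 3, 2, 3, 4],
    "sq__ft": [836, 1167, 796, 1122, 1852],
    "price": [59222, 68212, 68880, 89921, 120000],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = sample.select_dtypes(include="number").corr()

# Show how strongly each numeric column moves with price.
print(corr["price"].sort_values(ascending=False))
```

Values near 1 or -1 suggest a strong linear relationship with price, and values near 0 suggest a weak one.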


After exhausting your Pandas tools, we would normally move on to data visualization. Since that is not the goal of this lesson, I’ll leave you with one simple graph as inspiration for when you come back to this tutorial and want to throw some new skills at it.

sns.scatterplot(x=df['price'], y=df['sq__ft'],
                hue=df['sq__ft']).set_title('Square footage vs price');

Next up, we need to define a few variables. Because we don’t have any scaling or encoding tools in our toolbox yet, we are going to keep it simple, and try to predict house price using just one numeric feature. Let’s go with square footage, since our pretty graph above is showing a clear linear relationship. Also, as I am sure you guessed, our target variable is going to be “price.”

Our variables are going to look like this:

# Note: with more than one variable, we would use a capital X.
# The double brackets store our variable as a dataframe.
x = df[['sq__ft']]
y = df['price']
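To see what the double brackets actually change, here is a small sketch on an invented three-row `sample` dataframe (made-up values, not the real data). scikit-learn expects the features as a two-dimensional table and the target as a one-dimensional column, which is why the two selections look different.

```python
import pandas as pd

# Hypothetical stand-in data (invented values), not the real dataset.
sample = pd.DataFrame({
    "sq__ft": [836, 1167, 796],
    "price": [59222, 68212, 68880],
})

x = sample[["sq__ft"]]  # double brackets: a DataFrame with one column
y = sample["price"]     # single brackets: a Series

print(type(x).__name__, x.shape)  # DataFrame (3, 1)
print(type(y).__name__, y.shape)  # Series (3,)
```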

Now on to the fun: let’s instantiate our linear regression. For this example, we will be using scikit-learn’s LinearRegression. There are a few others to choose from (or, if you were a math major, you could do it by hand), but it’s a good idea to start getting used to scikit-learn’s modeling tools.

lr = LinearRegression()

Instantiating our regression is as simple as that! Next, we fit it to our variables:

lr.fit(x, y)
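Fitting stores the line the model learned, and you can peek at it through the `coef_` and `intercept_` attributes. The sketch below uses invented data that sits exactly on the line price = 100 * sq__ft + 5000 (my own made-up numbers), so you can see the model recover the slope and intercept.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data built to lie exactly on a known line (invented values).
sample = pd.DataFrame({"sq__ft": [800, 1000, 1200, 1500]})
sample["price"] = 100 * sample["sq__ft"] + 5000

x = sample[["sq__ft"]]
y = sample["price"]

lr = LinearRegression()
lr.fit(x, y)

slope = lr.coef_[0]        # change in predicted price per extra square foot
intercept = lr.intercept_  # predicted price at zero square feet

print(round(slope, 2), round(intercept, 2))
```

On the real data the numbers will be messier, but they read the same way: the slope is the predicted change in price for each additional square foot.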

You just finished your first linear regression model! Not too bad, right? There are plenty of ways to evaluate what we just did, but for now we will keep it simple, and print out the most common metric for scoring a regression: R-squared. Before we can do that, though, we have to create a variable that will house our price predictions:

y_pred = lr.predict(x)

Now, we use those predictions to calculate our score, again using scikit-learn tools:

r2_score(y, y_pred)

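If you followed along closely, you should have scored around 0.48. What does that mean? It means that square footage alone explains about 48% of the variation in sale price across the houses in our data, and that you didn’t do too badly for a first regression! Keep in mind that we scored the model on the same data we fit it to, so this number doesn’t yet tell us how well it would predict houses it has never seen.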
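If you are curious where that number comes from, R-squared is one minus the ratio of the model’s squared errors to the squared errors you would get by always guessing the average price. The sketch below computes it by hand on randomly generated stand-in data (invented, not the Sacramento dataset) and compares the result with `r2_score` and the model’s own `score` method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical stand-in data: a noisy linear relationship (invented values).
rng = np.random.default_rng(0)
sq_ft = rng.uniform(500, 3000, size=200)
price = 120 * sq_ft + rng.normal(0, 60000, size=200)

x = sq_ft.reshape(-1, 1)  # features must be two-dimensional
y = price

lr = LinearRegression().fit(x, y)
y_pred = lr.predict(x)

# Squared error of the model vs. squared error of guessing the mean.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree.
print(r2_manual, r2_score(y, y_pred), lr.score(x, y))
```

A score of 1 would mean the predictions are perfect, and a score of 0 would mean the model does no better than guessing the average every time.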

There is plenty more we could do to improve our model: encoding our non-numeric data, making a heatmap of correlations to choose a few more features, scaling our data, trying different regression models, tuning those models’ parameters … the list goes on and on! But we have created a great first model to start building our skills on.

Join me next week, as we start digging more into this prediction, adding some of the tools we talked about, and growing our skills as data scientists.

Aspiring Data Scientist and student at General Assembly.