How to Learn Machine Learning? – The Matrix of Features and The Target Variable

Hey Guys! Welcome to the next episode of “How to Learn Machine Learning” tutorials. In the previous episode, we learned how to create and process datasets, and we introduced dependent and independent variables. If you missed the first article in our series, I definitely recommend checking it out.

The Matrix of Features

Today I want you to understand a fundamental rule of Machine Learning, which is the difference between features and target. We need to distinguish two components which will frequently be used throughout our set of tutorials:

  • Matrix of features
  • Target Variable

Let’s understand what the matrix of features is. Open Spyder and click on the dataset.

The matrix of features is a term used in machine learning to describe the columns that contain the independent variables to be processed, across all lines in the dataset. These lines in the dataset are called lines of observation.

matrix_of_features_diagram

We are going to create a matrix of features for the four independent variables in the ten-line dataset above (our ten lines of observations). What we have are:

  1. Country
  2. Age
  3. Salary
  4. Occupation

Let’s jump into creating our matrix of features.

Open Spyder and type the following line of code:

x = dataset.iloc[:, :-1].values

You might be wondering: what did we just do here? We declared a variable “x,” then called the “iloc” method on our variable “dataset” with specific parameters to select all the columns of independent variables. Note that “iloc” works on positions in the index, so it only takes integers. The colon on the left selects all of the lines of observations, while the colon on the right followed by the integer -1 selects all of the columns except the last one. This is exactly what we want here, because we want only the independent variables with their lines.

iloc_method_explained

We excluded the last column, which contains our dependent variable. The “.values” attribute returns the underlying values of our selection as a NumPy array. Let’s see what we receive as a result of this operation. Type “x” in the Spyder console.

As you can see, we have selected the first four columns, while excluding the last one.

x_array_printed
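The whole selection can be reproduced on a small stand-in dataset. The column names below mirror the tutorial’s, but the values are illustrative, not the article’s actual file:

```python
import pandas as pd

# A small stand-in for the tutorial's dataset; values are illustrative.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Occupation": ["Engineer", "Teacher", "Nurse"],
    "Self-Employed": ["No", "Yes", "No"],
})

# All rows, every column except the last one -> matrix of features.
x = dataset.iloc[:, :-1].values
print(x.shape)   # (3, 4): 3 lines of observations, 4 independent variables
```

Because “iloc” is purely positional, this same line works no matter what the columns are named, as long as the dependent variable sits in the last column.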

For now, we are done with the selection of the matrix of features. We can move on to the next component, the “Target Variable Vector.”

The Target Variable Vector

The target variable vector is a term used in machine learning for the column of dependent variable values in the dataset. Here, too, each line of observation contributes one value.

the_dependent_variables_vector

We are going to create the “dependent variables vector,” which is the last column in our dataset, labeled “self-employed,” consisting of the ten lines of observations.

In Spyder, type the following line:

y = dataset.iloc[:, 4].values

Let’s go over this line of code. We created a variable “y” and used the same “iloc” method that we used for the matrix of features. The left colon again selects all of the lines, which we call lines of observations, but instead of a colon on the right, this time we put the number 4, selecting the last column, which has index 4 and position 5. Remember, indexes in Python, as in most programming languages, start at 0.

dependent_variables_vector_select

This operation should have selected the last column which is our dependent variable, “self-employed,” and all of the 10 lines of observations.

To test this, type “y” in the Spyder console.

y_array_printed

As you should see, we have selected the dependent variables list with an output of “yes” or “no,” indicating whether a person is self-employed or not.
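On the same kind of stand-in dataset, selecting the target vector by column position looks like this (again, the names and values are illustrative):

```python
import pandas as pd

dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Occupation": ["Engineer", "Teacher", "Nurse"],
    "Self-Employed": ["No", "Yes", "No"],
})

# All rows, only the column at position 4 (the last one) -> target vector.
y = dataset.iloc[:, 4].values
print(y)   # ['No' 'Yes' 'No']
```

An equivalent way to write this is “dataset.iloc[:, -1].values,” which always grabs the last column regardless of how many features come before it.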

So that’s it for Python. It’s now time to switch to R. With RStudio open, you’ll notice the operations are much more straightforward, because we do not have to make a distinction between the matrix of features and the dependent variable vector.

We will have to set a working directory in R as well. In RStudio, open the Files pane at the bottom right of the screen, navigate to your working directory, click More, and choose the “Set As Working Directory” option.

set_r_working_directory

If all is correct, you should see the command echoed in the console, with your own working directory path, of course:

setwd("~/Dev/AI/Intro to datasets")

working_directory_set

Now we are ready to start importing the dataset. To do this in R, you should follow these simple steps.

Create a new R file and name it “data_prep_draft.r.” Save it to the same working directory as the previously created files.

Now we need just one more line of code. As with Python, we are going to name the variable that will hold the dataset “dataset,” and use the read.csv function to read and import the CSV file created earlier.

dataset = read.csv('self-employed.csv')
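For comparison, the Python side does the same thing with a single pandas call. The snippet below uses an in-memory CSV so it runs on its own; in the tutorial itself, the file ‘self-employed.csv’ is read from the working directory instead:

```python
import io
import pandas as pd

# Stand-in for the 'self-employed.csv' file; the contents are illustrative.
csv_text = """Country,Age,Salary,Occupation,Self-Employed
France,44,72000,Engineer,No
Spain,27,48000,Teacher,Yes
"""

# In the tutorial this would be: dataset = pd.read_csv('self-employed.csv')
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset.columns.tolist())
```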

After performing this operation, you should be able to see the imported dataset.

imported_r_dataset

There are two clear distinctions that you should know:

  • Unlike Python, indexes in R start at 1, so in our lines of observations you should see ten lines, indexed from 1 to 10.
  • You do not have to programmatically distinguish between the matrix of features and the dependent variables vector in R.
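The first distinction is easy to check from the Python side: position 0 refers to the first line of observation, whereas in R the analogous first-row access would be dataset[1, ]. A tiny sketch with illustrative values:

```python
import pandas as pd

dataset = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44, 27],
})

# In Python, position 0 is the first line of observation.
first_row = dataset.iloc[0]
print(first_row["Country"])   # France
```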

This will start making perfect sense as we dive deeper in future tutorials. That’s it for today; stay tuned for the upcoming episodes. I hope my explanations are clear and easy for you to understand. It is my wish to make the learning curve as gentle as possible in this complex world of machine learning and data science.

 
