In a dataset, each measurable property (usually stored in a column) is a variable. A dataset typically contains many variables, which can be categorized into:

  • Input variables (features, predictors, independent variables): used to predict, or draw inferences about, the output variable. They can also be used to learn and derive insights from the observations in the dataset. They are denoted $X_i$, where $i$ indexes the $i$-th variable.
  • Output variable (response, dependent variable): the variable of interest, which we want to predict or draw inferences about from the input variables. It is denoted $Y$.

For example, in a House Price dataset, House Price is the output variable (the value we would like to predict or infer from the other variables), whereas House Age, House Area, and Number of Bedrooms are input variables.
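To make the split concrete, here is a minimal sketch in plain Python. The rows, the column names (`house_age`, `house_area`, `bedrooms`, `house_price`), and all numeric values are made up purely for illustration; they do not come from a real dataset.

```python
# Hypothetical rows of a house price dataset (values are invented for illustration).
rows = [
    {"house_age": 10, "house_area": 120.0, "bedrooms": 3, "house_price": 250_000},
    {"house_age": 3, "house_area": 85.5, "bedrooms": 2, "house_price": 210_000},
    {"house_age": 25, "house_area": 150.0, "bedrooms": 4, "house_price": 275_000},
]

input_names = ["house_age", "house_area", "bedrooms"]  # X_1, X_2, X_3
output_name = "house_price"                            # Y

# X holds one list of input values per observation; Y holds one output value each.
X = [[row[name] for name in input_names] for row in rows]
Y = [row[output_name] for row in rows]

print(X[0])  # inputs of the first observation
print(Y[0])  # output of the first observation
```

Each inner list of `X` corresponds to one observation, and each position within it corresponds to one input variable $X_i$.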

More generally, suppose that we observe a quantitative response $Y$ and $p$ predictors $X = (X_1, \dots, X_p)$. We can then assume that there is a relationship between the response and the predictors:

$$Y = f(X) + \varepsilon$$

Here, $f$ is a fixed but unknown function, and $\varepsilon$ is a random error term that has mean 0 and is independent of $X$.
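One way to get a feel for this model is to simulate from it. The sketch below assumes a particular linear $f$ and normally distributed noise; both are illustrative choices (in a real problem $f$ is unknown), and the names `f`, `simulate`, and `noise_sd` are hypothetical.

```python
import random
import statistics


def f(x):
    """Assumed 'true' function, chosen only for this simulation."""
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps ~ Normal(0, noise_sd^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]  # drawn independently of xs
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(n=10_000)

# The sample mean of the errors is close to 0, but not exactly 0 in a finite sample.
print(statistics.fmean(eps))
```

In practice we only get to see `xs` and `ys`; neither `f` nor `eps` is observed, which is why $f$ has to be estimated.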

Our goal is to estimate the unknown function $f$, since in practice we almost never know its true form.

1. Why estimate $f$

There are two reasons for estimating $f$:

  • Prediction: In many situations, the set of predictors $X$ can be easily obtained, but $Y$ cannot. We therefore need some way to estimate $Y$. Since $\varepsilon$ has mean zero, we can predict $Y$ using:
$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ represents our *estimate* of $f$, and $\hat{Y}$ is the resulting prediction of $Y$.
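As a rough illustration of the prediction step, the sketch below fits a straight line by least squares to simulated data and uses it as $\hat{f}$. Least squares is only one possible way to obtain $\hat{f}$ and is used here as an assumption, not as the method prescribed by the text; `true_f`, `fit_line`, and `f_hat` are hypothetical names.

```python
import random


def true_f(x):
    # Assumed for the simulation only; unknown in a real problem.
    return 2.0 + 3.0 * x


# Simulated training data: X is observed, Y = f(X) + eps.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [true_f(x) + rng.gauss(0.0, 1.0) for x in xs]


def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(xs, ys)


def f_hat(x):
    """Our estimate of f, used to predict Y when only X is available."""
    return b0 + b1 * x


y_hat = f_hat(4.0)  # prediction for a new observation with X = 4
print(b0, b1, y_hat)
```

The fitted coefficients should land near the assumed values 2 and 3, but not exactly on them, because the training data contain noise.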

The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on two quantities: the reducible error and the irreducible error.

  • reducible error: Since $\hat{f}$ is only an estimate of $f$, it will not be perfectly accurate. This error can be reduced by using a more appropriate statistical learning method to estimate $f$.
  • irreducible error: $Y$ is also a function of $\varepsilon$ (recall $Y = f(X) + \varepsilon$), a random variable with mean zero that cannot be predicted from $X$. Therefore, no matter how well we estimate $f$, we cannot reduce the error caused by $\varepsilon$.
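For a fixed estimate $\hat{f}$ and a fixed value of $X$, the two error sources combine as $E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)$, where the first term is the reducible error and the second is the irreducible error. The sketch below checks this numerically by simulation. The functions `f` and `f_hat`, the point `x0`, and the noise level are all assumptions chosen for illustration.

```python
import random


def f(x):
    return 2.0 + 3.0 * x  # assumed true function (unknown in practice)


def f_hat(x):
    return 2.5 + 2.8 * x  # hypothetical imperfect estimate, held fixed


noise_sd = 1.0  # standard deviation of eps, so Var(eps) = 1.0
x0 = 4.0        # fixed value of X at which we predict
rng = random.Random(42)

# Monte Carlo estimate of E[(Y - Y_hat)^2] at X = x0.
n = 200_000
total = 0.0
for _ in range(n):
    y = f(x0) + rng.gauss(0.0, noise_sd)
    total += (y - f_hat(x0)) ** 2
mse = total / n

reducible = (f(x0) - f_hat(x0)) ** 2  # shrinks as f_hat gets closer to f
irreducible = noise_sd ** 2           # Var(eps); unaffected by the choice of f_hat

print(mse, reducible + irreducible)   # the two values should be close
```

Replacing `f_hat` with a better estimate drives `reducible` toward zero, but the simulated mean squared error stays at or above `irreducible`.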