In a dataset, each measurable property (which usually appears in a column) is a variable. A dataset includes many variables, which can be categorized into:
- Input variables (features, predictors, independent variables): used to infer or predict the output variable. Moreover, they can be used to learn about and derive insights from the observations in the dataset. These variables are denoted as $X_i$, where $i$ indexes the $i$-th input variable.
- Output variable (response, dependent variable): the variable whose value we want to infer, or even predict exactly, from the input variables. It is denoted as $Y$.
For example, consider a House Price dataset. House Price is the output variable (we would like to predict/infer its value from the other variables), whereas House Age, House Area, and Number of Bedrooms are input variables.
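As a minimal sketch of this split in Python (the column names and all values are made up for illustration):

```python
import pandas as pd

# A tiny, made-up House Price dataset (values are illustrative only)
data = pd.DataFrame({
    "HouseAge":   [5, 12, 30, 8],        # years
    "HouseArea":  [120, 85, 200, 150],   # square meters
    "Bedrooms":   [3, 2, 4, 3],
    "HousePrice": [250, 180, 400, 310],  # thousands of dollars
})

# Input variables (features / predictors): X_1, ..., X_p
X = data[["HouseAge", "HouseArea", "Bedrooms"]]

# Output variable (response): Y
y = data["HousePrice"]
```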
More generally, suppose that we observe a quantitative response $Y$ and $p$ predictors $X_1, X_2, \ldots, X_p$. We can then assume that there is a relationship between the response and the predictors, written in the general form:

$$Y = f(X) + \epsilon$$

Here, $f$ is a fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which has a mean of 0 and is independent of $X$.
Our goal is to estimate the unknown function $f$, since most of the time we cannot obtain the true function exactly.
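To make this setup concrete, here is a small simulation, a sketch under assumed choices (the linear form of $f$, the noise scale, and the least-squares fit are all invented for illustration): we generate data from a known $f$ plus noise, then estimate $f$ from the data alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true function f: unknown in practice, chosen here so we can
# compare it against our estimate
def f(x):
    return 2.0 * x + 1.0

# Generate observations from Y = f(X) + epsilon
n = 200
x = rng.uniform(0, 10, size=n)
eps = rng.normal(loc=0.0, scale=1.0, size=n)  # error term: mean 0, independent of x
y = f(x) + eps

# Estimate f from the data alone (here: a least-squares line fit)
slope, intercept = np.polyfit(x, y, deg=1)

print("true f:      y = 2.00*x + 1.00")
print(f"estimated f: y = {slope:.2f}*x + {intercept:.2f}")
```

With enough observations, the estimated coefficients land close to the true ones, but never exactly on them: the noise guarantees some estimation error.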
1. Why estimate $f$?
There are two reasons for estimating $f$:
- Prediction: In many situations, the set of predictors $X$ can be easily obtained, but the output $Y$ cannot. Therefore, we must find some way to obtain, or at least estimate, $Y$. Since $\epsilon$ has a mean of zero, we can predict $Y$ using:

$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ represents our estimate of $f$, and $\hat{Y}$ is our resulting estimate of $Y$.
The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on two quantities: the reducible error and the irreducible error.
- Reducible error: since $\hat{f}$ is only an estimate of $f$, it will introduce some error. This error is reducible, because we can improve the accuracy of $\hat{f}$ by using more appropriate statistical learning methods.
- Irreducible error: $Y$ also depends on the error term $\epsilon$, a random variable with a mean of zero that cannot be predicted from $X$. Therefore, no matter how well we estimate $f$, we cannot reduce the error caused by $\epsilon$; this is formalized in the decomposition below.
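To see why the two errors separate, consider the expected squared prediction error. Treating $\hat{f}$ and $X$ as fixed, and using the fact that $\epsilon$ has mean zero:

$$
E\big[(Y - \hat{Y})^2\big] = E\big[(f(X) + \epsilon - \hat{f}(X))^2\big] = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}
$$

The cross term vanishes because $\epsilon$ has mean zero and is independent of $X$, so the irreducible term $\operatorname{Var}(\epsilon)$ is a hard floor on the expected prediction error.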
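A quick simulation makes that floor visible. This is a minimal sketch under assumed values (a linear $f$, noise with $\operatorname{Var}(\epsilon) = 1$, and a deliberately imperfect $\hat{f}$, all invented for illustration): even predicting with the true $f$ leaves a mean squared error of about $\operatorname{Var}(\epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                          # true function (unknown in practice)
    return 2.0 * x + 1.0

def f_hat(x):                      # a deliberately imperfect estimate of f
    return 1.7 * x + 2.0

# Test data drawn from Y = f(X) + epsilon, with Var(epsilon) = 1
x_test = rng.uniform(0, 10, size=100_000)
eps = rng.normal(loc=0.0, scale=1.0, size=x_test.size)
y_test = f(x_test) + eps

mse_estimate = np.mean((y_test - f_hat(x_test)) ** 2)  # reducible + irreducible
mse_perfect = np.mean((y_test - f(x_test)) ** 2)       # irreducible only

print(f"MSE with imperfect f_hat: {mse_estimate:.3f}")  # ~2.0, well above the floor
print(f"MSE with the true f:      {mse_perfect:.3f}")   # ~1.0 = Var(epsilon), the floor
```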