Model Training and Evaluation Process
It’s modeling time! Now that you’ve analyzed your data and have a better understanding of its behavior, it’s time to try modeling your dataset and assessing the training error.
Summarize the results of your analysis in an Engineering Memo. Address at least the following:
- What is the goal of your model? (i.e., what exactly would you like your model to be able to predict?)
- Using no more than 75% of your collected data, train your model following the procedure you learned in Module 10. Visualize the training procedure by plotting the training error (the error in the target prediction versus the number of iterations). You will likely end up with something like the following:
If you undertrain your model, you will have significant training error and high bias: the model is too simple to capture the data it has already seen, and the model itself is not very variable. As you train further or add model variance, the training error will decrease. However, adding too much flexibility, or training on too much similar data, eventually produces a model that cannot reliably predict any one specific target data point. Your goal, then, is to hit the "sweet spot" between these two extremes, where you get the best agreement (minimal training error).
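The exact Module 10 procedure will depend on your model, but the training-error curve can be sketched as follows. This is a minimal illustration, assuming a simple linear model y ≈ a·x + b fit by gradient descent on hypothetical data standing in for your own collection; the recorded `errors` list is what you would plot against the iteration number:

```python
import numpy as np

# Hypothetical dataset: replace x and y with your collected data.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Hold back 25% of the data for later testing (Part 2).
n_train = int(0.75 * x.size)
x_train, y_train = x[:n_train], y[:n_train]

# Fit y = a*x + b by gradient descent, logging training error each iteration.
a, b = 0.0, 0.0
lr = 0.02
errors = []
for _ in range(500):
    pred = a * x_train + b
    resid = pred - y_train
    errors.append(np.mean(resid ** 2))       # mean-squared training error
    a -= lr * 2 * np.mean(resid * x_train)   # gradient step for the slope
    b -= lr * 2 * np.mean(resid)             # gradient step for the intercept
```

Plotting `errors` against `range(len(errors))` (e.g., with `matplotlib.pyplot.plot`) gives the training-error curve: it should fall steeply at first and then flatten as the fit approaches the "sweet spot."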
- Provide the finalized version of the model, including its full functional form and the values of all coefficients, formatted as an equation. Label each independent variable and explain its relevance. Generate a finalized residuals scatter plot as well as a histogram. Do the residuals appear to be Gaussian-distributed?
- What kinds of data/model biases are present in your finalized model? Explain their relevance.
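The residuals scatter plot and histogram requested above can be produced along these lines. This is a minimal sketch with hypothetical data standing in for your fitted model; the numeric checks at the end are one rough way to judge whether the residuals look Gaussian:

```python
import numpy as np

# Hypothetical example: residuals from a fitted linear model y ~ a*x + b.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Closed-form least-squares fit; polyfit returns [slope, intercept].
a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)

# Residuals scatter: plot residuals vs. x (or vs. predicted values) and look
# for structure; a random band around zero supports the chosen model form.
# Histogram: bin the residuals and compare the shape to a bell curve.
counts, edges = np.histogram(residuals, bins=15)

# Quick numeric checks for Gaussian-like behavior:
mean = residuals.mean()  # ~zero by construction for a fit with an intercept
frac_within_2sd = np.mean(np.abs(residuals - mean) < 2 * residuals.std())
# For Gaussian residuals, roughly 95% fall within two standard deviations.
```

Passing `residuals` to `matplotlib.pyplot.scatter` (against `x`) and `matplotlib.pyplot.hist` produces the two requested figures.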
Part 2:
At this point, you've trained your model using a subset of your collected data. However, the point of creating a model is not just to be descriptive (i.e., to explain existing data) but to be predictive: to take new data and see whether the model can reproduce it. You have actually already done this, a little bit, last week during model training. Now, though, you will test your model against a larger dataset, namely the remaining data held back from training, and assess the overall predictive capabilities of your model.
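The Part 2 assessment can be sketched as follows, again assuming the same hypothetical linear model: fit only on the first 75% of the data, predict the held-back 25%, and score the predictions with the test mean-squared error and the coefficient of determination R²:

```python
import numpy as np

# Hypothetical example: fit on the first 75% of the data, then assess
# predictive skill on the remaining 25% the model never saw.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

n_train = int(0.75 * x.size)
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

a, b = np.polyfit(x_train, y_train, 1)  # train only on the training split
y_pred = a * x_test + b                 # predict the held-back data

test_mse = np.mean((y_test - y_pred) ** 2)

# R^2 on the test split: fraction of held-back variance the model explains.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

A test MSE close to the final training error, together with an R² near 1, indicates the model generalizes; a much larger test MSE suggests overfitting to the training split.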