Analyzing Principal Components and Formulations
1. Which are the two most important principal components? How can you tell, using one of your visualizations? Does it seem reasonable to use only two principal components in the feature vector? Why or why not?
The two most important principal components are the first two (PC1 and PC2). PCA orders its components by the amount of variance each one explains, so the first two always capture more variance than any other pair. The bar plot of explained variance ratio shows this directly: the tallest bars belong to the most important components. Whether two components are enough depends on their cumulative explained variance. If the first two together explain a large share of the variance (a common rule of thumb is roughly 70-80% or more), a two-component feature vector is a reasonable summary. If a significant portion of the variance remains unexplained, additional components are needed to represent the data adequately.
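The cumulative-variance check described above can be sketched in a few lines. This is only an illustration: the ratios below are hypothetical placeholder values, not results from the actual data set, and the 80% threshold is the rule of thumb mentioned above rather than a fixed requirement. The helper name `components_needed` is also made up for this example.

```python
from itertools import accumulate

# Hypothetical explained variance ratios, one per principal component,
# sorted from largest to smallest (as they would appear in the bar plot).
explained_variance_ratio = [0.58, 0.24, 0.10, 0.05, 0.03]


def components_needed(ratios, threshold=0.80):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `threshold`, or None if it is never reached."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return None


# Running total of explained variance: entry k covers PC1..PC(k+1).
cumulative = list(accumulate(explained_variance_ratio))
n_needed = components_needed(explained_variance_ratio, threshold=0.80)

print(f"Cumulative variance of the first two PCs: {cumulative[1]:.2f}")
print(f"Components needed to reach 80%: {n_needed}")
```

With real data, the list of ratios would come from the fitted PCA (the same numbers that are plotted in the bar plot), and the threshold should be chosen to suit the application.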
2. Are there any apparent trends, correlations, or clusters among the recast data? Explain your answer(s). Are there any outliers?
These patterns can be read from the scatter plot of the data projected onto the two largest principal components. Points that group tightly together form a cluster, which suggests that those formulations have similar compositions or properties. Points that fall along a line or curve indicate a trend, meaning the formulations vary systematically along some combination of the original features. Outliers are formulations that sit far from every cluster and off any trend. Answering the question therefore requires inspecting the scatter plot, describing where the groups and trends lie, and noting any isolated points.
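Visual inspection can be backed up with a simple numeric screen for outliers. The sketch below flags points whose standardized distance from the centroid of the scores is large. The `scores` list is hypothetical, the cutoff of 2.0 is an arbitrary assumption, and the function name `flag_outliers` is invented for this example. A strong outlier also inflates the standard deviations used here, so this should be treated as a rough aid and not as a substitute for looking at the plot.

```python
import math
from statistics import mean, pstdev

# Hypothetical (PC1, PC2) scores for the existing formulations.
scores = [
    (-1.2, 0.4), (-0.8, 0.1), (-1.0, -0.3),
    (1.1, 0.2), (0.9, -0.4), (1.3, 0.5),
    (0.2, 0.0), (-0.5, -0.5), (6.0, 5.5),
]


def flag_outliers(points, z_threshold=2.0):
    """Return indices of points whose distance from the centroid, measured
    in per-axis standard deviations, exceeds `z_threshold`."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx, cy = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    flagged = []
    for index, (x, y) in enumerate(points):
        # Standardize each axis, then combine into a single distance.
        distance = math.hypot((x - cx) / sx, (y - cy) / sy)
        if distance > z_threshold:
            flagged.append(index)
    return flagged


outlier_indices = flag_outliers(scores)
print("Possible outliers:", [scores[i] for i in outlier_indices])
```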
3. Imagine now that you had a new formulation, and you plotted it on the same axes as in question 2 above. The point lands at (-6, 5) based on the two most important PCs. Would you characterize this formulation as similar to, or very different from, the existing formulations? Explain your answer.
The answer depends on where (-6, 5) falls relative to the existing formulations in the scatter plot. Distance in PC space is a proxy for dissimilarity: a point that lands inside or next to an existing cluster, or along an existing trend, is similar to those formulations, while a point that is far from every cluster and trend is very different from them. To classify the new formulation, compare (-6, 5) with the range of PC1 and PC2 values covered by the existing points. If it lies within or near that range and close to a cluster, the formulation is similar. If it lies well outside the range, it should be characterized as very different.
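The comparison described above can also be made numerically. The sketch below measures how far the new point is from its nearest existing formulation and compares that with the largest nearest-neighbour gap among the existing points. The existing scores are hypothetical stand-ins for the real scatter plot, and the factor of 3 used to call a point "very different" is an arbitrary assumption, so the printed verdict applies only to this made-up data.

```python
import math

# Hypothetical (PC1, PC2) scores for the existing formulations.
existing_scores = [
    (-1.2, 0.4), (-0.8, 0.1), (-1.0, -0.3),
    (1.1, 0.2), (0.9, -0.4), (1.3, 0.5),
    (0.2, 0.0), (-0.5, -0.5),
]
new_point = (-6.0, 5.0)  # the new formulation from the question


def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest point in `others`."""
    return min(math.dist(point, other) for other in others)


def largest_neighbour_gap(points):
    """Largest nearest-neighbour distance among the points themselves."""
    return max(
        nearest_distance(p, points[:i] + points[i + 1:])
        for i, p in enumerate(points)
    )


d_new = nearest_distance(new_point, existing_scores)
gap = largest_neighbour_gap(existing_scores)
ratio = d_new / gap

# Assumed cutoff: more than 3x the largest existing gap counts as "very different".
looks_different = ratio > 3.0
print(f"Nearest existing formulation is {d_new:.2f} away ({ratio:.1f}x the largest gap)")
print("Very different" if looks_different else "Similar")
```

Using nearest-neighbour distance, and not distance to the overall centroid, keeps the check meaningful when the existing formulations form several separate clusters.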