COVID-19 Forecast
DATA 200 graduate project
US weekly COVID-19 forecast with engineered features
Analysis Source Code Notebook (Colab)
The COVID-19 pandemic has created the largest public health crisis in decades. Since the outbreak, there has been tremendous interest in forecasting confirmed cases and death tolls, and in predicting the course of the pandemic, so as to better inform public health policy. In this project, we use a publicly available data repository of U.S. COVID-19 statistics from 2020 and 2021 to build a model that forecasts the death toll in each state for the following week. We engineered and selected time-lagged features, including features that capture how the virus spreads geographically between states, derived with a Vector Autoregression model and with a proximity method, and experimented with several models, including LASSO, Elastic Net, Random Forest, and Gradient Boosting Tree. In particular, we built a Ridge Regression model that achieves a cross-validation R squared of 94% and offers informative interpretations of the features contributing to the forecast. We hope that our model can assist in predicting the course of the pandemic.
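As a rough illustration of the approach described above, the sketch below builds time-lagged features from a single weekly series, fits a ridge regression in closed form, and scores it with cross-validated R squared. It is not the project's code: the function names (`make_lagged`, `fit_ridge`, `time_series_cv_r2`), the synthetic series, the number of lags, the penalty `alpha`, and the forward-chaining split scheme are all assumptions made for this example, and it omits the per-state and geographic-spread features.

```python
import numpy as np


def make_lagged(series, n_lags):
    """Return (X, y) where row t of X holds the n_lags values preceding y[t]."""
    series = np.asarray(series, dtype=float)
    # Column k holds the value observed (k + 1) weeks before the target week.
    X = np.column_stack(
        [series[n_lags - k - 1 : len(series) - k - 1] for k in range(n_lags)]
    )
    y = series[n_lags:]
    return X, y


def fit_ridge(X, y, alpha):
    """Closed-form ridge fit; the intercept is left unpenalized via centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def time_series_cv_r2(X, y, alpha, n_splits=3):
    """Forward-chaining CV: train on the past, test on the next block of weeks."""
    fold = len(y) // (n_splits + 1)
    scores = []
    for i in range(1, n_splits + 1):
        train = slice(0, i * fold)
        test = slice(i * fold, (i + 1) * fold)
        coef, intercept = fit_ridge(X[train], y[train], alpha)
        scores.append(r2(y[test], X[test] @ coef + intercept))
    return float(np.mean(scores))


# Synthetic weekly death counts (illustrative only, not real data).
rng = np.random.default_rng(0)
deaths = [100.0]
for _ in range(79):
    deaths.append(0.8 * deaths[-1] + 20.0 + rng.normal(scale=5.0))

X, y = make_lagged(deaths, n_lags=3)
score = time_series_cv_r2(X, y, alpha=1.0)
print(f"mean cross-validated R^2: {score:.3f}")
```

The forward-chaining split is used here because shuffling weeks at random would let the model train on data from after the week it is asked to predict; whether the project used this scheme or another is not stated above.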
Below are our presentation slides and report.