The purpose of this project is to delve into the data science pipeline, all the way from Data Collection to Insight & Policy Decision. To accomplish this, we will be taking a look at COVID-19 data, specifically case, death, and vaccination data.
Vaccines have long been a subject of controversy, and the COVID-19 vaccine brought the issue to even greater prominence. Many people in the United States have questioned the effectiveness of the vaccine and suggested that it does more harm than good. In fact, the politics of vaccination have become an even stronger dividing line than demographics: surveys of the population have found that more than 90% of Democrats are vaccinated, compared to just 58% of Republicans, and 23% of Republicans say that they will "definitely not" be getting the vaccine.
Now that we have lived with COVID-19 for a few years, we have aggregated plenty of data here in the United States, and the goal of this tutorial is to determine if the data we have shows a connection between the number of COVID-19 vaccinations and the number of COVID-19 cases and deaths.
My hypothesis is as follows:
The increase in COVID-19 vaccinations over time will result in a decrease in the rate at which new cases of COVID-19 are recorded.
During this stage of the data science pipeline, we need to find data that will aid us in our testing and analysis. We will be using the pandas library to convert this online data into something that is usable for all of our data-science-related tasks.
# Data science libraries for python
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Extra functionality for jupyter notebook formatting
from IPython.display import display, HTML
There are two main sources of data and one supplementary source of data that we will need.
The first main source is a collection of COVID-19 case and death data from the CDC. This will supply us with all of the information we need about cases and deaths. More specifically, it contains aggregated totals for cases and deaths, the day on which each total was recorded, and the U.S. state it came from. We will convert this csv file into a pandas DataFrame named df_cd.
The second main source is a collection of COVID-19 vaccine data from USAFACTS. This contains everything we need to know about COVID-19 vaccine data over the last few years. This includes the number of partially vaccinated people, the number of fully vaccinated people, the date that each data entry was aggregated, the U.S. state that the data entry originates from, and population data for that state. We will convert this csv file into a pandas DataFrame named df_v.
The supplementary source of data is a list of U.S. states (and the District of Columbia) and their respective abbreviations. This data will come in handy during the data cleaning process: the U.S. state information in the two main tables does not match, and converting everything to abbreviations will make handling the data significantly more streamlined. We will convert this csv file into a pandas DataFrame named df_states.
df_cd = pd.read_csv('Weekly_United_States_COVID-19_Cases_and_Deaths_by_State.csv')
df_v = pd.read_csv('COVID19_CDC_Vaccination_CSV_Download.csv')
df_states = pd.read_csv('https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv')
The following code generates a 5-row preview of each data table that we just collected. As you can see, there is a heap of helpful data at our fingertips, but there is also a large amount of inconsistency between the tables, along with plenty of data we don't need. We will resolve this in the next stage of the data science pipeline: Data Processing.
# display() simply allows back-to-back previews of DataFrames, all from one jupyter codeblock
display(df_cd.head())
display(df_v.head())
display(df_states.head())
| | date_updated | state | start_date | end_date | tot_cases | new_cases | tot_deaths | new_deaths | new_historic_cases | new_historic_deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/23/2020 | AK | 01/16/2020 | 01/22/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 01/30/2020 | AK | 01/23/2020 | 01/29/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 02/06/2020 | AK | 01/30/2020 | 02/05/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 02/13/2020 | AK | 02/06/2020 | 02/12/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 02/20/2020 | AK | 02/13/2020 | 02/19/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| | DATE | GEOGRAPHY_LEVEL | GEOGRAPHY_NAME | DEMOGRAPHIC_GROUP | DEMOGRAPHIC_CATEGORY | PARTIALLY_OR_FULLY_VACCINATED_PERSONS | FULLY_VACCINATED_PERSONS | POPULATION | PARTIALLY_OF_FULLY_VACCINATED_PERCENT | FULLY_VACCINATED_PERCENT | TOTAL_DOSES_ADMINISTERED | TOTAL_DOSES_DISTRIBUTED | SOURCEINFO | USAFACTS_INGESTION_DATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-04-26 | Nation | United States | Total | Total | 270047396.0 | 230533196.0 | 332008832.0 | 0.813374 | 0.694359 | 675442636.0 | 979617855.0 | CDC State and National Vaccination Data | 2023-05-01T01:25:24.415Z |
| 1 | 2023-04-26 | State | California | Total | Total | 33596928.0 | 29580494.0 | 39512223.0 | 0.850292 | 0.748642 | 88296470.0 | 120437235.0 | CDC State and National Vaccination Data | 2023-05-01T01:25:24.415Z |
| 2 | 2023-04-26 | State | Arkansas | Total | Total | 2114069.0 | 1719511.0 | 3017804.0 | 0.700532 | 0.569789 | 4867436.0 | 8343140.0 | CDC State and National Vaccination Data | 2023-05-01T01:25:24.415Z |
| 3 | 2023-04-26 | State | Arizona | Total | Total | 5699178.0 | 4819349.0 | 7278717.0 | 0.782992 | 0.662115 | 14616367.0 | 19864230.0 | CDC State and National Vaccination Data | 2023-05-01T01:25:24.415Z |
| 4 | 2023-04-26 | State | Alabama | Total | Total | 3192073.0 | 2610908.0 | 4903185.0 | 0.651020 | 0.532492 | 7011237.0 | 12305740.0 | CDC State and National Vaccination Data | 2023-05-01T01:25:24.415Z |
| | State | Abbreviation |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
It is now time to clean up the data and prepare it for analysis. This will mainly involve identifying issues with the data and finding missing data, then deciding how to handle it.
First things first: the vaccination data has a lot of NULL data that we do not want. We can easily get rid of it by dropping any rows that contain NaN values.
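Before dropping anything, it can be helpful to see how much data is actually affected. This quick check is my addition and is not required for the rest of the tutorial:

# Optional: count how many rows of the vaccination data contain at least one NaN
print(df_v.isna().any(axis=1).sum(), 'of', len(df_v), 'rows contain a NaN value')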
df_v = df_v.dropna()
With the NaN rows removed, we'll start by examining the list of states that each dataset includes. To do so, I will print the unique values of the DataFrame column that contains the state.
print(df_cd.state.unique())
print()
print(df_v.GEOGRAPHY_NAME.unique())
['AK' 'AL' 'AR' 'AS' 'AZ' 'CA' 'CO' 'CT' 'DC' 'DE' 'FL' 'FSM' 'GA' 'GU' 'HI' 'IA' 'ID' 'IL' 'IN' 'KS' 'KY' 'LA' 'MA' 'MD' 'ME' 'MI' 'MN' 'MO' 'MP' 'MS' 'MT' 'NC' 'ND' 'NE' 'NH' 'NJ' 'NM' 'NV' 'NY' 'NYC' 'OH' 'OK' 'OR' 'PA' 'PR' 'PW' 'RI' 'RMI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VA' 'VI' 'VT' 'WA' 'WI' 'WV' 'WY']

['United States' 'California' 'Arkansas' 'Arizona' 'Alabama' 'Delaware' 'Colorado' 'Alaska' 'Connecticut' 'Georgia' 'Idaho' 'Hawaii' 'Florida' 'Illinois' 'Iowa' 'District of Columbia' 'Indiana' 'Kansas' 'Maine' 'Kentucky' 'Louisiana' 'Michigan' 'Maryland' 'Minnesota' 'Mississippi' 'Massachusetts' 'Missouri' 'Nebraska' 'Montana' 'New Hampshire' 'New York State' 'Nevada' 'New Mexico' 'New Jersey' 'North Dakota' 'North Carolina' 'Oregon' 'Oklahoma' 'Ohio' 'Pennsylvania' 'South Carolina' 'Rhode Island' 'Texas' 'Tennessee' 'Utah' 'South Dakota' 'Vermont' 'Virginia' 'Washington' 'Wyoming' 'Wisconsin' 'West Virginia' 'Northern Mariana Islands' 'Virgin Islands' 'Puerto Rico' 'Guam' 'American Samoa' 'Federated States of Micronesia' 'Marshall Islands' 'Indian Health Svc' 'Republic of Palau']
There are already a couple of issues that jump out at us. First of all, the two datasets are not in the same format: one uses abbreviations, and the other lists the full proper name. For the sake of simplicity and convenience, we will convert the proper names to abbreviations.
Before we can do that, we need to solve the other big issue, which is the presence of categories and territories that are not one of the 50 U.S. states. As it turns out, there is COVID-19 data collected for U.S. territories like Puerto Rico or the Federated States of Micronesia. However, none of the usual states are missing, so we will have more than enough data to work with if we simply focus on the 50 official U.S. states and Washington DC. To do so, we will only include states that have a match in our abbreviation mapping table df_states.
BUT WAIT! One of the states has something strange going on: New York. First of all, in df_v it is represented as "New York State" instead of "New York", as it is in df_states. We will fix this through a simple pandas replacement. In addition, df_cd has both "NY" and "NYC". The dataset's website offers an explanation for this discrepancy: "New York State’s reported case and death counts do not include New York City’s counts as they separately report nationally notifiable conditions to CDC." Since New York and New York City are represented as separate jurisdictions in one dataset, we will need to combine them under just "NY".
# Rename so that it can interact with df_states
df_v = df_v.replace({'GEOGRAPHY_NAME': 'New York State'}, 'New York')
# Make everything classified as NYC just show up as NY. Note: this means that NY has 2 entries for each week instead of 1.
df_cd = df_cd.replace({'state': 'NYC'}, 'NY')
# Exclude everything except data for states in df_states
df_cd = df_cd[df_cd.state.isin(df_states.Abbreviation)]
df_v = df_v[df_v.GEOGRAPHY_NAME.isin(df_states.State)]
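Note that because the former "NY" and "NYC" rows now carry the same label, each week still has two rows for New York. The tutorial keeps them as-is, but if you would rather have a single weekly row per state, a minimal sketch of one way to collapse them (summing only the count columns) would be:

# Optional sketch: collapse the duplicate weekly New York rows by summing their counts
count_cols = ['tot_cases', 'new_cases', 'tot_deaths', 'new_deaths',
              'new_historic_cases', 'new_historic_deaths']
df_cd_single_ny = df_cd.groupby(['end_date', 'state'], as_index=False)[count_cols].sum()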
The vaccination data provides extra entries that split the data by demographics. For the sake of this tutorial, we are not interested in the demographics, so we will remove them.
df_v = df_v[(df_v.DEMOGRAPHIC_GROUP == 'Total') & (df_v.DEMOGRAPHIC_CATEGORY == 'Total')]
Some renaming, reorganizing, and dropping is required in order to clean up the data.
For df_cd, the only date column we need is end_date, since we only care about the aggregated totals reported at the end of each week.
For df_v, we only need each state's population and how many people were partially or fully vaccinated. We will remove everything else.
# Drop columns that are unnecessary for analysis
df_cd = df_cd.drop(['date_updated', 'start_date'], axis=1)
df_v = df_v.drop(['GEOGRAPHY_LEVEL', 'DEMOGRAPHIC_GROUP', 'DEMOGRAPHIC_CATEGORY', 'TOTAL_DOSES_ADMINISTERED', 'TOTAL_DOSES_DISTRIBUTED', 'SOURCEINFO', 'USAFACTS_INGESTION_DATE'], axis=1)
# Rename columns for clarity and consistency
df_cd = df_cd.rename(columns={'end_date': 'date'})
df_cd = df_cd[['date', 'state', 'tot_cases', 'new_cases', 'tot_deaths', 'new_deaths', 'new_historic_cases', 'new_historic_deaths']]
df_v = df_v.rename(columns={'DATE': 'date',
'GEOGRAPHY_NAME': 'state',
'PARTIALLY_OR_FULLY_VACCINATED_PERSONS': 'partial_or_full_vaccination',
'FULLY_VACCINATED_PERSONS': 'full_vaccination',
'POPULATION': 'pop',
'PARTIALLY_OF_FULLY_VACCINATED_PERCENT': 'partial_or_full_vaccination_pct',
'FULLY_VACCINATED_PERCENT': 'full_vaccination_pct'})
Finally, we can remap the full proper state names in the vaccination data to their respective abbreviations outlined in df_states.
# Convert every proper state name to the abbreviation in the same row for each state in df_states
df_v.state = df_v.state.map(df_states.set_index('State').Abbreviation)
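Because Series.map leaves NaN wherever a name has no match in the mapping, a quick sanity check (my addition) confirms that every remaining state name was converted successfully:

# Every state name should have found an abbreviation; unmatched names would be NaN here
assert df_v.state.notna().all()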
It is good practice to convert any dates to datetime objects. This facilitates things like sorting, resampling, and aggregation over periods of time.
df_cd.date = pd.to_datetime(df_cd.date)
df_v.date = pd.to_datetime(df_v.date)
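As a small illustration of why this matters (this example is not used later in the tutorial), a datetime index makes time-based aggregation a one-liner:

# Example only: nationwide monthly totals of new cases, enabled by the datetime column
monthly_cases = df_cd.set_index('date')['new_cases'].resample('M').sum()
print(monthly_cases.head())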
Now that both tables are organized and clean, we can combine them. We will use an inner merge, because we only want to consider days for which we have all the data we need for our analysis.
NOTE: This will remove a lot of data from df_v, because df_v has daily information while df_cd has weekly information, and we are only keeping the days the two have in common. In our case, this is exactly what we want, but great caution should be taken during the merging process to ensure that you do not lose any essential data.
df = pd.merge(df_cd, df_v, 'inner', on=['date', 'state'], copy=True)
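If you want to see exactly what an inner merge would discard, one option (a sketch, not part of the analysis) is to run the same merge as an outer join with indicator=True and count where each row came from:

# Rows marked 'both' survive the inner merge; 'left_only' and 'right_only' rows are dropped
merge_check = pd.merge(df_cd, df_v, 'outer', on=['date', 'state'], indicator=True)
print(merge_check['_merge'].value_counts())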
And now to see what our clean DataFrame looks like...
df.head()
| | date | state | tot_cases | new_cases | tot_deaths | new_deaths | new_historic_cases | new_historic_deaths | partial_or_full_vaccination | full_vaccination | pop | partial_or_full_vaccination_pct | full_vaccination_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-20 | AK | 51178 | 1377 | 253 | 27 | 0 | 0 | 56911.0 | 11283.0 | 731545.0 | 0.077796 | 0.015424 |
| 1 | 2021-02-03 | AK | 53305 | 946 | 279 | 18 | 0 | 0 | 98076.0 | 27528.0 | 731545.0 | 0.134067 | 0.037630 |
| 2 | 2021-02-10 | AK | 54294 | 989 | 280 | 1 | 0 | 0 | 112448.0 | 43552.0 | 731545.0 | 0.153713 | 0.059534 |
| 3 | 2021-02-17 | AK | 55101 | 807 | 288 | 8 | 0 | 0 | 130751.0 | 62474.0 | 731545.0 | 0.178733 | 0.085400 |
| 4 | 2021-02-24 | AK | 55986 | 885 | 289 | 1 | 0 | 0 | 156027.0 | 87061.0 | 731545.0 | 0.213284 | 0.119010 |
Fantastic! Everything is clear, legible, and easy to work with.
To make things even better, we don't have any lingering NaN data:
# Display any row that contains a NaN value
df[df.isna().any(axis=1)]
| date | state | tot_cases | new_cases | tot_deaths | new_deaths | new_historic_cases | new_historic_deaths | partial_or_full_vaccination | full_vaccination | pop | partial_or_full_vaccination_pct | full_vaccination_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|

(empty result: no rows contain NaN values)
It's time to see what our data actually looks like!
# Create 4 subplots, displayed in a 2 by 2 layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20,8))
# Group by state, plotting each state as its own series of points
for key, group in df.groupby('state'):
    ax1.scatter(group.date, group.tot_cases, 6)
    ax2.scatter(group.date, group.new_cases, 6)
    ax3.scatter(group.date, group.partial_or_full_vaccination, 6)
    ax4.scatter(group.date, group.partial_or_full_vaccination_pct, 6)
# Label each subplot
ax1.set(title='State-separated Total COVID-19 Cases over time', ylabel='Number of cases')
ax2.set(title='State-separated New Weekly COVID-19 Cases over time', ylabel='Number of cases')
ax3.set(title='State-separated Total COVID-19 Vaccinations over time', ylabel='Number of vaccinations')
ax4.set(title='State-separated Total COVID-19 Percent Vaccinations over time', ylabel='Number of vaccinations as percent of population')
plt.show()
What are we actually looking at in these graphs?
Top-left graph: The number of total COVID-19 cases in each state from 2021 to the present. Each state is represented by a different color, and each individual observation is represented by a point on the scatterplot.
Top-right graph: The number of COVID-19 cases in each state from 2021 to the present, represented by a weekly total. Each state is represented by a different color, and each individual observation is represented by a point on the scatterplot. Notice how the data is much more erratic with highs and lows, compared to the top-left graph which only increases as time goes on (you can't un-count a COVID-19 case).
Bottom-left graph: The number of administered COVID-19 vaccinations in each state. Each state is represented by a different color.
Bottom-right graph: The number of administered COVID-19 vaccinations in each state, divided by the population of the state. Each state is represented by a different color. Notice how when you account for the population of the state, the data is more clustered together. It is important to recognize properties such as this, as external factors that you are not even considering for your model can have a major impact on your model's results.
Just by looking at the graphs, we can gain a lot of valuable information. First, the heights of the data points differ greatly between states in the graphs that do not account for population, but the curves for the different states all follow the same general shape: they have peaks, dips, and plateaus in the same places. This indicates that the state in which the data was collected does not play a major part in the distribution of either cases or vaccinations. That is to say: using the state as a variable in the machine learning model will not generate fruitful results.
Instead, we can vastly simplify our task by taking the mean of the data at each date. When we do this, the graphs change as follows:
# Take averages of each column, grouped by date
df_avg = df.groupby('date', as_index=False)[['tot_cases', 'new_cases', 'partial_or_full_vaccination', 'partial_or_full_vaccination_pct']].mean()
# 2 by 2 grid of plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20,8))
ax1.scatter(df_avg.date, df_avg.tot_cases, 6)
ax1.set(title='Total COVID-19 Cases over time', ylabel='Number of cases')
ax2.scatter(df_avg.date, df_avg.new_cases, 6)
ax2.set(title='New Weekly COVID-19 Cases over time', ylabel='Number of cases')
ax3.scatter(df_avg.date, df_avg.partial_or_full_vaccination, 6)
ax3.set(title='Total COVID-19 Vaccinations over time', ylabel='Number of vaccinations')
ax4.scatter(df_avg.date, df_avg.partial_or_full_vaccination_pct, 6)
ax4.set(title='Total COVID-19 Percent Vaccinations over time', ylabel='Number of vaccinations as percent of population')
plt.show()
Now the graphs are much less cluttered and easier to understand, and we didn't lose the distributions' unique shapes.
Now, what we all have been waiting for... the Model!
First we need to ask: What are we actually trying to test? If we look back at our hypothesis, we want to know whether or not an increase in COVID-19 vaccinations coincides with a decrease in new COVID-19 cases.
It looks like we have the two variables we need to test ready to go: new weekly cases (df_avg.new_cases) and total vaccinations over time (df_avg.partial_or_full_vaccination).
But before we create the machine learning model, let's take another look at the data.
fig, ax = plt.subplots(1, 1)
plt.scatter(df_avg.partial_or_full_vaccination, df_avg.new_cases, 6)
ax.set(title='Number of Vaccinations vs. Number of New Weekly Cases', xlabel='Number of vaccinations', ylabel='Number of cases')
plt.show()
This plot shows both variables of interest, and it looks... interesting. It is difficult to discern any specific correlation just by looking at the graph as-is. Nevertheless, we will plug it into our model.
The model itself is a Linear Regression model from the scikit-learn library (sklearn). Scikit-learn offers many different machine learning models to experiment with, and you can learn more about them and about how each model works under the hood by reading the documentation.
The Linear Regression in this example uses ordinary least squares: the loss is the residual sum of squares, and that is the quantity the regression works to minimize.
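Concretely, for a single feature $x$ the model chooses the intercept $\beta_0$ and slope $\beta_1$ that minimize the residual sum of squares:

$$\min_{\beta_0,\,\beta_1} \; \sum_{i} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$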
To fit the model, we randomly split the data. 75% of the data is used to train the model, and 25% is used to test the model.
Scoring of the data is described in the scikit learn documentation:
The coefficient of determination is defined as $(1-\frac{u}{v})$, where $u$ is the residual sum of squares and $v$ is the total sum of squares.
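Written out in terms of the test observations $y_i$, their predictions $\hat{y}_i$, and their mean $\bar{y}$, that is:

$$R^2 = 1 - \frac{u}{v}, \qquad u = \sum_{i}\left(y_i - \hat{y}_i\right)^2, \qquad v = \sum_{i}\left(y_i - \bar{y}\right)^2$$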
# Identifying our data: the population-normalized vaccination percentage is the feature (X), new weekly cases the target (y)
X = np.array(df_avg.partial_or_full_vaccination_pct).reshape(-1, 1)
y = np.array(df_avg.new_cases).reshape(-1, 1)
# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Creating the linear regression model and fitting it to our data
reg = LinearRegression()
reg.fit(X_train, y_train)
# Displaying the scatterplot from above with the line of best fit overlayed
plt.scatter(X, y, 6)
plt.plot(X, reg.predict(X), label='y = {:.2f}x + {:.2f}'.format(reg.coef_[0][0], reg.intercept_[0]))
plt.title('Number of Vaccinations vs. Number of New Weekly Cases')
plt.xlabel('Number of vaccinations')
plt.ylabel('Number of cases')
plt.legend()
plt.show()
# Model score
print('Score: ' + str(reg.score(X_test, y_test)))
Score: 0.003763777326642126
We have our results! But how do we interpret them?
First we'll look at the line. Our coefficient is positive, but small relative to the scale of the data. This means that the Linear Regression model found only a very slight positive correlation between the number of total vaccinations and the number of new cases.
Oh no! That conflicts with our hypothesis! Were we wrong? Maybe.
Before we discuss that, let's look at the score...
...and it's more bad news. A score of 1.0 is the best possible, and scores get worse the lower they go: a value this close to 0 means the model explains almost none of the variation in new cases, and a negative score indicates a very poor fit, worse than simply predicting the mean.
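To make the score more concrete, here is the same quantity computed by hand on the test set (a sketch; because train_test_split shuffles randomly, the exact number will vary from run to run unless a fixed random_state is passed):

# Reproduce reg.score(X_test, y_test) directly from its definition
y_pred = reg.predict(X_test)
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
print('Manual R^2:', 1 - u / v)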
A residual plot confirms the bad news, as the residuals do not cluster around the x-axis the way they would for a well-fitted model.
# Residuals calculated by taking the difference between the actual y values and the predicted y values
plt.scatter(X, y - reg.predict(X), 6)
plt.plot(X, np.zeros((len(X), 1)))  # horizontal reference line at y = 0 (avoids hard-coding the row count)
plt.title('Residual Plot of Number of Vaccinations vs. Number of New Weekly Cases')
plt.xlabel('Number of vaccinations')
plt.ylabel('Residuals of number of cases')
plt.show()
Let's try something else. It is pretty clear that there is a massive spike in new cases between November 2021 and April 2022. What if we treat that data as an outlier and remove it from the regression analysis?
# Used to select only rows which lie outside of this date range
datefilter = (df_avg.date < '2021-11-01') | (df_avg.date > '2022-04-01')
fig, ax = plt.subplots(1, 1)
plt.scatter(df_avg.partial_or_full_vaccination[datefilter], df_avg.new_cases[datefilter], 6)
ax.set(title='Number of Vaccinations vs. Number of New Weekly Cases', xlabel='Number of vaccinations', ylabel='Number of cases')
plt.show()
The data with the "outlier" date range cut out is shown above. Now to rebuild the model.
# Same modeling process as before, using the filtered data
X = np.array(df_avg.partial_or_full_vaccination_pct[datefilter]).reshape(-1, 1)
y = np.array(df_avg.new_cases[datefilter]).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
reg = LinearRegression()
reg.fit(X_train, y_train)
plt.scatter(X, y, 6)
plt.plot(X, reg.predict(X), label='y = {:.2f}x + {:.2f}'.format(reg.coef_[0][0], reg.intercept_[0]))
plt.title('Number of Vaccinations vs. Number of New Weekly Cases')
plt.xlabel('Number of vaccinations')
plt.ylabel('Number of cases')
plt.legend()
plt.show()
print('Score: ' + str(reg.score(X_test, y_test)))
Score: -0.07761144181084223
Look at that! Our line of best fit now slopes downward, and the negative coefficient indicates a negative correlation, just as we predicted! Does that mean we fixed our model?
Definitely not.
First of all, take a look at the score: it is even worse! You can tell very clearly from the graph that the points do not follow a straight line that looks anything like our line of best fit. No matter how tempting, we can't just remove data because doing so seems to improve our results. The massive spike in COVID-19 cases is a very important feature of the data, even if our model can't explain why it's there. People in the real world probably care about an enormous spike in cases of COVID-19, don't you think?
We can try to manipulate the data all we want, but nothing we do will change the fact that a linear model is a poor fit for this set of data.
Our model was very poor. Does this mean all of our work was for nothing? Does getting vaccinated not do anything to prevent COVID-19?
Our work was NOT a waste. Despite our model's results, we DO NOT have the information to conclude anything about the second question. The only things we can conclude are that our model was not a good fit for our data and that there is NOT a strong linear correlation between the two specific variables we tested. The fact of the matter is that COVID-19 is an extremely complicated phenomenon that cannot be boiled down to these two variables. A significant number of factors influence whether or not a person gets infected with COVID-19.
However, this is not a dead end! All we did was rule out something that does not work well. Perhaps there is a non-linear machine learning model that fits this data quite nicely and reveals astonishing trends (one such experiment is sketched below). Data science is a constantly evolving, dynamic process, not a strict assembly line. We can always go back and forth and try new strategies, new data, new models, and so on.
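As one illustration of what trying a different model could look like, here is a minimal sketch that swaps the straight line for a polynomial curve, reusing the most recent train/test split. The degree is an arbitrary choice for demonstration, not a claim that a polynomial is the right model for this data:

# Sketch: fit a degree-3 polynomial instead of a straight line
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_reg.fit(X_train, y_train)
print('Polynomial model score:', poly_reg.score(X_test, y_test))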
Our model may not be able to give us many definitive answers, but it does give us insight into the nature of the numbers surrounding COVID-19 and the kind of effort it takes to analyze the true impact that COVID-19 has had on our lives. Now that you have followed this tutorial from start to finish, you have traversed the entire data science pipeline. Congratulations! I hope you learned everything you wanted to learn, and I hope this piqued your interest in the world of data science.
If you're looking for more, try some of these sources!
hevodata.com's explanation of the data science pipeline
GeeksforGeeks' comprehensive guide to data science with Python
The impact of vaccination on COVID-19 outbreaks in the United States - Moghadas, Seyed M., et al.