# COVID-19 in India

A detailed time series and statistical analysis of COVID-19 in India

COVID-19, the disease caused by a novel coronavirus, is currently a major worldwide threat. It has infected more than five million people globally, leading to hundreds of thousands of deaths. In such grave circumstances, it is very important to predict future infections, both to support prevention of the disease and to help healthcare services prepare. Following that notion, we have developed a model and employed it for a time-series analysis of COVID-19 cases in India. The study indicates an ascending trend in cases over the coming days, and the time-series analysis shows an exponential increase in the number of cases.

This post gives an overview of the current coronavirus situation in India, using crowdsourced data taken from the repositories of https://www.covid19india.org/. A big thanks to them for providing these data.

Most of the APIs used in this post were written in Flask and hosted on AWS. Their code can be found in the GitHub repository linked at the end of this post.

First, we will start by importing the necessary libraries.

`import requests`

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

Now, let us call the APIs: the public covid19india.org endpoint and the custom ones hosted on AWS.

response = requests.get("https://api.covid19india.org/data.json")

data1 = response.json()

response1 = requests.get("https://thecampfire.in/api/daily_state_count")

data2 = response1.json()
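The calls above assume that each endpoint responds successfully. A slightly more defensive variant is sketched below: it adds a timeout and raises on HTTP errors before decoding the body. The helper name `fetch_json`, the optional `session` argument and the 10-second timeout are illustrative choices, not part of the original code.

```python
import requests


def fetch_json(url, timeout=10, session=None):
    """Fetch a URL and return the decoded JSON body.

    `fetch_json` is a hypothetical helper. `session` lets the caller pass a
    requests.Session (or a stand-in object) instead of the module-level API.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of trying to parse the body
    response.raise_for_status()
    return response.json()
```

With this helper, the first call would read `data1 = fetch_json("https://api.covid19india.org/data.json")`.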

Next, we will parse the response to extract, for each state, the total number of confirmed cases, deaths and active cases.

state_list = []

for i in range(0, len(data2) - 1):

    state_names = data2[i][0]

    state_confirmed = int(data2[i][1]['confirmed'])

    state_deaths = int(data2[i][1]['deaths'])

    state_active = int(data2[i][1]['active'])

    state_list.append([state_names, state_confirmed, state_deaths, state_active])

state_data = pd.DataFrame(state_list, columns=['States', 'Confirmed_Cases', 'Deaths', 'Active'])

Let us have a look at the state data.

`state_data.head()`

`state_data.tail()`

So, we can see that the data contains state-wise totals for confirmed cases, deaths and active COVID patients.
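To make the shape of that table concrete, here is a minimal sketch of the same parsing step run on a small made-up payload. The sample values, the helper name `to_state_frame`, and the guess that the final entry (which the loop above skips) is an aggregate row are all assumptions for illustration.

```python
import pandas as pd

# Made-up sample mirroring the access pattern used above:
# each entry is [state_name, {"confirmed": ..., "deaths": ..., "active": ...}]
sample = [
    ["State A", {"confirmed": "120", "deaths": "4", "active": "80"}],
    ["State B", {"confirmed": "45", "deaths": "1", "active": "30"}],
    ["Total", {"confirmed": "165", "deaths": "5", "active": "110"}],  # assumed aggregate row
]


def to_state_frame(entries):
    """Convert [name, counts] pairs into a DataFrame with integer columns."""
    rows = [
        (name, int(c["confirmed"]), int(c["deaths"]), int(c["active"]))
        for name, c in entries
    ]
    return pd.DataFrame(rows, columns=["States", "Confirmed_Cases", "Deaths", "Active"])


# Drop the last entry, as the loop above does with len(data2) - 1
state_frame = to_state_frame(sample[:-1])
print(state_frame)
```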

state_data_plot = state_data.set_index('States')

state_data_plot.plot(figsize=(18,5), linewidth=5, fontsize=20)

Now, let us look at other features of the data set, such as the states with the highest number of confirmed cases.

covid_data_states = state_data.sort_values(by='Confirmed_Cases', ascending=False)

fig = plt.figure(figsize=(15, 6))

plt.bar(covid_data_states['States'][:5], covid_data_states['Confirmed_Cases'][:5], align='center')

plt.ylabel('Number of Confirmed Cases', size = 12)

plt.title('States with maximum confirmed cases', size = 16)

For the latest COVID figures for India as a whole, let’s query another API.

response3= requests.get("https://thecampfire.in/api/get_daily_count")

data4 = response3.json()

print("Current active cases in India as of {} ----> {} ".format(data4['lastupdatedtime'], data4['active']))

print("Current confirmed cases in India as of {} ----> {} ".format(data4['lastupdatedtime'], data4['confirmed']))

print("Current death cases in India as of {} ----> {} ".format(data4['lastupdatedtime'], data4['deaths']))

To get district-level information for each state, we will use a customised API: https://thecampfire.in/api/state_data. The detailed code for the API modifications can be found in the GitHub repository linked at the end of this post.

response2 = requests.get("https://thecampfire.in/api/state_data")

data3 = response2.json()

# state = input("Enter state name to get all district details: ")

state = "West Bengal"

# Default to an empty list so the second loop is safe if no state matches
selected_state_districts = []

for i in range(0, len(data3)):

    if data3[i]['state'] == state:

        selected_state_districts = data3[i]['districtData']

for district in selected_state_districts:

    print(district['district'])

    print("Confirmed: ", district['confirmed'])

    print("Active: ", district['active'])

    print("Deceased: ", district['deceased'])

    print('------------------------')

Here we set the name of the state for which we want district-level details; in this example it is “West Bengal”.

The output is the district-level data for the chosen state.
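The lookup above can also be wrapped in a small function that returns an empty list when the state name does not match any entry. The sketch below assumes the payload layout implied by the code above (a list of objects with `state` and `districtData` keys); the function name `districts_for_state` and the sample values are made up for illustration.

```python
def districts_for_state(payload, state_name):
    """Return the districtData list for `state_name`, or [] if it is absent."""
    for entry in payload:
        if entry["state"] == state_name:
            return entry["districtData"]
    return []


# Made-up payload in the assumed layout
sample_payload = [
    {
        "state": "State A",
        "districtData": [
            {"district": "District 1", "confirmed": 10, "active": 6, "deceased": 1},
        ],
    },
    {"state": "State B", "districtData": []},
]

for d in districts_for_state(sample_payload, "State A"):
    print(d["district"], d["confirmed"], d["active"], d["deceased"])
```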

Now, let us look at the age categorisation of COVID patients across India.

response4 = requests.get("https://thecampfire.in/api/age_count_data")

data5 = response4.json()

age_list = pd.DataFrame(data5.items())

age_cat = ["1-10","10-25","25-50","50-70","Above 70","Not disclosed"]

fig=plt.figure(figsize=(15, 6))

plt.bar(age_cat,age_list[1],align='center')

plt.ylabel('Number of Confirmed Cases', size = 12)

plt.title('Age Categorisation', size = 16)

Next, let us explore the daily increase in cases over the last 16 days.

response5 = requests.get("https://thecampfire.in/api/daily_total_updates")

data6 = response5.json()

covid_updates = pd.DataFrame(data6, columns=['Date','Confirmed','Recovered','Deceased'])

`covid_updates.head()`

`covid_updates.tail()`

Next up, let’s explore the **case time-series** data from the COVID information we have gathered.

response = requests.get("https://api.covid19india.org/data.json")

data1 = response.json()

case_time_series_data = data1['cases_time_series']

data_len = len(case_time_series_data)

data = []

for i in range(0, data_len):

    # Append the year to the date string returned by the API
    date = case_time_series_data[i]['date'] + "2020"

    confirmed = int(case_time_series_data[i]['dailyconfirmed'])

    deceased = int(case_time_series_data[i]['dailydeceased'])

    recovered = int(case_time_series_data[i]['dailyrecovered'])

    totalconfirmed = int(case_time_series_data[i]['totalconfirmed'])

    data.append([date, confirmed, deceased, recovered, totalconfirmed])

For this analysis, we need to reshape the data a little, so we create a new DataFrame from the list of rows built above.

`plot_data = pd.DataFrame(data, columns = ['date','confirmed','deceased','recovered','totalconfirmed'])`

For the time-series analysis in this post, we will work only with the daily “confirmed” COVID cases in India. Let us look at that column.

`plot_data[['confirmed']].tail()`
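The dates collected above are plain strings. If a dated index is wanted, they can be parsed as sketched below. This assumes the API returns dates such as "30 January " (day, month name, trailing space), so that appending "2020" yields "30 January 2020"; the sample rows are made up and only mirror the column layout.

```python
import pandas as pd

# Made-up rows in the same layout as the rows collected above
rows = [
    ["30 January 2020", 1, 0, 0, 1],
    ["31 January 2020", 0, 0, 0, 1],
    ["01 February 2020", 0, 0, 0, 1],
]
cols = ["date", "confirmed", "deceased", "recovered", "totalconfirmed"]
ts = pd.DataFrame(rows, columns=cols)

# Parse "day month-name year" strings; the format is an assumption about the API
ts["date"] = pd.to_datetime(ts["date"], format="%d %B %Y")

# Index by date and declare a daily frequency
ts = ts.set_index("date").asfreq("D")
print(ts.index)
```

A dated index makes the plot axes readable and lets time-aware tools see the daily frequency of the series.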

# Check Stationarity of the Time-Series

A time series is said to be stationary if its statistical properties, such as mean and variance, remain constant over time. For practical purposes, we can treat a series as stationary when these properties stay roughly constant.
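As a quick illustration of this definition, the sketch below compares a synthetic stationary series (random noise) with a synthetic trending one and measures how much the rolling mean of each drifts. The data, the helper name `rolling_mean_range` and the 15-point window are illustrative assumptions, not COVID data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Stationary: constant mean and variance
noise = pd.Series(rng.normal(0, 1, n))
# Non-stationary: the mean grows with time
trending = pd.Series(np.arange(n) * 0.5) + noise


def rolling_mean_range(series, window=15):
    """Spread (max - min) of the rolling mean; small when the mean is stable."""
    rolmean = series.rolling(window=window).mean().dropna()
    return rolmean.max() - rolmean.min()


print("noise:   ", rolling_mean_range(noise))
print("trending:", rolling_mean_range(trending))
```

The rolling mean of the noise series stays in a narrow band, while that of the trending series drifts by roughly the size of the trend itself.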

So, let’s plot the data for the confirmed cases on a daily basis and explore the pattern of it.

`plot_data[['confirmed']].plot(figsize=(20,10), linewidth=5, fontsize=20)`

It is evident from the above plot that there is an increasing trend in the confirmed cases of COVID data from March along with some variations in the month of April and May.

So, we can check the stationarity of the data using two methods:

**Plotting Rolling Statistics:** We plot the moving average and moving variance and check whether they vary with time.

**Dickey-Fuller Test:** It tests the null hypothesis that a unit root is present in an autoregressive model, that is, that the series is non-stationary.

So, let us define a function that helps us determine the stationarity of our time series, the daily confirmed cases selected below.

`confirmed = plot_data['confirmed']`

The function to check the stationarity:

`from statsmodels.tsa.stattools import adfuller`

def check_time_series_stationarity(timeseries):

    # Rolling statistics over a 15-observation window
    rolmean = pd.Series(timeseries).rolling(window=15).mean()

    rolstd = pd.Series(timeseries).rolling(window=15).std()

    orig = plt.plot(timeseries, color='blue', label='Original')

    mean = plt.plot(rolmean, color='red', label='Rolling Mean')

    std = plt.plot(rolstd, color='black', label='Rolling Std')

    plt.legend(loc='best')

    plt.title('Rolling Mean & Standard Deviation')

    plt.show(block=False)

    print('------------------------------')

    print('Dickey-Fuller Test')

    dftest = adfuller(timeseries, autolag='AIC')

    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

    for key, value in dftest[4].items():

        dfoutput['Critical Value (%s)' % key] = value

    print(dfoutput)

Let’s find out what this function tells us about our time-series model.

`check_time_series_stationarity(confirmed)`

The graph clearly shows the rolling mean increasing with time, while the rolling standard deviation varies only a little. In the Dickey-Fuller output, the test statistic is greater than the critical values, so we cannot reject the null hypothesis of a unit root.

So, we can conclude that our time-series is not stationary with time.

# Process of Estimating & Eliminating Trend from Time-Series

First and foremost, we need to perform transformation operations on our series to reduce trends. Here in our case, we can see that the trend is positive, so we can apply methods like log, square root, cube root etc.

`#Estimating & Eliminating Trend`

confirmed_log = np.log(confirmed)

plt.plot(confirmed_log)

From this plot, we can find that there is a positive trend in the series after a particular time when the confirmed corona cases started to increase in India. But it's not very intuitive in the presence of noise. So we can use some techniques to estimate or model this trend and then remove it from the series.
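To see why a log transform suits a positive, roughly exponential trend, consider the sketch below. It uses synthetic numbers rather than the COVID series: on the log scale, exponential growth becomes a straight line. It also shows `np.log1p` as one possible way to handle zero counts, which is an alternative to the plain `np.log` used in this post.

```python
import numpy as np

days = np.arange(30)
cases = 5 * np.exp(0.2 * days)  # synthetic exponential growth, not real data

log_cases = np.log(cases)
# On the log scale the day-to-day increase is constant (the growth rate, 0.2)
slopes = np.diff(log_cases)
print(slopes[:3])

# Daily counts can be zero, where np.log returns -inf;
# log1p computes log(1 + x) and stays finite
daily = np.array([0, 1, 0, 3, 7])
safe_log = np.log1p(daily)
print(safe_log)
```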

## Moving average

Using this method, we take an average of “n” consecutive values depending on the frequency of time series.

`#Moving average`

moving_avg = pd.Series(confirmed_log).rolling(window=15).mean()

plt.plot(confirmed_log)

plt.plot(moving_avg, color='red')

The red line in the plot above shows the rolling mean. Because we average over 15 values, the rolling mean is not defined for the first 14 observations, so subtracting it from the log series leaves “NaN” in those positions.

`confirmed_log_moving_avg_diff = confirmed_log - moving_avg`

confirmed_log_moving_avg_diff.head(15)
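The number of undefined values follows directly from the window size. A minimal sketch on a synthetic increasing series (not the COVID data) shows that a 15-point rolling mean leaves the first 14 positions empty:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.log(np.arange(1, 41)))  # synthetic, strictly increasing
window = 15

moving_avg = series.rolling(window=window).mean()
diff = series - moving_avg

# The first window - 1 positions have no full window behind them
print(diff.isna().sum())  # prints 14
```

This is why `head(15)` above shows 14 “NaN” rows followed by the first defined difference.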

Now plot the series by dropping those “NaN” values and let’s find the stationarity of the series.

`confirmed_log_moving_avg_diff.dropna(inplace=True)`

check_time_series_stationarity(confirmed_log_moving_avg_diff)

This series looks better than the previous one. The rolling values vary slightly, but there is no specific trend, and the test statistic is close to the critical values. The rolling mean varies only a little after roughly observation 90, i.e. from 29th April 2020. So we can conclude that this series is tending towards a stationary series.

# Eliminating Trend and Seasonality from Time Series

Now, we are almost coming to the end of this post. Just a few more things and we are done with the analysis of COVID cases in India.

## Decomposition

In this approach, the trend and seasonality of the time series are modelled separately, and what remains is returned as the residual.

`from statsmodels.tsa.seasonal import seasonal_decompose`

cnf = plot_data['confirmed']

for i in range(0, len(cnf)):

    val = cnf[i]

    if val == 0:

        cnf[i] = 1

decomposition = seasonal_decompose(cnf)

trend = decomposition.trend

seasonal = decomposition.seasonal

residual = decomposition.resid

fig=plt.figure(figsize=(15, 6))

plt.subplot(411)

plt.plot(cnf, label='Original')

plt.legend(loc='best')

plt.subplot(412)

plt.plot(trend, label='Trend')

plt.legend(loc='best')

plt.subplot(413)

plt.plot(seasonal,label='Seasonality')

plt.legend(loc='best')

plt.subplot(414)

plt.plot(residual, label='Residuals')

plt.legend(loc='best')

plt.tight_layout()

Here we can see that the trend, seasonality are separated out from data and we can model the residuals. Let's check the stationarity of residuals:

`cnf_decompose = residual`

cnf_decompose.dropna(inplace=True)

check_time_series_stationarity(cnf_decompose)

In the Dickey-Fuller results, the test statistic is significantly lower than the 1% critical value, so we can reject the null hypothesis. The residual series is very close to stationary.

So, on the basis of our analysis of the COVID-19 data for “confirmed” cases in India, the first part shows that the number of cases is increasing on a daily basis, and that states like Maharashtra, Delhi and Gujarat have the highest number of confirmed cases as of now. We have also found that people above the age of 50 are the most affected. We need to flatten this rising curve by maintaining social distance, proper hygiene and the other norms introduced by our government. In the second part, the time-series analysis, we checked the stationarity of the data (only for the confirmed cases in this post) and estimated the trend.

Please stay safe and stay healthy in this moment of crisis, along with the rest of the world. Do help others and stand beside each other to overcome this pandemic. We can only defeat it by maintaining social distancing and thereby flattening the curve.

Github link: https://github.com/BitanBhowmick/COVID-19-Statistical-and-Time-Series-Analysis

*Note from the editors: **Towards Data Science** is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click **here**.*