COVID-19 outbreak: the case of Portugal(Update)

16 Apr 2020

COVID-19: a data epidemic

With the outbreak of COVID-19 one thing that is certain is that never before a virus has gone so much viral on the internet. Especially, a lot of data about the spread of the virus is going around.

In the space of a few weeks, complex terms like “group immunity” (herd immunity) or “flattening the curve” started to circulate on social networks.

Here we will try an approach that demonstrates the limitations of the model from the previous post The importance of “flattening the curve”.
This post is based on Contagiousness of COVID-19 Part I: Improvements of Mathematical Fitting (Guest Post), this post compared to the previous one shows that the fitting of the data with the SIR model is tricky, the general problem is that the fitting-algorithm is not always finding it’s way to the best solution.

Why doesn’t the algorithm converge to the optimal solution?

There are two main reasons why the model is not converging well.

Early stopping of the algorithm

The first reason is that the optim algorithm is stopping too early before it has found the right solution.

Ill-conditioned problem

The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned.

The latest data from:

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

x <- util_scrape_table("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

deaths <- util_scrape_table("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
recovered <- util_scrape_table("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")

y <- x %>% row_to_names(1) %>% clean_names %>% remove_empty() %>%
    gather(date, infections, -1:-4) %>%
    mutate(date = str_remove_all(date, "x")) %>%
    mutate(date = mdy(date)) %>%
    mutate(infections = parse_number(infections))

dy <- deaths %>% row_to_names(1) %>% clean_names %>% remove_empty() %>%
    gather(date, infections, -1:-4) %>%
    mutate(date = str_remove_all(date, "x")) %>%
    mutate(date = mdy(date)) %>%
    mutate(infections = parse_number(infections))

ry <- recovered %>% row_to_names(1) %>% clean_names %>% remove_empty() %>%
    gather(date, infections, -1:-4) %>%
    mutate(date = str_remove_all(date, "x")) %>%
    mutate(date = mdy(date)) %>%
    mutate(infections = parse_number(infections))
###################
from_date <-  dmy("02/03/2020")
sir_start_date <- "2020-03-02"
####################

pti    <- y %>% filter(country_region == "Portugal") %>%
    filter(date >= from_date) %>%
    group_by(date) %>% summarise(infections_pt = sum(infections)) #%>%

ptd    <- dy %>% filter(country_region == "Portugal") %>%
    filter(date >= from_date) %>%
    group_by(date) %>% summarise(death_pt = sum(infections)) #%>%

ptr    <- ry %>% filter(country_region == "Portugal") %>%
    filter(date >= from_date) %>%
    group_by(date) %>% summarise(recovered_pt = sum(infections)) #%>%



pt <- pti %>%
    left_join(ptd, by ="date") %>%
    left_join(ptr, by ="date")

Infected <- pt$infections_pt
Infected

##  [1]     2     2     5     8    13    20    30    30    41    59    59   112   169   245   331   448   448
## [18]   785  1020  1280  1600  2060  2362  2995  3544  4268  5170  5962  6408  7443  8251  9034  9886 10524
## [35] 11278 11730 12442 13141 13956 15472 15987 16585 16934 17448 18091

D <- pt$death_pt
D

##  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   2   3   6  12  14  23  33  43  60
## [26]  76 100 119 140 160 187 209 246 266 295 311 345 380 409 435 470 504 535 567 599

R <- pt$recovered_pt
R

##  [1]   0   0   0   0   0   0   0   0   0   0   0   1   2   2   3   3   3   3   5   5   5   5  22  22  43
## [26]  43  43  43  43  43  43  68  68  75  75 140 184 196 205 233 266 277 277 347 383

Day <- 1:(length(Infected))
#N <- 1400000000 # population of mainland china
N <- 10276617 # população de Portugal do INE

RSS <- function(parameters) {
    names(parameters) <- c("beta", "gamma")
    out <- ode(y = init, times = Day, func = SIR, parms = parameters)
    fitInfected <- out[,3]
    sum((Infected - fitInfected)^2)
}

SIR <- function(time, state, parameters) {
    par <- as.list(c(state, parameters))
    with(par, {
        dS <- -beta/N * I * S
        dI <- beta/N * I * S - gamma * I
        list(c(dS, dI))
    })
}

SIR2 <- function(time, state, parameters) {
    par <- as.list(c(state, parameters))
    with(par, {
        dS <- I * K * (-S/N *  R0/(R0-1))
        dI <- I * K * ( S/N *  R0/(R0-1) - 1/(R0-1))
        list(c(dS, dI))
    })
}

RSS2 <- function(parameters) {
    names(parameters) <- c("K", "R0")
    out <- ode(y = init, times = Day, func = SIR2, parms = parameters)
    fitInfected <- out[,3]
    #fitInfected <- N-out[,2]
    sum((Infected - fitInfected)^2)
}

### Two functions RSS to do the optimization in a nested way
Infected_MC <- Infected
SIRMC2 <- function(R0,K) {
    parameters <- c(K=K, R0=R0)
    out <- ode(y = init, times = Day, func = SIR2, parms = parameters)
    fitInfected <- out[,3]
    #fitInfected <- N-out[,2]
    RSS <- sum((Infected_MC - fitInfected)^2)
    return(RSS)
}
SIRMC <- function(K) {
    optimize(SIRMC2, lower=1,upper=10^5,K=K, tol = .Machine$double.eps)$objective
}

#####
# wrapper to optimize and return estimated values
getOptim <- function() {
    opt1 <- optimize(SIRMC,lower=0,upper=1, tol = .Machine$double.eps)
    opt2 <- optimize(SIRMC2, lower=1,upper=10^5,K=opt1$minimum, tol = .Machine$double.eps)
    return(list(RSS=opt2$objective,K=opt1$minimum,R0=opt2$minimum))
}

# starting condition
init <- c(S = N-Infected[1], I = Infected[1]-R[1]-D[1])
#init <- c(S = N-Infected[1], I = Infected[1])

Result of estimation of the reproduction number (R0)

Opt <- optimr(par=c(0.5, 0.5), fn=RSS,
              #x=x,
              method = "L-BFGS-B",
              lower=c(0, 0),
              upper = c(1, 1),
              control = list(parscale = c(10^-4,10^-4)))
Opt$message

## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

#
Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par

##      beta     gamma 
## 1.0000000 0.7826315

#
R0 <- as.numeric(Opt_par[1] / Opt_par[2])
R0

## [1] 1.277741

Currently the calculation of R0 by the model of the previous post

gives: R0 = 1.2777406

# fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))
#
# height_pand<- fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemic
# height_pand
# attributes(height_pand)
# height_day <- as.integer(row.names(height_pand))
# from_date + height_day # height of pandemic
# # [1] "2020-05-02"
################################################################
Opt3 <- getOptim()
Opt3

## $RSS
## [1] 163285478
## 
## $K
## [1] 0.2632098
## 
## $R0
## [1] 1.06201

Opt3$R0

## [1] 1.06201

#  Opt3$R0
# [1] 1.0607

# Opt_par2 <- setNames(Opt2$par, c("K", "R0"))
# Opt_par2

R0 by the current model

The current model gives: R0 = 1.0620105 as we see the result is quite different.

Some differences of the SIR model

This SIR model uses the parameters $K = \beta - \gamma$ and $R_0 = \frac{\beta}{\gamma}$

Opt_par3 <- setNames(Opt3[2:3], c("K", "R0"))
Opt_par3

## $K
## [1] 0.2632098
## 
## $R0
## [1] 1.06201

# plotting the result
t <- 1:80 # time in days

fit3 <- data.frame(ode(y = init, times = t, func = SIR2, parms = Opt_par3))

par(mar = c(1, 2, 3, 1.1)) # Set the margin on all sides to 2
# par(mar = c(bottom, left, top, right))
old <- par(mfrow = c(1, 1))

#Opt3 <- getOptim()
#Opt_par3 <- setNames(Opt3[2:3], c("K", "R0"))
Opt_par3

## $K
## [1] 0.2632098
## 
## $R0
## [1] 1.06201

##############
from_date <-  dmy("02/03/2020")
sir_start_date <- "2020-03-02"

height_pand<- fit3[fit3$I == max(fit3$I), "I", drop = FALSE] # height of pandemic
height_pand

##           I
## 41 17870.26

#           I
# 57 330697.5 # 10-04-2020
# 58 318210.1 # 11-04-2020
attributes(height_pand)

## $names
## [1] "I"
## 
## $row.names
## [1] 41
## 
## $class
## [1] "data.frame"

height_day <- as.integer(row.names(height_pand))
height_day

## [1] 41

max_infected <- max(fit3$I)
height_of_pandemic <- max_infected

forecast_infections <- fit3$I

max_infected_day <- from_date + height_day # height of pandemic
max_infected_day

## [1] "2020-04-12"

# [1] "2020-04-11"

height_of_pandemic_date <- max_infected_day

actual_infections <- Infected

days_from_inic <- data.frame(
    date = as.Date(from_date + t))

#
fit_i <- fit3 %>%
    cbind(days_from_inic)

fit_i$actual_infections <- c(Infected, rep(NA, nrow(fit_i)-length(Infected)))

#
full_forecast_plot <- fit_i %>% # output_df
  ggplot(aes(x = date)) +
  geom_line(aes(y = I / 1000), colour = "blue") +
  geom_point(aes(y = actual_infections / 1000), colour = "red") +
  geom_vline(xintercept = height_of_pandemic_date, line_type = "dotted") +
  annotate(geom = "text",
           x = height_of_pandemic_date,
           y = (height_of_pandemic / 1000) / 2,
           inherits = FALSE,
           label = str_glue("Max. infections of {comma(height_of_pandemic / 1000, accuracy = 0.01)} on {height_of_pandemic_date}"),
           color = "black",
           angle = 90,
           vjust = -0.25,
           size = 3) + #  size = 4
  
  theme_bw() +
  xlab("") +
  ylab("Number of Infections - Thousands") +
  scale_x_date(breaks = scales::pretty_breaks(n = 12))+   # for one year date labels
  scale_y_continuous(label = comma) +
  ggtitle(label = "Portugal - Corona Virus Infections Forecast",
          subtitle = "Blue Points are Forecast and Red Points are Actuals")
# caption  = "Data: https://github.com/CSSEGISandData/COVID-19")


full_forecast_plot

plot of chunk unnamed-chunk-6

height_of_pandemic_date

## [1] "2020-04-12"

Final conclusions

Based on this model the peak of the epidemic has already happened on day 2020-04-12 with a maximum of 17 870.
This is all we want, but it doesn’t mean it is correct. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises which are the underlying data and models.

References

If you liked this post, you can share it with your followers or follow me on Twitter!

Data Science and R

Home About Archive Links

COVID-19 outbreak: the case of Portugal(Update)

COVID-19: a data epidemic

Why doesn’t the algorithm converge to the optimal solution?

Early stopping of the algorithm

Ill-conditioned problem

The latest data from:

Result of estimation of the reproduction number (R0)

Currently the calculation of R0 by the model of the previous post

R0 by the current model

Some differences of the SIR model

Final conclusions

References

Data Science and R

Home About Archive Links

COVID-19 outbreak: the case of Portugal(Update)

COVID-19: a data epidemic

Why doesn’t the algorithm converge to the optimal solution?

Early stopping of the algorithm

Ill-conditioned problem

The latest data from:

Result of estimation of the reproduction number (R0)

Currently the calculation of R0 by the model of the previous post

R0 by the current model

Some differences of the SIR model

Final conclusions

References

Related Posts

The importance of “flattening the curve” 16 Apr 2020

The importance of “flattening the curve” 16 Apr 2020

MRI image segmentation 08 Jul 2015