Latest update done on Thu Jun 18 09:27:07 2020

Challenge

The COVID-19 outbreak is first and foremost a societal crisis, threatening lives and the well-being of our global community. Society now, more than ever, needs to collaborate to protect people’s lives and health, and to search for lasting solutions.

With such a vast amount of information on COVID-19, and news being broadcast every single minute, it’s really daunting to work out which geographies are having problems and how large those problems are.

Did you find it difficult to truly understand what the situation in your country was?
With so many measures available, which is the key metric to look at?
Is it growing in geography X?
How large is the problem in country X compared to others?
That was my initial intent: just having an independent and informed view of the situation.

Solution

COVID data is being collected from many different sources (GitHub, international press, health services, …), so it’s critical that you choose a credible one. With that in mind, I went to the ECDC and downloaded the data [1].

The data is reviewed and approved daily by the ECDC’s Epidemic Intelligence specialist team, between 6 and 10 am each day, so you can easily download it through R and get these graphics done in just a few clicks and in less than 15 seconds.
There is a key learning here too: when analyzing data, the quality of the data you work with limits the reliability and value of the outcomes you can get. ‘Trash in, trash out’, as the saying goes.

Last but not least, unless you have a virus expert within arm’s reach, you should gain some knowledge of what COVID is. How deep you go depends on how far you want to take the analysis, but at least gain a basic understanding of the key concepts: the data measures and how they are being measured, so you can give the data the value it deserves. This link from ourworldindata will help a lot to get a basic understanding quickly. Always be wise enough to lean on these knowledgeable sources to reduce your unknown areas or blind spots. That’s another key learning: the value of your analysis will always be enhanced by people with business domain knowledge. And this is my chance to point to the CRISP-DM model, which secures proper iterations with the business, who should be responsible for the decisions taken.

… load the basic libraries to make the software work, download the data, some data wrangling stuff, some pretty easy calculations…

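# Load httr, which provides GET(), authenticate() and write_disk()
library(httr)

# Download the ECDC case distribution file to a temporary CSV
tf <- tempfile(fileext = ".csv")
GET("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv",
    authenticate(":", ":", type = "ntlm"),
    write_disk(tf))
# Read the downloaded file into a data frame
input <- read.csv(tf, header = TRUE, stringsAsFactors = FALSE)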
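
The wrangling step itself is not shown here. As a rough idea of what such a step could look like (a sketch only, not necessarily what was done for these plots), the snippet below parses the report date and builds running totals per country. It runs on made-up rows; the raw column names (dateRep, cases, deaths, countriesAndTerritories) are assumed to match the ECDC export, and `date`, `newdeaths` and `wrangle()` are hypothetical names introduced for illustration.

```r
# Toy rows shaped like the ECDC export. Column names are assumed and the
# numbers are made up.
input <- data.frame(
  dateRep = c("03/01/2020", "01/01/2020", "02/01/2020", "01/01/2020"),
  cases = c(30, 10, 20, 5),
  deaths = c(2, 0, 1, 0),
  countriesAndTerritories = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename columns, parse dates, add running totals
wrangle <- function(df) {
  out <- data.frame(
    date = as.Date(df$dateRep, format = "%d/%m/%Y"),
    countries = df$countriesAndTerritories,
    newcases = df$cases,
    newdeaths = df$deaths,
    stringsAsFactors = FALSE
  )
  # Sort by country and date so the cumulative sums run in time order
  out <- out[order(out$countries, out$date), ]
  # Running totals within each country
  out$Total.Cases.cumsum <- ave(out$newcases, out$countries, FUN = cumsum)
  out$Total.Dead.cumsum <- ave(out$newdeaths, out$countries, FUN = cumsum)
  rownames(out) <- NULL
  out
}

data.plot <- wrangle(input)
print(data.plot)
```

Sorting before calling `cumsum` matters: the raw file is not guaranteed to be in chronological order, and a running total over unsorted rows would be meaningless.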

With the data wrangling exercise finished, I have a basic file from which I can derive some basic but still insightful conclusions. With such an amount of data, though, it’s worth focusing your analysis: for example, on countries with more than 5,000 cases detected, starting from week 1 of the year, so you can see how other countries have addressed the COVID challenge since the beginning.

LevelInfection <- 5000
WeekMonitoring <- 1
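
One way these two parameters might be applied is sketched below on made-up numbers. The column names `countries`, `Week`, `newcases` and `TotalConfCases` follow the plotting code in this post, but the way `TotalConfCases` is computed here is an assumption, and the toy data frame is purely illustrative.

```r
# Parameters repeated here so the sketch runs on its own
LevelInfection <- 5000
WeekMonitoring <- 1

# Toy data: one row per country and week; numbers are made up
toy <- data.frame(
  countries = c("A", "A", "B", "B", "C", "C"),
  Week = c(0, 1, 0, 1, 0, 1),
  newcases = c(1000, 6000, 100, 200, 5000, 100),
  stringsAsFactors = FALSE
)

# Total confirmed cases per country, repeated on every row of that country
toy$TotalConfCases <- ave(toy$newcases, toy$countries, FUN = sum)

# Keep countries above the threshold and weeks from WeekMonitoring onwards
data.plot <- toy[toy$TotalConfCases > LevelInfection &
                   toy$Week >= WeekMonitoring, ]
print(data.plot)
```

In this toy example countries A and C pass the threshold (7,000 and 5,100 cases) while B does not, and only the rows from week 1 onwards are kept.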

With these two parameters I focus on specific countries, do some tweaks to plot in decreasing/increasing order to improve the readability of the plot, and we’re now ready to plot the data:

China was the first country to address the pandemic, so it can shed some light on what might happen in other geographies. Let’s first plot what happened there:
Figure showing evolution of new confirmed cases by week and day for China.


In the figure you can observe the weekly evolution of new confirmed cases as reported by the ECDC. There’s another piece of information in the graph: a daily comparison on a week-by-week basis, for example Monday this week against Monday the previous week. This is useful to see the decrease/increase on a weekly basis.
You’ll notice that the week starts on Wednesday; this is done to reduce the weekend effect, since figures are said to be reported with delays over the weekend.
In the zoomed section you can also observe the Hammer and the Dance.
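
The Wednesday-to-Tuesday weeks and the same-weekday comparison could be built along these lines. This is a minimal sketch on made-up numbers, not the code behind the figure; `wednesday_week()`, `week_start`, `day_in_week` and `change_vs_prev_week` are hypothetical names chosen for illustration.

```r
# Assign each date to a week that runs from Wednesday to Tuesday
wednesday_week <- function(dates) {
  dates <- as.Date(dates)
  iso_wday <- as.integer(format(dates, "%u"))  # 1 = Monday, ..., 7 = Sunday
  days_since_wed <- (iso_wday - 3) %% 7        # 0 on Wednesday, 6 on Tuesday
  data.frame(
    date = dates,
    week_start = dates - days_since_wed,       # the Wednesday opening the week
    day_in_week = days_since_wed + 1           # 1 = Wednesday, ..., 7 = Tuesday
  )
}

# Toy series: 14 consecutive days starting on Wednesday 2020-06-03,
# with made-up case counts
dates <- seq(as.Date("2020-06-03"), by = "day", length.out = 14)
toy <- wednesday_week(dates)
toy$newcases <- c(100, 120, 90, 60, 50, 110, 130,
                  80, 100, 95, 40, 45, 90, 100)

# Change versus the same weekday one week earlier (NA for the first week)
prev <- toy$newcases[match(as.numeric(toy$date) - 7, as.numeric(toy$date))]
toy$change_vs_prev_week <- toy$newcases - prev
print(toy)
```

Comparing a day with the same weekday of the previous week, rather than with the day before, keeps weekday reporting patterns from being mistaken for a real change.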

Next I plotted the same graph over different geographies so you can get a quick snapshot of where they are: have they flattened the curve? Is the number of infected people increasing or decreasing? Can you clearly see any trend? Eventually I obtained this graph, filtered to the 12 most relevant geographies for me (proximity to my home country!), ordered by infected population, and with data from Wednesday to Tuesday, so you can visually compare how each country is evolving on a daily and even week-by-week basis. Note that there are different faceting options: plotting with fixed y-axes helps compare one country to the others, while variable y-axes show how each country is progressing on its own.

# Filter data for plotting some countries (filter() comes from dplyr)
selected.countries <- c("France", "Germany", "Italy", "United_Kingdom",
                        "Iran", "Switzerland", "Turkey", "Belgium",
                        "Netherlands", "Canada", "Brazil", "Spain")
data.plot <- filter(data.plot, countries %in% selected.countries)

ggplot(data=data.plot, aes(x=Week,y=newcases))+
  geom_col(aes(fill=Day.txt))+
  #facet_wrap(vars((TotalConfCases),countries),scales ="free_y")+
  facet_wrap(vars((TotalConfCases),countries))+
  labs(x = "Week", 
       y ="New Confirmed Cases", 
       title =paste("New confirmed cases by day on a week-rolling basis",Sys.Date()),
       subtitle = "Only for selected countries/territories",
       caption = "Data source: European Centre for Disease Prevention and Control",
       fill="Day of the week"
  )+
    theme_secun()
Figure showing evolution by week and day for twelve specific geographies.


Now, let’s plot another interesting graph (known as CFR, Case Fatality Rate), which can raise more questions than answers.
Here the differences between countries can suggest different hypotheses (again, to be checked with the right people with domain knowledge). Unfortunately, you could also end up drawing wrong conclusions from good data (fewer/more deaths in country X/Y) if you ignore the reliability of the data or its comparability across countries. This is sadly common: many well-known newspapers publish graph after graph comparing apples and pears.
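
For reference, the CFR is the ratio of cumulative confirmed deaths to cumulative confirmed cases. A minimal sketch of the calculation on made-up numbers follows; the column names match the plotting code in this post, while `cfr_percent()` and `CFR` are hypothetical names introduced for illustration.

```r
# Toy cumulative totals for two hypothetical countries; numbers are made up
toy <- data.frame(
  countries = c("A", "A", "B", "B"),
  Total.Cases.cumsum = c(1000, 2000, 500, 4000),
  Total.Dead.cumsum = c(10, 50, 0, 40),
  stringsAsFactors = FALSE
)

# CFR in percent: cumulative known deaths over cumulative known cases.
# Rows with zero cases give NA instead of a division by zero.
cfr_percent <- function(deaths, cases) {
  ifelse(cases > 0, 100 * deaths / cases, NA_real_)
}

toy$CFR <- cfr_percent(toy$Total.Dead.cumsum, toy$Total.Cases.cumsum)
print(toy)
```

Because both numerator and denominator only count known deaths and known cases, the ratio depends heavily on how much each country tests and how it attributes deaths, which is exactly why comparisons across countries need care.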

ggplot(data=data.plot, aes(y=Total.Dead.cumsum/1000,x=Total.Cases.cumsum/1000))+
  geom_point()+
  #  geom_text()
  facet_wrap(countries ~ .)+
  labs(y = "Cumulative Known Fatalities (thousands)", 
       x ="Cumulative Known Cases (thousands)", 
       title =paste("Daily fatalities vs confirmed cases. Date=",Sys.Date()),
       caption = "Data source: European Centre for Disease Prevention and Control"
  )+
    theme_secun()
Figure showing evolution of CFR in 6 geographies.


Do you now have a better understanding of how the pandemic has evolved in these geographies? Hopefully yes!

Action

This is the tricky and interesting part of the data analysis [2], where you can reach some hypotheses or conclusions, right or wrong, but you should still seek out the knowledge of domain experts and check your findings with them.

Looking at the figures above, for example, you could say that one country is performing better or worse than others in terms of new confirmed cases or fatalities, or that country X has already gone through the worst part of the pandemic. And you could be entirely wrong: that’s why understanding how data is being collected, processed, and measured is critical, as is knowing some possible caveats (temporal delays in data collection, inaccurate definitions of some data points).

Needless to say, I’m not very keen to draw conclusions from this data, because I have great respect for the people working on this unfortunate topic, whose knowledge was acquired through the study of comparable situations.

Remember that working with data is always a learning process, where the feedback loop between the data analyst and professionals in the field can prevent wrong insights, whether because the data is wrong or because some assumptions were made that led to wrong hypotheses and therefore wrong decisions.


  1. Depending on when you run this report, the data may have been updated by the ECDC, leading to different results.

  2. This analysis was done for illustrative purposes only. COVID-19 will be the topic of the decade, and many studies and analyses have been done across the board. This simple and straightforward example doesn’t try to solve such a complex topic, but to shed a bit of light on it.