For all the technology we have, it can still be frustratingly difficult to get any concrete information from the media. Sometimes all you want to do is to cut through the noise and see some real numbers. Watching talking heads argue for a half hour probably isn’t going to tell you much about how the COVID-19 virus is spreading through your local community, but seeing real-time data pulled from multiple vetted sources might.

Having access to the raw data about COVID-19 cases, fatalities, and recoveries is, to put it mildly, eye-opening. Even if day to day life seems pretty much unchanged in your corner of the world, seeing the rate at which these numbers are climbing really puts the fight into perspective. You might be less inclined to go out for a leisurely stroll if you knew how many new cases had popped up in your neck of the woods during the last 24 hours.

But this article isn’t about telling you how to feel about the data, it’s about how you can get your hands on it. What you do with it after that is entirely up to you. Depending on where you live, the numbers you see might even make you feel a bit better. It’s information on your own terms, and in these uncertain times, that might be the best we can hope for.

Scraping the CDC Website

If you’re in the United States, then the Centers for Disease Control and Prevention is perhaps the single most reliable source of COVID-19 data right now. Unfortunately, while the agency offers a wealth of data through their Open Data APIs, it seems that things are moving a bit too fast for them to catch up at the moment. At the time of this writing there doesn’t appear to be an official API to pull from, only a human-readable website.

Of course if we can read it, than so can the computer. The website is simple enough that we can split out the number of total cases with nothing more than a few lines of Python, we don’t even need to use a formal web scraping library. It should be noted that this isn’t a good idea under normal circumstances as changes to the site layout could break it, but this (hopefully) isn’t something we need to be maintaining for very long.

import requests

# Download webpage
response = requests.get('https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html')

# Step through each line of HTML
for line in response.text.splitlines():
    # Search for cases line
    if "Total cases:" in line:
        # Split out just the number
        print(line.split()[2][:-5])

Everything should be pretty easy to understand in that example except perhaps the last line. Basically it’s taking the string from the web page, splitting it up using spaces as delimiters, and then cutting the last five characters off the back to remove the closing HTML tag. Definitely a hack, but that’s sort of what this site is all about.

There are …read more

Source:: Hackaday