I have started to look at the number of people killed by drones in Pakistan because I wanted to produce some kind of multimedia interactive project using Zeega, a particular new on-line tool which is possible to use to combine videos, picture and sound uploaded in the Cloud. However, I wasn’t able to stand working with this tool. But I’ll talk about that in a future post.
One of the subject which collects data from various sources is the Bureau of Investigative Journalism, that have a special section with the data from Pakistan, Yemen and Somalia since 2009 until now. The problem with them is that they are not in a form that can be downloaded. Indeed they are just wrote down in a sort of list.
The three levels of Outwit Hub
What can we do? The solution had been suggested by Paul Bradshaw, who recommend to use the =importXml of Google docs. I had a go with that but the html of the site is not formatted in a way that I could deal with, due surely to my lack of experience with that. Reading the code I thought that perhaps Outwit Hub could be the proper mean to reach my target. I have already used in the past this tool and it has been revealed to be really useful, mostly because, if you are lucky, it doesn’t need programming literacy. The potentiality of the software can be divided into three levels:
- level 1: it is enough to paste the link and click on the table, list or guess buttons on the left. This is what I did in a previous post.
- level 2: we need to look at the code with the scraper function. Found the information we need in the code it is sufficient to insert the tags before and after of them in the “marker before” and “marker after” sections.
- level 3: use the regular expression language.
As Regex looked difficult to learn, I aimed to use the level 2 of Outwit.
After an initial struggle I found a way to get two columns, one for the date and one for the killings. Seeing that the code before the month of the attacks is <p><strong> I have put it in the “marker before” and the year of the attacks as a “marker after”, because I obviously don’t need it.
For what concern the numbers of people killed I put a strange poligonal symbol as a “marker before” and then the word “killed” as a “marker after“.
And that’s it. As a result we got a table with more than than two columns, with the others having the injured of the strikes and the children killed. Repeated the same process 5 times for the pages referring to a specific year since 2009, we got 5 tables with the 2 columns which we are interest in, date and number of killed.
The good old Excel
What we have to do now is merge the tables in one. I thought initially to test the merge function of Google Fusion Tables but I wasn’t confident that the results would have been interesting for what I needed, so I used Excel.
The first column contains a code followed by the date, while the second has the minimum number of killed separated by a -. Highlighting the entire column I used the “text in column” function to separate them. But as a result of this I had the month and the number of the day separated as well. It is not a bad news anyway. Because just inserting in a left column the year and using the CONCATENATE function to merge the three column I had what I was looking for, a date column and, well, two columns with the killings, the minimum one and the maximum.
Why two columns? This is because the Bureau rely on different sources and sometimes the data collection in such drammatic event comes to different subjects, so they include a range between the minimum estimated number of people killed and the maximum one.
Anyway I had to make sure that the CONCATENATE function included also the slash between the year, month and day. This is possible just inserting a column, clicking in the first cell, writing “=CONCATENATE”, selecting the day column in text 1 box, writing “/” in text 2, and repeating that for the month and year columns.
But it’s not finished as we need to clean a bit the spreadsheet. In fact we have some extra words, like “total“, in the killed columns and, in the maximum column a few cells are empty. This is because the number of the victims are certain. So first of all we need to highlight the killed columns, format them from “general” to “numbers” so the words suddenly disappear. And then copy for each empty cell the figure of the minimum column.
The speadsheet should be like this.
Probably there is a better way to do this operation.
Now we have five organic spreadsheets, that can be merged by copy and paste.
Visualizing with Datawrapper
The next step is visualize them. Because I have already used Tableau Public, this time I want to take advantage of the on-line resource Datawrapper.
It is a simple-to-use tool, that can offer different kind of visualizations, giving also the possibility to the reader to get the data that the creator used. This is a review by journalism.co.uk.
The process is divided into 4 steps.
- First of all we have to copy and paste our spreadsheet.
- Secondly we are offered to choose if the row or column to use as label and to indicate our source.
- In the next step we can select the kind of visualization to use. For data through a period of time I personally prefer the line graph. So, selecting the line graph, what we see is a mess, because the max number of dead cover the minimum. The solution may be to use the highlight function to spotlight the min number of dead.
- The 4th step allows us to publish our work. I personally decreased the heigth and increase the width, so the trend results clearer.
This is the result.