First assignment: exploring data journalism and visualizations

For this assignment I have chosen to explore the field of data journalism: how to visualize data and make content interactive.

Completeness of a story

This decision came from my past experience as a journalist working at a traditional local newspaper, the Corriere Adriatico. There I found my work terribly limiting, as all I was supposed to do was report what was happening in town. Whenever I had data to work with, I had no way to visualize it to make my stories less dry. Worse, I could not build a story on the data at all, because my editors did not trust these practices. Data journalism represents, to me, the possibility of making a story more complete and more attractive to readers. This is why I don't really like to separate journalism that uses data from traditional journalism: this is a path all traditional media should follow.

Useful for this purpose was the "Introduction to Infographics and Data Visualization" course taught by Alberto Cairo, through which, before starting this course, I learned how to use the Tableau Public software and how to judge a visualization.

What it means to do data journalism today

I really enjoyed the comparison drawn by Simon Rogers, editor of the Guardian Datablog, between data journalism and punk, because it describes the reality of this field pretty well: anyone can do it. The difference is that making punk music requires buying an instrument, while the web is full of free, useful tools for telling stories with numbers. The ones I have used most are Tableau Public, Datawrapper and Infogr.am for visualizations, and Google Docs and Outwit Hub as scraping tools.

Data is meant to describe reality. Really?

The interesting thing about data is that everyone expects it to be the best medium for describing reality; it is, after all, the medium scientists use to describe the universe. But how is it collected? This reading was really helpful in answering that question, describing how surveys and procedures may be flawed or incomplete, sometimes undermining the reliability of the information collected. A good example from my own work is the data on people killed by drones in Pakistan during the Obama administration. I based my work on the figures collected by the Bureau of Investigative Journalism, while the New America Foundation collects different numbers. Who can we trust? Moreover, the data I used gives both a minimum and a maximum number of deaths, because the Bureau's information derives from different sources claiming different figures.

Communities of practice

Across the web it is possible to find a wealth of communities built around data journalism. One of these is surely Hacks/Hackers; I met the Birmingham group during an introductory lesson on data analysis. Another is the Data Driven Journalism group. Being subscribed to its mailing list, I asked the group, a bit worried about coming across as ignorant, what would be the best tool for creating some kind of interactive product merging data with other forms of media, such as video. They responded enthusiastically, recommending a few tools I already knew.

A second assignment proposal 

A future project I would like to realize is a focus on European farm subsidies. More than once the European Parliament has been accused of failing to make farm funding transparent. What I would like to do is get the data from an interesting website called farmsubsidies.org, which aims to collect information on this topic, and build a series of visualizations showing where this money goes and who really receives this financial help.

From scraping to visualizing drone killings in Pakistan: the Outwit Hub (level 2)/Datawrapper combination

I started looking at the number of people killed by drones in Pakistan because I wanted to produce some kind of multimedia interactive project using Zeega, a new online tool for combining videos, pictures and sound uploaded to the cloud. However, I couldn't stand working with that tool; I'll talk about that in a future post.

One of the organisations collecting data from various sources is the Bureau of Investigative Journalism, which has a special section with data on Pakistan, Yemen and Somalia from 2009 to now. The problem is that the data is not in a downloadable form; it is just written down as a sort of list.

The three levels of Outwit Hub 

What can we do? A solution was suggested by Paul Bradshaw, who recommends using the =importXML function of Google Docs. I had a go with that, but the HTML of the site is not formatted in a way I could deal with, surely due to my lack of experience. Reading the code, I thought that perhaps Outwit Hub could be the proper means of reaching my target. I have used this tool in the past and it proved really useful, mostly because, if you are lucky, it doesn't require programming literacy. The software's capabilities can be divided into three levels:

  • Level 1: it is enough to paste the link and click on the table, list or guess buttons on the left. This is what I did in a previous post.
  • Level 2: we need to look at the code with the scraper function. Having found the information we need in the code, it is enough to insert the tags that appear before and after it in the "marker before" and "marker after" fields.
  • Level 3: use the regular expression language.

As regex looked difficult to learn, I aimed for level 2 of Outwit.

After an initial struggle I found a way to get two columns, one for the date and one for the killings. Seeing that the code before the month of each attack is <p><strong>, I put that in the "marker before" field and the year of the attacks in "marker after", because I obviously don't need it.

As for the number of people killed, I put a strange polygonal symbol as the "marker before" and the word "killed" as the "marker after".
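For readers without Outwit Hub, here is a minimal Python sketch of the same marker-based extraction. It is only a sketch: the URL is illustrative and the two regular expressions simply mimic the "marker before"/"marker after" pairs described above, so they would need checking against the real markup.

    import re
    import requests

    # Illustrative address: the real BIJ drones page may differ.
    url = "http://www.thebureauinvestigates.com/category/projects/drones/"
    html = requests.get(url).text

    # "Marker before" <p><strong> and "marker after" the four-digit year.
    dates = re.findall(r"<p><strong>(.*?)\s*\d{4}", html)

    # Casualty figures: a single number or a "min - max" range before "killed".
    killed = re.findall(r"(\d+\s*[-–]\s*\d+|\d+)\s*killed", html)

    for date, count in zip(dates, killed):
        print(date, count)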


And that's it. The result is a table with more than two columns; the extra ones hold the number of people injured in the strikes and the number of children killed. Repeating the same process five times, once per yearly page since 2009, we get five tables with the two columns we are interested in: date and number killed.

Good old Excel

What we have to do now is merge the tables into one. I initially thought of testing the merge function of Google Fusion Tables, but I wasn't confident the results would suit what I needed, so I used Excel.

The first column contains a code followed by the date, while the second has the minimum and maximum numbers killed, separated by a dash. Highlighting the entire column, I used the "Text to Columns" function to separate them. As a side effect, the month and the day number were separated as well. That is not bad news, though: by inserting the year in a column to the left and using the CONCATENATE function to merge the three columns, I had what I was looking for: a date column and, well, two columns with the killings, the minimum and the maximum.

Why two columns? Because the Bureau relies on different sources, and in such dramatic events different parties collect different figures, so it publishes a range between the minimum and maximum estimated numbers of people killed.

I also had to make sure the CONCATENATE function included the slashes between year, month and day. This is done by inserting a column, clicking in its first cell, typing "=CONCATENATE", selecting the day column in the Text1 box, writing "/" in Text2, and repeating that for the month and year columns.

But we are not finished: the spreadsheet needs some cleaning. There are some extra words, like "total", in the killed columns, and in the maximum column a few cells are empty; this is because in those cases the number of victims is certain. So first we highlight the killed columns and change their format from "General" to "Number" so the stray words disappear, and then copy the figure from the minimum column into each empty cell.
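The same cleaning can be scripted. A minimal pandas sketch, assuming the raw killings column is named "killed" and holds strings like "4 - 6" (the file and column names here are mine, not the Bureau's):

    import pandas as pd

    df = pd.read_excel("strikes_2012.xlsx")  # illustrative file name

    # Split "min - max" into two parts; partition always yields three columns,
    # so rows without a dash are handled too.
    parts = df["killed"].astype(str).str.partition("-")
    df["killed_min"] = pd.to_numeric(parts[0], errors="coerce")  # "total" -> NaN
    df["killed_max"] = pd.to_numeric(parts[2], errors="coerce")

    # Where no range was given the number is certain: copy min into max.
    df["killed_max"] = df["killed_max"].fillna(df["killed_min"])

    # Rebuild a single slash-separated date column, like CONCATENATE above.
    df["date"] = (df["year"].astype(str) + "/" +
                  df["month"].astype(str) + "/" + df["day"].astype(str))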

The spreadsheet should look like this.

Killed by drones spreadsheet

Probably there is a better way to do this operation.

Now we have five tidy spreadsheets that can be merged by copy and paste, or by the scripted alternative below.
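In pandas the merge is a one-liner; the five DataFrame names below are hypothetical stand-ins for the cleaned yearly tables:

    import pandas as pd

    # One cleaned table per yearly page, 2009-2013 (the names are mine).
    merged = pd.concat([df_2009, df_2010, df_2011, df_2012, df_2013],
                       ignore_index=True)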

Visualizing with Datawrapper

The next step is to visualize the data. Since I have already used Tableau Public, this time I want to take advantage of the online tool Datawrapper.

It is a simple-to-use tool offering different kinds of visualizations, which also gives readers access to the data the creator used. This is a review by journalism.co.uk.

The process is divided into four steps.

  • First of all, we copy and paste our spreadsheet.
  • Secondly, we choose which row or column to use as labels and indicate our source.
  • In the next step we select the kind of visualization. For data over a period of time I personally prefer the line graph. Selecting it here, though, what we see is a mess, because the maximum number of dead covers the minimum. The solution may be to use the highlight function to spotlight the minimum number of dead.
  • The fourth step allows us to publish our work. I personally decreased the height and increased the width, so the trend reads more clearly.

This is the result.

From Scraping to Mapping the local currencies: the Outwit Hub/Fusion Table combination (part 2)

What follows is a description of my personal experience of mapping the unmappable, through classic "trial and error" learning.

Google Fusion Tables is a really useful and versatile tool that can handle multiple tasks, like merging two tables or even creating charts. However, I have personally used it mostly to map specific locations, such as cities.

Once on the Google Drive website, it is sufficient to click on "Create/Fusion Table" and import the spreadsheet from our computer. After the preview, and having entered a description of the project, we see this page.

Local currencies

At this point I'd suggest seeing what the map visualizes by clicking the "Map" button at the top.

There are supposed to be more than 100 locations, but only about half of them are actually visualized. Why? It's really simple: many of them are not proper locations at all. There are places called "locale", "francophone", "movement national" or "All communities". These just describe in general terms where the systems operate, so while some entries are proper mappable locations, others are not.

So, what to do? If we desperately need this map, my naive suggestion is to edit each of them so that there is at least a country as the location. In this way "Scheme UK wide" becomes just "UK", and "Francaise" becomes "Belgium" (it refers to the French-speaking part).
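This manual relabelling can also be scripted as a simple lookup table. A sketch in pandas, where the file name, column name and replacement map are all my own assumptions:

    import pandas as pd

    df = pd.read_csv("local_currencies.csv")  # illustrative file name

    # Map non-geographic labels to something a geocoder can place.
    fixes = {
        "Scheme UK wide": "UK",
        "Francaise": "Belgium",      # refers to the French-speaking part
        "All communities": None,     # genuinely unmappable: drop it
    }
    df["location"] = df["location"].replace(fixes)
    df = df.dropna(subset=["location"])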

Does it look better? Yes. Is it what we had in mind? Absolutely not. Still, having found this topic really interesting, as I wrote in part 1, I really intended to show it. The worst part came when I studied the trends of these currencies a bit. What I wanted was a trend chart showing how many local systems were created each year, and I thought of using the numbers shown on this page.

However, digging into the site, I found that the numbers don't refer to each system's date of creation, just to the date it joined the website. This means other important systems may exist but, not knowing about this website, are not registered. So what I wanted to show might be only part of the whole picture.

Do you think this is another failure? Probably, but the experience leads to another open question: the lack of direct data and the impossibility of collecting it in some circumstances.

In case you are interested in the map: I obviously couldn't embed it in this WordPress site, so this is just a snapshot. Enjoy!

Local currencies

 

Trial and error with waste: a Birmingham Mail experience

When it comes to learning journalism, one of the best ways is surely experience in the field. At least this has worked for me so far, first with my Italian "newspaper" prose and the interpretation of local regulations, and now with data analysis.


It was a Friday when I learned that the Birmingham Mail might be interested in an article based on a dataset I had found on the Defra (Department for Environment, Food and Rural Affairs) website, and that they wanted something before a budget discussion to be held the following Tuesday. A three-day deadline, a dataset and quotes to chase. Not so bad, but given my inexperience in dealing with data I had plenty of reasons to be anxious.

The data provides annual figures on waste management across all the British regions. As can be seen, there are four tables, of which the first is probably the most important, as it gives all (or, as we'll see later, most) of the management particulars. There aren't that many figures, but they can be tricky for an inexpert eye like mine, and what I realized at the end of the day is that it took me too long to analyse the data, visualize it and write a simple story.

At the bottom of each table there's a big red warning recommending against summing the data, as this may result in double counting, which at first I obviously ignored. Basically the dataset's information divides into three parts: the total waste collected by local authorities, and the household and commercial portions of that total. The household part is what councils collect from residents' dwellings, while the commercial part is the waste produced by industrial and commercial activities. Every category has many entries, among them the recycled and non-recycled portions, which are what we are interested in.

Knowing vaguely about some European recycling targets to be met shortly in Italy, what I wanted to see was Birmingham's position in a ranking of recycling rates. Having both the total amount of waste collected and the amount recycled in tonnes, it is possible to calculate the recycling rate (total waste recycled / total waste) and sort the data from the smallest. To be honest this is redundant, as the rate is already calculated in table 2, the one I was considering, but it's good practice anyway. Doing this we can see that Birmingham sits in 6th position from the bottom, the smallest rate after Westminster and some authorities that are not part of the main island. It's a story! And it was, until I spoke with a member of the City Council.
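The calculation itself is a one-line ratio. A pandas sketch of the ranking, with file, sheet and column names that are mine rather than Defra's:

    import pandas as pd

    df = pd.read_excel("defra_waste.xlsx", sheet_name="Table 2")  # illustrative

    # Recycling rate = recycled tonnage / total tonnage collected.
    df["recycling_rate"] = df["recycled_tonnes"] / df["collected_tonnes"]

    # Rank from the smallest rate upwards.
    print(df.sort_values("recycling_rate")[["authority", "recycling_rate"]].head(10))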

The cold sweat: always read the dataset description! 

At this point it is worth saying why the data is so tricky. In table 1 we have about 380 local authorities, and in table 2 just 120. And in the first one, Birmingham was in 28th position! Why?

While I was thinking my story was irreparably flawed, I read the "notes for tables" section of the dataset, in which the reason is clearly explained: some authorities are described as collection authorities, while others are unitary and disposal authorities. Some of the collection authorities are nested inside other authorities like Russian dolls (matryoshka), which explains the big red warning. The solution? The same notes explain that it is enough to exclude the collection authorities, and indeed our Birmingham City Council came back to 6th position.
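In script form the fix is a single filter on the authority type, continuing the sketch above (the column name and label are again hypothetical):

    # Exclude the nested collection authorities to avoid double counting.
    deduped = df[df["authority_type"] != "Collection"]
    ranking = deduped.sort_values("recycling_rate")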


Always seek a response  

At this point, having already missed the Tuesday deadline for several reasons, I asked the Council for a quote and was warned by one of its spokespeople that considering the total waste, rather than just the household portion, was wrong.

In fact, almost all legislation on this topic considers only household waste collection, as it is collected directly by the Council, while commercial waste is largely dealt with by the companies themselves through private contractors. What the spreadsheet describes is just the small part the Council collects from these companies, amounting to a little less than 20% of the total, while the rest remains in the shade, as the City Council does not hold such information.

The story is broken

At this point the disappointment was tempered by the possibility of reshaping the story from "the second worst on the mainland outside London" to the more modest "the worst in the West Midlands", which was published anyway.

However, this experience has been invaluable, considering what I learned about reading and interpreting datasets and dealing with local authorities.

From Scraping to Mapping the local currencies: the Outwit Hub/Fusion Table combination (part 1)

What follows is a description of my personal experience of mapping the unmappable, through classic "trial and error" learning.

Let's start with the problem. Say we want to produce something about alternative currencies. In these hard times, as we watch the social catastrophe unfolding in countries like Greece, Spain and Italy, it is pretty attractive to ask whether there are alternatives to the money we use.

Let's say we ideally want a map giving an overview of the European landscape and some kind of trend chart showing how these currencies have evolved over time. As explained in a previous post, in which I played a bit with Tableau, we need a spreadsheet to work on.

From my perspective there are two solutions: the evergreen copy and paste, or a more professional scraper. While with the first I am not sure of getting results tidy enough for an Excel table, the second is more likely to lead to a well organized spreadsheet.

Scraping the link 

According to the book Scraping for Journalists by Paul Bradshaw, one of the best tools for this case, and one that can (sometimes) work without any programming skills, is Outwit Hub. It needs to be downloaded from this site.

First of all, I based my work on the database of the Complementary Currency Resource Center. The site is pretty tricky: when you click on one of its navigation links, the URL doesn't change. A scraper usually needs a specific URL and, since we want just a European overview of local currencies, it is necessary to look at the HTML of the site (right click and "inspect element"), hover over the code corresponding to the highlighted part, and select the "europe" link, which is the 4th <li> tag under the <ul class="sideMenu2"> one. There it is possible to find the real URL, which is

http://www.complementarycurrency.org/ccDatabase/le_systems_admin.php?s_le_regionId=7
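The same link can also be dug out programmatically. A minimal sketch with requests and BeautifulSoup, assuming the menu markup described above (the fourth <li> under <ul class="sideMenu2">); the entry-page address is illustrative:

    import requests
    from bs4 import BeautifulSoup

    base = "http://www.complementarycurrency.org/ccDatabase/"  # illustrative entry page
    soup = BeautifulSoup(requests.get(base).text, "html.parser")

    # The regional links are <li> items in the side menu; Europe is the fourth.
    menu = soup.find("ul", class_="sideMenu2")
    europe_url = menu.find_all("li")[3].find("a")["href"]
    print(europe_url)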

Now that we have the proper link we can work on it with Outwit Hub, simply inserting it into the URL field at the top of the software's window. Then it's enough to click on the "tables" option on the left, and that's it, isn't it?

No! 


Trial and error

No, because this way we get only part of the list we need. At the bottom of the page there are links to the rest (51-100, 101-108). So, what's the solution? Look at the HTML as before, and inside the <table width="99%" align="center"> tag we find a couple of different URLs. Looking at them, we notice that in the middle they contain 1&s and 2&s, so the first page must be the one with 0&s.


So we have to copy and paste these links into Outwit. I personally did this three times, exported the results to Excel, and then copied and pasted them together into a single list. But there must be a better and more professional solution; one candidate is sketched below.
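One scripted candidate: loop over the three page offsets and let pandas read the tables directly. The exact query string here is my reconstruction of the 0&s/1&s/2&s pattern noticed above, so treat it as a guess to check against the real URLs:

    import pandas as pd

    # The offset before "&s" selects the page: 0, 1, 2 (query layout assumed).
    template = ("http://www.complementarycurrency.org/ccDatabase/"
                "le_systems_admin.php?s_le_regionId=7&page={}&s")

    tables = [pd.read_html(template.format(n))[0] for n in (0, 1, 2)]
    full_list = pd.concat(tables, ignore_index=True)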

At this point it is worth eliminating the "stopped" entries, which, as the category's name suggests, no longer work as currencies, along with other useless columns, aiming to keep just the "community", "local exchange system" and "url" ones.

Here we notice a nice bit of trouble: in the local exchange system column, the name of the currency and its type sit in the same cell, separated only by a semicolon.

The solution? Easy, at least with my old-school version of Excel: Text to Columns. Here's the exact procedure explained by Microsoft staff.
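The pandas equivalent is a single split on the semicolon; the file name and the output column names are mine:

    import pandas as pd

    df = pd.read_excel("currencies.xlsx")  # illustrative: the exported Outwit list

    # Split "name; type" on the first semicolon into two columns.
    df[["currency_name", "currency_type"]] = (
        df["local exchange system"].str.split(";", n=1, expand=True)
    )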

Then I inserted a country column beside the community one. I had expected to use Tableau Public like the time before, but I wanted to change approach, so I opted for a new tool: Google Fusion Tables…

To be continued in part 2.

Enterprise Project pitch and description

As I mentioned in an earlier post, my enterprise project is about building up a freelance journalism business. My main goal is to cover issues related to the environment and sustainability, using, though not exclusively, investigative journalism techniques such as data analysis and visualization.

Below you can watch my pitch for this project.

As I explain in the video, I would focus on the market of English- or Italian-speaking media outlets based in Europe. I am particularly interested in the new wave of media start-ups, such as the Italian You-ng and Linkiesta, good examples of web-focused publications and among the first Italian media companies to be crowdfunded.

I am considering focusing on the European market because, even if it may seem too big a context, the niche, specific topic of the environment already narrows it, and many European publications write in English (for a small list see this website), enlarging my chances of getting commissions. This does mean, however, having a good understanding of individual countries' and cross-country issues, knowledge I don't have at the moment.

The idea of doing journalism outside a newsroom frequently crossed my mind during my experience as a local journalist, and I have always believed that the future of news would see journalists independent of any particular media company and free to work on individual stories or projects. But doing some research I stumbled on some problems with this kind of business, to which I had to find answers in order to measure its viability:

  1. Is it really economically sustainable? A lot of people are struggling to work as freelancers, complaining about generally low pay and the frequent absence of a regular income. However, if on one side it is well known that a freelancer's life isn't for old men, on the other, layoffs in journalism have increased in Europe and the US (for a better understanding have a look at this singular story), and part of that number turn freelance anyway.
  2. What if I get an interesting full-time job offer in the meantime? The good part is that if I get an interesting offer from a company, I can carry on freelancing for others. Otherwise I can work on other energy- and time-consuming projects and still freelance, without really giving up journalism.

The idea developed slowly, week by week. From a few vague ideas about possible projects, I decided to do something related directly to myself and to how I put into practice the skills I am acquiring this year. From this early stage of planning I have learned that being an entrepreneur sometimes means significantly changing your project according to its viability or the actual needs of the market.

The final project is unlikely to be the same one you had in mind when you started planning. Some of the tools provided to help me in this process have been really useful for getting into the specifics of parts of the project. The mind-map/business-canvas combination, in particular, helped me first to think outside the box, in terms of partners and revenue streams, making the map as wide as possible, and then to narrow every single part down into chunks for a more precise picture of the situation. This narrowing process may represent the essence of entrepreneurial activity, which should be innovative and forward-looking, but also able to keep every step on a sustainable and viable path.

The pitching tasks, although not especially welcome to me, revealed themselves to be useful moments in the process. My reluctance came from not feeling ready to show someone else something I was still planning, before it was finished. But it was precisely the feedback from the other fellows that helped me redirect the freelance activity towards media outlets, rather than building an overly miscellaneous revenue mix from different elements. This should prove more sustainable over a shorter period. Additionally, repeating the same presentation of your activity across several pitches can help you spot problems you hadn't focused on.

As I said in the video, my freelance activity needs to be split into several steps. While I am still uncertain how many, I have decided first to build an online portfolio as a professional showcase, plus a blog. The blog would be the platform on which I can experiment, practise, try new kinds of storytelling media and engage with a potential audience.

A next step would be organising a sort of workshop on environmental investigative reporting for journalists or journalism students, inviting some top journalists in the field to speak about the practice of this kind of journalism. It might cover topics like the indicators to watch, the major official and unofficial sources, and so on. A good occasion to learn and, most importantly, to enlarge my own network.

I recognise that, as a project, this may not seem really entrepreneurial in the commonly accepted sense of the word. As the entrepreneurial video-maker Adam Westbrook points out in this article, entrepreneurship is about wealth creation, while freelancers still hire out their time to someone else.

Nothing really stops me from innovating in the field at some point, and that may well come through the experience of the blog and the networking.

I'd really appreciate comments or feedback if you have any for me. If you are a freelancer and want to offer advice, you are very welcome; and if you think I am making the wrong decision, please stop me now.

Embedding with WordPress: a failure

While writing this post about my experience with Tableau Public, I ran into a problem: how can I embed something in a WordPress blog post? The most obvious answer that came to mind was simply to copy and paste the embed code into the HTML part of the post. Nothing appeared, except a "produced by Tableau Public" note at the bottom of the post.

Googling the problem, I found plenty of people with roughly the same issue. In this forum, for instance, someone suggests removing part of the embed code, while here the owner of the site decided to leave WordPress for Blogger. This is why, in that post, I decided to insert an image of the Tableau dashboard instead. The problem seems related to an "iframe" plugin, available only for the pro version of WordPress; I found nothing for the free online one.

According to W3Schools.com, an "iframe" tag is "an inline frame used to embed another document within the current HTML document".
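For illustration, a generic iframe embed looks like the snippet below; the address is a placeholder, not real Tableau embed code. Hosted WordPress.com strips tags like this from posts, which fits the behaviour described above.

    <!-- Generic iframe embed; the src is a placeholder. -->
    <iframe src="http://example.com/my-visualization"
            width="600" height="400"></iframe>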

Then I had a go with Blogger: I went into the HTML section, pasted the code and, guess what, it didn't work either. Researching on Google, I quickly found the solution in a Jerome Cukier blog post, which led me to a satisfying victory. Pity that it's on the platform I don't use.

Should I change web host as well?