The Panama Papers have become a worldwide phenomenon. Contents aside, the technical aspects of the related investigations, involving 2.6 terabytes of data -- 11.5 million documents! -- are intriguing. How did journalists manage, organize and analyze this huge amount of data?
The International Consortium of Investigative Journalists (ICIJ) established the technical foundations for the worldwide mass media investigation over the past year. The ICIJ also laid the technical groundwork for the "offshore leaks" involving international tax-shelter accounts in 2013, the "Lux Leaks" of Luxembourg's tax rulings in 2014 and the "Swiss Leaks" of cash shelters in Switzerland in 2015, and so has a lot of experience in data analytics for journalistic purposes. Mar Cabra works in the data and research unit at the ICIJ and knows the Panama Papers' technical challenges better than anyone else. I had the opportunity to speak with the Spanish journalist a few days ago.
Computerwoche: The Panama Papers involve the largest amount of data journalists have ever worked with in a single research project: 2.6 terabytes of e-mails, PDF documents, images and database records. How was it possible to analyze the data and make it searchable?
Mar Cabra: This is our fourth investigation based on a leak from the offshore world. We've learned over the years that at the beginning of an investigation like this we need to spend time understanding the data in front of us. So we spent a good couple of months understanding the data and its different formats to work out how we could process it. We used platforms from previous investigations, but improved them to cope with these amounts of data.
The first thing we knew was that we needed a platform to host all the documents. Unfortunately a third of the documents were images -- PDFs or TIFFs. So we had to set up a complex optical character recognition (OCR) processing chain to extract text from those documents. Then we indexed the documents and put them on a cloud platform that allowed us to search them from anywhere in the world.
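As a rough sketch of that extract-or-OCR-then-index chain: the `extract_text` and `run_ocr` functions below are hypothetical stand-ins for the real extraction and OCR tools, and the toy inverted index stands in for a real search engine.

```python
import re
from collections import defaultdict

def extract_text(doc):
    """Simulate direct text extraction; image-only documents yield nothing."""
    return doc["text"] if doc["kind"] == "text" else ""

def run_ocr(doc):
    """Stand-in for an OCR engine applied to scanned PDFs and TIFFs."""
    return doc["text"]  # pretend OCR recovered the text from the image

def index_documents(docs):
    """Build a simple inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc in docs:
        # Try direct extraction first; fall back to OCR for image files.
        text = extract_text(doc) or run_ocr(doc)
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc["id"])
    return index

def search(index, term):
    return sorted(index.get(term.lower(), set()))

docs = [
    {"id": 1, "kind": "text",  "text": "Shell company incorporated in Panama"},
    {"id": 2, "kind": "image", "text": "Shareholder register, Panama"},  # scanned TIFF
]
index = index_documents(docs)
print(search(index, "panama"))  # → [1, 2]
```

Once indexed this way, every document -- scanned or not -- answers the same keyword queries, which is what made the cloud search platform possible.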
At the same time we realized that we also had documents from the internal database of Mossack Fonseca which included more than 200,000 companies in tax havens in 21 jurisdictions. Therefore we also knew that we needed another tool to visualize the data. In that sense we actually decided to move that database into Neo4j and then feed the Neo4j database into Linkurio.us, a software that allowed us to visualize graphs very easily and see the connections between companies, beneficiaries, shareholders and all their addresses. Those were the two main platforms we had for the reporters to mine these 2.6 terabytes of information.
CW: Which parts of the work could be automated, which had to be done manually?
Cabra: Nothing was done automatically. I mean, processing 2.6 terabytes of information in 11.5 million files takes a long time. We had to spend a lot of resources improving the platforms we were using. One thing we have to keep in mind at ICIJ when using software to help us with an investigation is our very wide range of users.
On the one hand we have the journalists who are very good at their job, but are not good with technology. And on the other hand we have very tech-driven journalists, who know everything about encryption and computers and of whom some are even developers themselves. So with every tool, we need to cater to these audiences.
For the document search platform we needed something that just allows people to search in a search box -- like they do in Google -- but also allows more complex queries like searching for regular expressions and patterns like bank accounts, IDs, passports. Same thing with Linkurio.us and Neo4j.
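A pattern query of the kind described might look like this with Python's `re` module; the IBAN-like and passport-like patterns here are deliberately simplified illustrations, not the platform's actual expressions.

```python
import re

# Illustrative patterns only -- real bank-account and passport formats
# vary by country and are far more involved than this.
PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "passport": re.compile(r"\b[A-Z]{1,2}\d{6,8}\b"),
}

def pattern_search(docs, pattern_name):
    """Return (doc_id, match) pairs for a named pattern -- the kind of
    query a power user might run instead of a plain keyword search."""
    pat = PATTERNS[pattern_name]
    return [(doc_id, m.group()) for doc_id, text in docs.items()
            for m in pat.finditer(text)]

docs = {
    1: "Transfer to DE44500105175407324931 confirmed",
    2: "Beneficiary holds passport X1234567",
}
print(pattern_search(docs, "iban"))  # → [(1, 'DE44500105175407324931')]
```

The same search box thus serves both audiences: a plain term for the Google-style user, a named or raw pattern for the power user.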
The good thing about Linkurio.us is that it can visualize graph data very easily. Everybody can work with dots. So our journalists who aren't very techie can just click on a dot, and several other dots and connections appear. They find it very useful because it's very intuitive. However, Linkurio.us and Neo4j are integrated in such a way that the more advanced users can make queries in Cypher -- Neo4j's query language -- which look like "show me all the people connected to this person within two steps" or "show me all the people who are connected to more than twenty companies." So that was very important for us -- to set up the platform so that both types of journalists could work with it. We invested in one full-time programmer who spent a full year improving the document platform and processing the documents.
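The two example queries quoted above translate into Cypher roughly as follows. The `Person` and `Company` node labels, the relationship structure and the `name` property are assumptions about how such data might be modeled, not ICIJ's actual schema:

```cypher
// People connected to a given person within two steps
MATCH (p:Person {name: "Some Name"})-[*1..2]-(other:Person)
RETURN DISTINCT other.name;

// People connected to more than twenty companies
MATCH (p:Person)--(c:Company)
WITH p, count(DISTINCT c) AS companies
WHERE companies > 20
RETURN p.name, companies;
```

The variable-length pattern `[*1..2]` is what expresses "within two steps" -- a query that would take several joins in a relational database is one line in a graph database.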
CW: What was the biggest challenge in the process?
Cabra: The processing side. We had to set up a very complex chain that would take each document and check whether the text could be extracted directly. If it couldn't, the chain would send the document to OCR, and from there on to the index. We did this with parallel processing across 30 to 40 machines in the cloud. If we had had only one queue of documents, it would have taken us forever.
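The fan-out Cabra describes can be sketched with a local worker pool; the `process_document` stub and the pool size below stand in for the real extract/OCR/index step and the 30 to 40 cloud machines:

```python
from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def process_document(doc_id):
    """Stand-in for the extract-or-OCR-then-index step for one document."""
    return f"indexed:{doc_id}"

doc_ids = range(100)  # in the real project: millions of files

# Fan the work out across many workers instead of one serial queue.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_document, doc_ids))

print(results[:3])  # → ['indexed:0', 'indexed:1', 'indexed:2']
```

Because the documents are independent of one another, throughput scales almost linearly with the number of workers -- which is why a single queue "would have taken forever" while 30 to 40 machines finished the job.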
CW: You talked about improvements of the platforms. How did you improve them?
Cabra: Since there were so many documents in so many different formats, this investigation was something special for ICIJ. For example, some journalists wanted a feature that lets them feed the platform a list of names from their country and get back a list of those who appear in the documents. So we developed this feature, "batch searching": you put in a spreadsheet of names, and a couple of minutes later you get a result list out. We had to extend the tools with features that we didn't need in previous investigations.
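A minimal sketch of such a batch search, assuming a toy name index in place of the real document platform:

```python
import csv, io

def batch_search(index, names_csv):
    """Take a spreadsheet (CSV) of names and return those that appear
    in the indexed documents, with the matching document ids."""
    hits = {}
    for row in csv.reader(io.StringIO(names_csv)):
        name = row[0].strip()
        if name.lower() in index:
            hits[name] = sorted(index[name.lower()])
    return hits

# Toy index: lowercased name -> document ids (a real index would come
# from the document platform, not a hand-built dict).
index = {"jane doe": {17, 42}, "acme ltd": {3}}

spreadsheet = "Jane Doe\nJohn Roe\nAcme Ltd\n"
print(batch_search(index, spreadsheet))  # → {'Jane Doe': [17, 42], 'Acme Ltd': [3]}
```

The result narrows thousands of candidate names down to the handful actually present in the leak -- the journalist then reads only those documents.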
CW: How many people were and are connected to the Panama Papers project at ICIJ on the technical side?
Cabra: ICIJ has a very small staff of 12 people, a mixed team of programmers and journalists. Half of them -- six -- are in my team, the data and research unit. In this unit we are all involved in processing the data, but mainly there are three programmers who take care of the technical issues. One programmer focused on the unstructured data in the documents; another focused on the structured data, on Linkurio.us and Neo4j, and on the data analysis.
CW: What will be the next steps with the Panama Papers project from a technical point of view?
Cabra: We've made a big breakthrough on the stories. We exposed names in the public interest including names of hundreds of politicians in more than 50 countries. In early May we are going to release the names of more than 200,000 companies in tax havens that were incorporated by Mossack Fonseca. We are going to put that into the cloud on our website for everybody to use. Right now on ICIJ.org we have "offshore leaks" where we already put out the names of hundreds of thousands of offshore companies. We are going to add the Panama Papers data to that.
So in early May we will have this search engine on the website where anybody, from journalists to citizens to law enforcement to tax authorities, will be able to search these companies and the people behind them. We will also use our backend tools -- Neo4j and Linkurio.us -- to offer visualization. I believe this will be a big step because thousands of people will use this database.
CW: Why do you think that?
Cabra: Out of everything ICIJ has produced in recent years, our most successful product has been the offshore leaks database. Even today -- before the Panama Papers [are put online] -- the most viewed thing on our website is the offshore leaks database. I am pretty confident the database will be a big success, because now we have even more eyes and more people interested in this offshore data. Therefore I think people will use it a lot. I know for a fact that tax authorities used the offshore leaks data we released in June 2013: in South Korea, for example, they recovered millions of dollars in unpaid taxes just by using the data we released publicly. So I do foresee that a lot of people are going to be interested in the end.
This story, "Panama Papers: Soon searchable by everyone thanks to the cloud" was originally published by Computerwoche (Germany).