USC/JPL scientist made big data stories possible on The Panama Papers

Chris Mattmann co-invented the software used to extract data that showed how some of the world’s richest people hide assets

May 09, 2016 Tarika Lall

Chris Mattmann is a principal data scientist and chief architect in the Instrument and Data Systems section at the Jet Propulsion Laboratory. Based in Pasadena, he is also director of the Information Retrieval and Data Science group and an adjunct professor at the USC Viterbi School of Engineering. As a PhD student at USC, Mattmann co-invented the Apache Tika software used to extract data from the Panama Papers, bringing to light ways in which wealthy people, including global leaders, have hidden their assets.

Data scientist Chris Mattmann teaches in the USC Department of Computer Science. (Photo/Dan Goods)

After getting his bachelor’s, master’s and PhD from USC, Mattmann now teaches classes in its Department of Computer Science. The great thing about the university, he said, is that computer science students can contribute to developing software and work on projects with real-world implications.

How did Apache Tika come about?

I developed Tika with Jérôme Charron while I was a PhD student at USC, taking [the course] “Search Engines” with Professor Ellis Horowitz in the mid-2000s. I initially got involved with a project called Apache Nutch, which set out to democratize online searches, so everyone could see the ranking algorithms demonstrating why they were seeing particular search results. I learned that a key component of a search engine is to identify file types [code, video, image, etc.] and to extract text and metadata [data about data — who created it, when, where, what the subject matter is, what language it’s in, etc.].

So the idea for Tika came about when Jérôme and I realized that many different software applications need to identify file types and extract data. We created it on the Apache platform so that it could be freely available to anyone who wanted to use it. Tika is a “digital Babel fish”: You can throw any file at it, and Tika will give you an understanding of the content inside. It is now used by companies like Adobe and FICO to manage their data.

Has this software evolved since it was first created?

With investment from organizations like DARPA and NASA, Tika has evolved from software that provides basic data extraction to software that can extract more complex data about people, places, dates and times from any file, including images and videos. From a Word document, it can now identify people and their locations, down to the latitude and longitude. On the language side, machine translation is also being integrated into the platform: Not only will Tika determine which language the data is in, but it will also automatically translate it into your preferred language. These evolutions involve machine learning and artificial intelligence and will continue to be improved upon.

What are the intended uses of Tika?

The intended use of Tika is as a digital Babel fish. Because data is heterogeneous and has a lot of variety, Tika is a platform to simplify the data and get it into a common vocabulary. Many industries waste a lot of time sifting information from data, when they should be making money for their customers instead. Tika enables this process to become much more efficient. Tika is intended for use in search engines, content management systems, data forensics, scientific environments, finance systems and law enforcement — all systems that are challenged by the volume of data they manage.

Are there any privacy concerns that arise regarding software like Tika, particularly in light of the Panama Papers?

Tika itself does not violate privacy, but that might happen in the surrounding envelope, because Tika does make data analysis easier. It is important to remember that Tika on its own does not collect people’s data, but because it is open-source software, it can be used for questionable purposes, many of which I’m not a fan of, and we can’t control that. Easier access to information enables people to do good things with it and focus on the right things, but it can also facilitate misuse.

In the case of the Panama Papers, if the data were never made available, it wouldn’t have mattered that Tika was available. Nevertheless, independent of the actual cybersecurity leak, the Panama Papers are internationally significant and we should care about what people are doing to get around taxes, particularly when those people are our world leaders. It’s important that the media not ignore that.

What do you think some of the future uses will be for metadata analysis software and Tika specifically?

Metadata analysis software is still a very untapped resource. Such software can reveal patterns in data that have great potential for new discoveries in science and medicine and advancements in law enforcement strategies. For example, terrorists will not be able to operate and collaborate as openly as they do today because pictures and information posted online will instantly be detected by software like Tika. Relationships in science might surface among images taken by the same instrument, relationships that can be found only by looking at the metadata and not by processing the images. On an everyday level, software like Tika allows for better and more efficient search engines and content management systems.