Unleashing Big Data’s Potential for journalism, economy and research

The extent to which Text and Data Mining is revolutionising the way both public and private sector researchers work has yet to be fully realised by EU policymakers, argue data mining experts.

Text and Data Mining (TDM) lets us make sense of the vast amount of data that is out there. Understanding this data is critical to advancing our knowledge in climate change research, breaking corruption scandals in the press, discovering breakthrough medical treatments and training computers to improve customers’ experience online.

However, there are concerns that the current reform of EU copyright rules could limit who gets to use TDM and how they get to use it.

To build strong scientific datasets or train Artificial Intelligence (AI) algorithms, researchers need to gather data from a broad range of sources, including scientific publications to which they have acquired lawful access through licensing agreements and data that is publicly available on the internet (and not behind a paywall).

We need to make sure that our right to read this data includes the right to understand and analyse it.

We want EU policymakers to understand that TDM is not about copying or re-using creative works without paying. TDM is about understanding the works we have legally accessed to identify patterns, facts, and correlations locked within these works, such as the tone of scientific or journalistic articles or how many times specific words are used.

TDM does not harm rightsholders. In fact, the more data analytics that take place, the more TDM users will request lawful access, increasing the demand for subscriptions to articles.

Many research projects are public-private partnerships. In fact, the European Commission’s Horizon 2020 programme – the largest research programme globally – envisages collaboration between public and private entities as they take “great ideas from the lab to the market”.

This programme usually requires that approved projects have another source of funding, typically private funding. If the private partner of a Horizon 2020 funded consortium cannot use TDM on the same basis as a public partner, this would greatly restrict the ability to fund AI projects at a time when such research is a critical element of growing the EU’s digital economy.

Journalists also do not qualify as non-commercial beneficiaries, yet they too need tools to make sense of the growing volume of information at their disposal.

TDM technologies have helped uncover crucial stories with significant impact on society and democracy, such as the Panama Papers.

With the growing threat of fake news, which we know can be best tackled by algorithms and data analytics tools, we should not undermine the quality of journalism in Europe by raising unjustified copyright barriers.

Being able to verify the data used in AI is critical to understanding and addressing errors, bias, and needed improvements. Building adequate datasets is the first step in conducting a TDM-based research project and this can take several weeks, sometimes months.

Access to datasets once the research is completed is necessary to verify any findings. But to do that, we need to be able to safely store incidental copies of the datasets on secure servers.

However, the copyright reform as it stands today does not allow this. Without any backup information that would allow the public to verify research conducted in Europe, we risk losing citizens’ trust in science.

Like the European Commission, we have big ambitions for Europe when it comes to Artificial Intelligence. We also want Europe to lead the global AI agenda and adopt a future-proof copyright reform that will unleash big data’s potential for journalism, the economy and research in Europe.

To achieve this ambition, we need an equally ambitious TDM exception.