
By Dennis Nguyen, Erik Hekman, and Koen van Turnhout

Introduction

News media reporting can influence public perception of big data and artificial intelligence (AI): it informs audiences about innovations, opportunities, conflicts and challenges associated with datafication and automation. For example, the news covers how data-driven technologies empower consumers with novel devices (The Guardian 2020) or drive progress in medical treatments (Wired 2020a). News media also report on data scandals such as Cambridge Analytica (New York Times 2020) or discrimination in algorithmic systems (Vice 2020). The latter stories exemplify threats to an inclusive and fair digital society, such as privacy intrusion, exclusion, and discrimination. These data risks are related to critical questions about data collection, data analysis, data ownership, data regulation, accountability, power, and control, or in short ‘data justice’ (Dencik et al. 2019; Taylor 2017). From a normative perspective, news media play an important role as public informers about such issues of societal relevance.

As big data and AI transform society in virtually all its domains, an informed citizenry should ideally have a basic understanding of them (Carmi et al. 2020). Critically covering the digital transformation then becomes part of news media’s “democratic mission”, at least in liberal-democratic political systems. News media can shape perceptions, understanding, and attitudes toward technology (Groves et al. 2015) through agenda-setting and news framing.

Research Motivation and Research Questions

The present study explores news frames for big data and AI and assesses to what extent data justice emerges as a discernible frame and indicator for critical tech news coverage. News framing studies reveal how issue-specific discourses address related topics and establish interpretative frameworks through problem descriptions, causations, and evaluations (Nguyen 2017; Kohring and Matthes 2008; Entman 1993). Analysing news media discourses provides an empirical basis for normative criticism on discursive practices (e.g. biases, reductionism), describing cultural formations, exploring social configurations, exposing hierarchies, and assessing how “meaning” of issues is assigned and contested. Lindgren (2020: 117) argues: ‘[…]semiotic and linguistic practices form part of the construction of knowledge about social reality, and become component parts in defining views on the world[…]’. The news framing approach connects here and allows for analysing societal transformations, frictions, and conflicts triggered by datafication and automation. 

This is still a nascent research area. The few studies available investigate how news media describe the complexities of datafication in words and visuals (Paganoni 2019; Pentzold 2018; Pentzold and Fischer 2017). Research is limited on how big data and AI are subject to news framing with respect to their different thematic contexts, uses, and risks. This is the focus of the present study: it charts news frames for datafication and automation over time, with a focus on data risks and ethical issues as indicators for public discussions of data justice. Michael and Lupton (2015) see datafication and automated decision-making processes as inherently political and connected to the question of ‘[w]ho are regarded as trustworthy sources of big data and credible commentators upon, and critics of, them?’ (Michael and Lupton 2015: 105). This puts the spotlight on news media coverage of data-driven technologies, where debates over definitions, meaning, relevance, and impacts happen. The critical analysis may reveal potential blind spots and imbalances in tech reporting. Insights from frame analyses can also raise regulators’ and tech creators’ awareness of how technologies make an impact on society. Finally, it probes what and how much lay audiences can read about technological trends, including challenges for inclusion and fairness.

The empirical part connects an exploratory content analysis with a quantitative manual content analysis and eventually an automated content analysis (ACA, utilising Named Entity Recognition) to trace news frames in a large volume of texts (13,465 in total). This considerably expands the scope of the analysis compared to previous studies. The purpose of the qualitative step is to identify relevant categories and indicators; the quantitative manual content analysis tests whether observations from the qualitative reading are intersubjectively reproducible. Insights from both inform the ACA to provide an overview of recurring topics and themes that indicate dominant news frames. The automated approach overcomes limitations of small samples, reduces resource-intensive manual coding, and circumvents effects of coder fatigue. The sample includes the full news coverage of big data and AI in four globally renowned British and American news media outlets: The New York Times and The Guardian as representatives of mainstream news outlets and Gizmodo and Wired as tech-focused news websites. Two research questions guide the study:

Research question 1: What frames do news articles on Big Data and AI convey in mainstream and special interest news outlets, and is data justice one of them?

Research question 2: What individual and collective risks (ethical challenges) do news articles on Big Data and AI point to?

The study addresses empirical questions about public understanding of the digital transformation and associated challenges: ‘What counts as big data? How do they emerge? Where are they being produced? By whom […] and by what […]?’ (Michael and Lupton 2015: 111). An overview of dominant news frames offers a foundation for critical analysis of public discourses on the societal impact of datafication and automation, especially in respect to data justice as a subject on public agendas.

Defining Data Justice

Dencik et al. (2019: 874) define data justice as the critical discussion of ‘democratic procedures, the entrenchment and introduction of inequalities, discrimination and exclusion of certain groups, deteriorating working conditions, or the dehumanisation of decision-making and interaction around sensitive issues’. This connects to ‘questions of power, politics, inclusion and interests, as well as established notions of ethics, autonomy, trust, accountability, governance and citizenship’ (Dencik et al. 2019: 874). As big data and AI “run” on large volumes of data, both find themselves inevitably at the centre of these critical inquiries. Heeks and Shekhar (2019) propose to distinguish between five dimensions of data justice that specify contexts in which fairness and accountability in datafication processes become contestable: procedural, i.e. in the modes of data handling; instrumental, i.e. in the objectives of data collection and analysis; rights-based, i.e. regarding data regulation; structural, i.e. in terms of power relations in digital society; and distributive as ‘an overarching dimension relating to the (in)equality of data-related outcomes that can be applied to each of the other dimensions of data justice’ (Heeks and Shekhar 2019: 995).

Taylor (2017) argues that data justice is a multidimensional “wicked problem” similar in complexity to e.g. poverty or climate change, that needs to be addressed ‘in a systemic way in order to deal with their interdependencies’ (Taylor 2017: 6). It affects individuals and groups (Taylor 2017: 8), with the weakest in society often suffering most. She pleads for an interdisciplinary approach to develop solutions to concrete social problems that arise from data injustice. The ubiquity and increasing necessity of data-driven systems may cause individuals to simply surrender and refrain from political resistance (Taylor 2017: 4, citing Turrow et al. 2015).

Critical research needs to facilitate concrete action that addresses power imbalances and promotes the implementation of policies that focus on inclusion. Taylor identifies a clear gap in the academic discourse: ‘[…]research and praxis on the ways in which datafication can serve citizenship, freedom and social justice are minimal in comparison to corporations and states’ ability to use data to intervene and influence’ (Taylor 2017: 2). It is important to acknowledge that there is a strong interdependence between the ‘public-private interface’ in datafication trends and to critically scrutinise it regarding the ‘corresponding implications for transparency and accountability’ (Taylor 2017: 4).

She further differentiates between three dimensions of data justice (Taylor 2017: 9-10): visibility, i.e. access to representation and informational privacy; engagement with technology, i.e. sharing data benefits and autonomy in technology choices; and non-discrimination, i.e. the ability to challenge bias and prevent discrimination. Visibility concerns the datafication of subjects, representation through data, and the invasion of individual privacy. One example is the debate on user rights on digital platforms. It also concerns the right to be included in datasets (Heeks and Renken 2018) in a fair and transparent way, i.e. to be considered as an equal part of a data-driven process that ideally offers benefits to the individuals or groups in focus. Lack of access to digital technology is one simple factor that can lead to exclusion. Engagement with technology addresses the structures and hierarchies of the data economy and the degree of individual autonomy in the usage of data-driven technology. Lastly, non-discrimination focuses on the reproduction of existing or creation of new forms of social exclusion and discrimination with data-driven systems; examples are algorithms that reproduce e.g. racist or sexist tendencies.

Using this map for data justice issues, the next step is to define concrete categories and indicators in relevant news content. This includes signals related to privacy violations, algorithmic biases, (mis-)categorisations, discrimination, exclusion, unfairness, exploitation etc. If news outlets address different data risks and ethical issues, a data justice frame should become recognisable through the content analysis.

Method and Data

Sampling and Data collection

The combination of manual and automated content analyses charted tech reporting in two internationally renowned news media outlets and two tech magazines: The New York Times (NYT), The Guardian, Gizmodo, and Wired. The four outlets were selected based on their international scope and relevance for tech news coverage. The researchers wrote a Python script that automatically collected all relevant news articles for the two keywords big data and artificial intelligence from the four news sites. This yielded 13,465 articles in total (table 1), published between September 1993 and April 2020. The data set was scanned for articles that were listed for both keywords on a given news site. While some of these doubles exist, their overall frequency is negligible (less than 2.3% in total, table 2).

Table 1 Total Amount of News Articles per Keyword and Outlets (1993-2020)

Outlet | Big Data (BD) | Artificial Intelligence (AI)
Guardian | 307 | 1,049
New York Times | 1,229 | 3,369
Gizmodo | 206 | 1,715
Wired | 717 | 4,873
Total | N1 = 2,459 | N2 = 11,006

Table 2 Number of “Doubles” (1993-2020)

Outlet | No. of “Doubles” | Percentage of N
Guardian | 14 | 1.03%
New York Times | 144 | 3.10%
Gizmodo | 14 | 0.73%
Wired | 132 | 2.40%
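
The overall share of doubles can be re-derived from the counts in tables 1 and 2. The following Python sketch is only an arithmetic check, not the researchers' script. It assumes that the total share is the sum of all doubles divided by the sum of all collected articles, and its variable names are illustrative.

```python
# Counts copied from tables 1 and 2; variable names are illustrative only.
articles = {  # outlet: (big data articles, AI articles)
    "Guardian": (307, 1049),
    "New York Times": (1229, 3369),
    "Gizmodo": (206, 1715),
    "Wired": (717, 4873),
}
doubles = {"Guardian": 14, "New York Times": 144, "Gizmodo": 14, "Wired": 132}

total_articles = sum(bd + ai for bd, ai in articles.values())
total_doubles = sum(doubles.values())
share_of_doubles = 100 * total_doubles / total_articles

print(total_articles)              # 13465
print(total_doubles)               # 304
print(round(share_of_doubles, 2))  # 2.26
```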

The Combination of Manual Content Analyses and ACA

The main goal of the complementary content analyses was to identify frames “bottom-up”: from indicators or signals for specific categories/variables that in turn represent frame elements, which could be clustered to identify whole frames. Before this process was automated, two manual content analyses served for developing the dictionary for the ACA. First, an exploratory, semi-open, qualitative content analysis identified frame elements and indicators in 180 randomly selected articles. The inductive approach was semi-open, since the distinction between general frame elements (Nguyen 2017; Entman 1993; Kohring & Matthes 2008) provided orientation during the exploratory phase. These were: problem/issue definition (topics), causal attributions and risks, evaluations, and recommendations. For example, under problem/issue definition (topics), the researchers listed different business developments, political issues, technological innovations, and use cases. This first analysis resulted in over 320 codes or “signals” for the four general frame elements. 

The findings informed the development of a codebook for a quantitative manual content analysis. The 320 codes were clustered into 28 variables that operationalise the four general frame elements. Different signals from the exploratory phase such as “entrepreneurship”, “business opportunities”, “start-ups”, and “acquisition” were clustered into the variable “business”; and “medical”, “diagnosis”, “cancer research” etc. were clustered into “health”. “Business” and “health” are two different problem/issue definitions (topics). The codes were manually clustered and directly assigned as indicators for the respective main variable.

This reduced the complexity of the first iteration of the coding scheme and provided a foundation for the dictionary-based automated content analysis. The manual quantitative content analysis (N=50) had two purposes: 1) To check whether the identified variables allocated to a frame element were intersubjectively reproducible. This was considered necessary to increase validity and reliability before further translating the manual into a fully automated approach. 2) To collect more signals that represent a particular variable, since the researchers recorded what phrases indicated the presence of a given variable. This expanded the list of different signals that were already identified during the exploratory phase. Intercoder-reliability for two researchers reached KALPHA scores ranging from 0.70 to 0.80 for the 28 variables (Hayes & Krippendorff 2007). 

Figure 1: Method

The insights from both analyses were converted into a dictionary for the automated content analysis (Günther & Quandt 2015) of the full text volume with Named Entity Recognition (NER, Evans 2014). The identified keywords and -phrases were assigned as signals for specific variables that in turn indicated the presence of frame elements. NER is a method originally developed for automatically identifying persons and organisations in texts. It works with dictionaries that are “customisable” to register specific words/phrases that indicate a concept of interest. The coding unit was the whole individual article. However, the number of variables was reduced from 28 to 17 that represent three frame elements: domains (issues/topics), risks, and ethics.

The reason for this reduction was the ambiguity of certain variables and frame elements. It proved difficult to reliably identify causal interpretations and specific recommendations with the dictionary approach. The process included several tests and words deemed too ambiguous were dropped to reduce false positives. For example, “track” and “tracking” may indicate “surveillance” but can also describe a process of recording data in general. The final dictionary included over 600 keywords associated with the 17 variables that in turn represent three different frame elements (table 3). Framing analysis as a ‘theory-driven text analysis’ (Watanabe & Zhou 2020: 1) thus informed the design of the dictionary for the ACA. 

Table 3: Frame Elements, Variables, and Signals

Frame Element | Variable | Examples for Signals
Domain | Business | Business, economics, trade
Domain | Governance | Govern, regulation, law
Domain | Innovation | Innovation, research
Domain | Politics | Political parties, elections
Domain | Military | Militaristic, army
Domain | Education | Schools, students, teachers
Domain | Culture | Museums, arts
Domain | Sports | Football, basketball
Domain | Health | Healthcare, cancer
Domain | Finance | Investment, banking
Domain | Consumerism | Shopping, traveling, retail
Domain | Logistics & Transportation | Self-driving cars, shipping
Domain | Industries | Manufacturing, oil, gas
Domain | Technology Infrastructure | 5G, servers, sensors
Domain | Technology Solutions | Facial recognition, NLP
Data Risks | Cybercrime & Cyberwar | Cyber-attack, cyber-crime
Data Risks | Information Disorder | Fake news, misinformation
Data Risks | Surveillance | Privacy, privacy intrusion
Data Risks | Data Bias | Discrimination, racism, sexism
Ethics | Ethical Considerations | Ethical, unethical
Ethics | Accountability/Responsibility | Accountable, responsible
Ethics | Justice/Fairness | Inclusion, justice
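
The dictionary logic described above can be illustrated with a minimal Python sketch. It is not the researchers' NER pipeline: it assumes plain case-insensitive whole-word matching, uses only a handful of the signals listed in table 3, and all function and variable names are hypothetical. The coding unit is the whole article, and each variable is coded as binary (occurs/does not occur).

```python
import re

# Toy dictionary with a few signals from table 3 (not the full list of 600+ keywords).
DICTIONARY = {
    "business": ["business", "economics", "trade"],
    "health": ["healthcare", "cancer"],
    "surveillance": ["privacy", "privacy intrusion"],
    "data_bias": ["discrimination", "racism", "sexism"],
    "ethics": ["ethical", "unethical"],
}

def compile_patterns(dictionary):
    """Build one case-insensitive whole-word pattern per variable."""
    patterns = {}
    for variable, signals in dictionary.items():
        alternatives = "|".join(re.escape(s) for s in signals)
        patterns[variable] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns

def code_article(text, patterns):
    """Return binary codes (1 = at least one signal occurs) for a whole article."""
    return {variable: int(bool(p.search(text))) for variable, p in patterns.items()}

patterns = compile_patterns(DICTIONARY)
article = "Facial recognition raises privacy concerns and may amplify racism."
codes = code_article(article, patterns)
print(codes)
# {'business': 0, 'health': 0, 'surveillance': 1, 'data_bias': 1, 'ethics': 0}
```

A matcher of this kind registers a signal wherever the word occurs, regardless of context, which is why ambiguous keywords inflate false positives.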

A Python script clustered the articles based on the identified signals using hierarchical clustering. The process was iterative and tested different solutions for both binary (occurs/does not occur) data and total frequencies. Binary coded texts yielded clusters that were easy to interpret based on mean values per cluster (Kohring & Matthes 2008) for a four-cluster solution.
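
The clustering step can be sketched in a few lines of plain Python. The source does not state which distance measure or linkage criterion was used, so the Hamming distance and average linkage below are assumptions, as are the toy data and all names. A production script would more likely rely on a library implementation. The sketch merges the two closest clusters until the requested number remains and then reports the mean of each binary variable per cluster, which is how the clusters are interpreted.

```python
from itertools import combinations

def hamming(a, b):
    """Share of variables on which two binary article vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    return sum(hamming(data[i], data[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

def agglomerate(data, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], data),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters

def cluster_means(cluster, data):
    """Mean of each binary variable within a cluster (share of articles with the signal)."""
    return [sum(data[i][v] for i in cluster) / len(cluster) for v in range(len(data[0]))]

# Hypothetical variables per article: business, consumerism, surveillance, governance
articles = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
clusters = agglomerate(articles, n_clusters=2)
for members in clusters:
    print(members, cluster_means(members, articles))
```

On this toy input, the first three articles form a business-dominated cluster and the last three a cluster dominated by surveillance and governance, which mirrors how high mean values per variable are read as a frame.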

To assess the validity of the ACA, one researcher coded 80 randomly selected articles and compared the findings with the machine coding. Human-computer agreement for the sample stands at 90% for the categorisation of articles: in most cases the ACA/NER script correctly identified whether the articles covered e.g. politics, consumerism, risks etc. and the number of false positives was small (10%); the dictionary-approach worked as expected but it also proved too sensitive: it would spot signals in less relevant contexts (e.g. the use of the word “politics” in an article that is not focusing on politics). KALPHA scores for the individual variables range from 0.75 to 0.97, with an average of 0.83 for the human-computer comparison. While the human coder confirmed correct general classification of articles in most cases, he did not spot all relevant signals as accurately as the algorithm, which explains these differences on the level of variables. 
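
The two agreement measures mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: a single binary variable, two coders (human and machine), nominal codes, and no missing values. The function names and the toy codes are hypothetical, and the source does not say which software computed the reported KALPHA scores.

```python
from collections import Counter

def percent_agreement(human, machine):
    """Share of articles on which both coders assign the same code."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def krippendorff_alpha_nominal(human, machine):
    """Krippendorff's alpha for two coders, nominal codes, no missing values."""
    coincidences = Counter()
    for h, m in zip(human, machine):
        # Each article contributes both ordered pairs of its two codes.
        coincidences[(h, m)] += 1
        coincidences[(m, h)] += 1
    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())  # number of pairable values (2 x number of articles)
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # alpha is undefined when only one category occurs
    return 1 - observed / expected

# Toy codes for one variable across four articles (1 = variable present).
human = [1, 1, 0, 0]
machine = [1, 0, 0, 0]
print(percent_agreement(human, machine))                     # 0.75
print(round(krippendorff_alpha_nominal(human, machine), 3))  # 0.533
```

The example shows why the two figures can diverge: alpha corrects the raw agreement for the agreement expected by chance given how often each code occurs.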

Results

Coverage of Big Data and AI over Time

Articles that mention big data or AI first occurred in the 1990s, with interest rapidly rising in the 2010s (figure 2). Most articles were published between 2015 and 2020, and the numbers for AI are almost 4.5 times larger than for big data (table 1). The results imply convergence across outlets regarding each buzzword’s newsworthiness: big data peaks from 2012 to 2015, while interest in AI increased considerably as of 2015.

Figure 2: Volume of Articles on “BD” 1995-2019 and “AI” 1993-2019

It seems focus shifted in this period, which was marked by milestones in AI development (e.g. AlphaGo, self-driving cars) and polarising discussions about the impact of automation on labour markets and human agency. The US election and Cambridge Analytica scandal may explain the slight increase of BD news coverage in 2017-2018.

The overall number of articles on datafication and automation is probably still comparatively low on mainstream news sites. For example, some estimates put the number of daily articles published by the NYT at 230 items, which would amount to ca. 84,000 per year (The Atlantic 2016). While the NYT has the second largest number of articles in the sample overall and increased its output of AI-focused stories over time (from 182 in 2015 to 644 in 2019), big data and AI have been mostly niche topics if judged by sheer frequency of occurrence. For comparison, the NYT published over 27,000 articles on “climate change” between 1990 and 2020. However, an upwards trend is discernible. Especially AI, potentially encapsulating big data, is an emerging topic on news agendas and media attention steadily increased in recent years. While overall numbers on daily publications for the tech outlets were not available, Wired stands out here with an increase of AI articles from 536 in 2018 to 1,523 in 2019 and 1,111 in the first 4 months of 2020.

Frames in BD and AI Coverage

News frames for BD and AI look similar across the four different outlets (tables 2 and 3). Most cover the digital transformation with focus on technological trends, economic potentials, various use-cases but also data risks and questions of governance. All these frames include a diversity of domains ranging from retail, health, education, research, industries to politics and militaristic contexts. However, business, consumerism, and finance are the most frequent domains in most clusters.

Table 2: Frames in BD News Coverage 1995-2020

Outlet (N) | Frame (Cluster) | % of N
Guardian (307) | BD Risks, Governance and Politics (Data Justice) | 28.0%
Guardian (307) | Consumer Applications of BD | 26.7%
Guardian (307) | Economic Prospects of BD | 20.8%
Guardian (307) | BD in Health, Research and Education | 24.4%
NYT (1,229) | BD for End-User Services and Health | 23.4%
NYT (1,229) | Commercial Applications of BD | 33.4%
NYT (1,229) | Governance and BD Risks (Data Justice) | 18.7%
NYT (1,229) | Economic Prospects of BD | 24.5%
Gizmodo (206) | BD Innovation | 18.0%
Gizmodo (206) | BD Risks (Data Justice) | 26.7%
Gizmodo (206) | BD, Logistics, and Finance | 35.0%
Gizmodo (206) | Economic Prospects of BD | 20.4%
Wired (717) | Tech Infrastructure for BD | 33.6%
Wired (717) | Economic Prospects of BD | 28.3%
Wired (717) | Governance and BD Risks (Data Justice) | 19.9%
Wired (717) | BD’s Potential and Risks | 18.1%

The difference between consumerism and business opportunities/economic prospects concerns the level of technological impact: some articles report on user-centric services and products, while others portray big data and AI as trends that reshape entrepreneurial activity and the economy. Examples for business opportunities and economic trends are stories such as “Robot Staff and Emoji Menus: How Hospitality went Hi-Tech” (Guardian 2016a) or “The Eight Technologies Every Entrepreneur Should Know About” (Guardian 2016b), while articles such as “Zoom Calls Are Less Boring With a Digital Twin of Yourself” (Gizmodo 2020) present stories on consumerism. Other clusters are dominated by articles that cover innovation (e.g., advances in tech development) and technological infrastructures (e.g., sensors, networks). Examples are “Facebook’s AI Is Now Automatically Writing Photo Captions” (Wired 2016) and “The Self-Driving Startup Teaching Cars to Talk” (Wired 2018).

Table 3: Frames in AI News Coverage 1993-2020

Outlet (N) | Frame (Cluster) | % of N
Guardian (1,049) | Economic Prospects of AI | 25.5%
Guardian (1,049) | Governance, BD Risks, and Ethics (Data Justice) | 21.1%
Guardian (1,049) | AI Solutions | 25.2%
Guardian (1,049) | AI Innovation and Consumerism | 28.3%
NYT (3,369) | Governance and Data Risks (Data Justice) | 20.7%
NYT (3,369) | AI Applications in Society | 30.5%
NYT (3,369) | AI Innovations and Consumerism | 27.2%
NYT (3,369) | Economic Prospects of AI | 21.5%
Gizmodo (1,715) | AI and Data Risks (Data Justice) | 20.0%
Gizmodo (1,715) | AI Applications in Society | 32.4%
Gizmodo (1,715) | Tech Infrastructure and AI Solutions | 20.1%
Gizmodo (1,715) | AI and Consumerism | 24.1%
Wired (4,873) | Governance and Data Risks (Data Justice) | 19.4%
Wired (4,873) | AI Applications in Society | 34.1%
Wired (4,873) | Consumerism & Business Opportunities | 20.6%
Wired (4,873) | Tech Solutions | 25.9%

All outlets address data risks and governance in a noticeable number of articles that form distinct clusters. In general, pitfalls, threats, and imbalances are raised across domains and often connect to references to regulation. Since most of these clusters consist of articles that mention individual and collective risks and raise questions of governance, they could count as different versions of a data justice frame.

The researchers identified four broader risk categories in the manual content analyses that in combination formed the frame element “data risks” for the ACA: 1) cybercrime & cyberwar, which concerns issues of cybersecurity such as hacking, DDoS attacks etc. for either criminal or terroristic/espionage purposes; 2) information disorder, i.e. misinformation, disinformation, and “fake news”; 3) surveillance, which covers forms of privacy intrusion; 4) data bias, which includes discrimination, exclusion, racism, sexism etc. Especially surveillance and data bias relate to data justice as proposed by Taylor (2017). Figures 4 and 5 show the results of the ACA that scanned the articles for signals of these data risks. More specifically, the bar charts illustrate how many of all articles per outlet include a reference to any of the four data risks (an article can mention several at once).

Figure 4: Data Risks in Big Data Articles

All types of data risks are to varying extents part of news reporting for both tech keywords. For big data, surveillance is the most frequently mentioned data risk across all news media: 26.9% (662 out of 2,459 articles) of the complete sample refer to the issue with some differences between outlets. On the Guardian almost 40% of articles that cover big data include references to surveillance/privacy issues, whereas on Wired this applies to 24%. For AI, surveillance occurs in 17.4% of all articles (1,915 out of 11,006) but the differences between outlets are less stark (figure 5). An example for articles that address surveillance is “Hey Siri! Stop recording and sharing my private conversations” (Guardian 2019). Depending on the domain, surveillance concerns either commercial and/or governmental forms of privacy intrusion and data collection. These often connect to threats of exploitation, manipulation or unfair and unlawful treatment by corporations and governmental organisations. 

Data bias issues occur in ca. 16% of all articles for big data and AI (395 and 1,781 articles, respectively); there is virtually no quantitative difference between the two keywords in the full sample. Differences between outlets are marginal in the case of big data but are more noticeable for AI coverage (figure 5). For example, data bias supersedes surveillance in the Guardian’s reporting with 25.4% of its AI articles raising related problems/challenges, while Gizmodo does so in 10.6%. Overall, the two mainstream news media appear to dedicate more space to the issue than the tech outlets in the AI context. Automated systems often focus on classification/categorisation, which can lead to stereotyping, discrimination, and exclusion. An example is the article “Who’s to Blame When Algorithms Discriminate?” (NYT 2019). The main threats are the expansion and intensification of e.g., racist or sexist tendencies in specific domains and the creation of new forms of discrimination and exclusion.

Security challenges related to cybercrime & cyberwar such as hacking, phishing, theft etc. are another noticeable risk. These threats either affect individuals, organisations, or are portrayed as actions against whole nation-states. An example headline is “China’s Hacking Spree Will Have a Decades-Long Fallout” (Wired 2020b). This risk category occurs to the exact same extent of 14.9% for both big data and AI articles in the full sample. Again, differences between news outlets are marginal for big data (figure 4) but are more pronounced for AI; for example, the Guardian addresses cyberwar & cyber security in less than 10% of its articles, while Wired makes references in 16.7%.

Information disorder occurs in far lower frequency when compared to the other data risks. It emerged as part of tech reporting around 2016 and the number of articles mentioning the issue steadily increased since. It is still a small topic with only 826 articles (7.5%) in the whole sample for AI and 115 articles (4.7%) for BD. 
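
The reported shares follow directly from the article counts and the sample sizes in table 1. The short Python sketch below recomputes them as a plausibility check. It uses only counts stated in the text, and the helper name `share` is illustrative. Because an article can mention several risks at once, the shares are not expected to add up to the overall share of articles that mention any risk.

```python
# Sample sizes from table 1 and article counts reported in the text.
sample_sizes = {"big data": 2459, "AI": 11006}
risk_counts = {
    "surveillance": {"big data": 662, "AI": 1915},
    "data bias": {"big data": 395, "AI": 1781},
    "information disorder": {"big data": 115, "AI": 826},
}

def share(count, total):
    """Percentage of articles that mention a risk, rounded to one decimal."""
    return round(100 * count / total, 1)

shares = {
    risk: {keyword: share(n, sample_sizes[keyword]) for keyword, n in counts.items()}
    for risk, counts in risk_counts.items()
}
for risk, values in shares.items():
    print(risk, values)
# surveillance {'big data': 26.9, 'AI': 17.4}
# data bias {'big data': 16.1, 'AI': 16.2}
# information disorder {'big data': 4.7, 'AI': 7.5}
```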

Figure 5: Data Risks in AI Articles

Overall, data risks are clearly present in tech reporting (table 4), though there are noticeable differences between news media in the way they emphasise some risks over others. If references to data risks serve as an indicator, tech reporting cannot be called uncritical on any platform. However, not all articles that include data risks discuss them as central topics or at great length; in some articles they are merely mentioned as additional information. Data risks are visible and associated with both technology trends in diverse domains, but the extent to which they are explored can differ considerably between articles.

Table 4: Data Risks in Big Data and AI Articles

Outlet | N1 | Risks in BD (% of N1) | N2 | Risks in AI (% of N2)
Guardian | 307 | 55.7% | 1,049 | 43.9%
NYT | 1,229 | 40.3% | 3,369 | 43.5%
Gizmodo | 207 | 49.3% | 1,715 | 31.8%
Wired | 717 | 42.1% | 4,873 | 39.2%

All news outlets make direct references to ethics (table 5), often in co-occurrence with data risks (77% of articles on big data that mention ethics and 43.6% for AI). The Guardian stands out in the overall frequency of references to ethics but the differences between the other three outlets are marginal. An example headline for an explicitly ethics-focused story is “Why AI Is Still Waiting For Its Ethics Transplant” (Wired 2017); these articles usually address the lack of ethics in the design of data-driven systems, discuss the difficulties of implementing ethical guidelines, or make calls for taking ethics into consideration. Like data risks, articles that address ethics may merely mention that a given development has an ethical component to it without further elaborating on this.

Table 5: References to Ethics

Outlet | N1 | Ethics in BD (% of N1) | N2 | Ethics in AI (% of N2)
Guardian | 307 | 28.7% | 1,049 | 29.7%
NYT | 1,229 | 17.2% | 3,369 | 22.8%
Gizmodo | 207 | 23.2% | 1,715 | 18.4%
Wired | 717 | 19.1% | 4,873 | 19.7%

As interest in AI increases, so, it seems, does more critical news framing. Wired and the NYT are exemplary here (figures 6 and 7): on both news sites Governance and Data Risks emerged as noticeable frames that gained visibility over time. For Wired, the cluster accounted for 5.8% of all articles in 2010 and that noticeably increased to 25.7% in 2019.

Figure 6: Frames over time Wired 2010-2020

In the case of the NYT, it had become the most frequently occurring news frame by 2019; it accounts here for 34% of all articles, which is an increase of almost 30 percentage points since 2010 (4.9%). Indeed, with the introduction of AI-driven solutions in diverse sectors over the past few years, incidents related to data risks became newsworthy items on news agendas.

Figure 9: Frames over time NYT 2010-2019

Vocal debates on the role of governance and regulation related to such challenges would further explain this increase of more risk- and responsibility-focused news reporting on data-driven technologies. However, there are only very few direct references to justice and injustice. For example, on Wired a mere two articles each refer to “unjust” and “injustice” and 39 to “justice” in its AI coverage, though it is important to keep in mind that these terms may not always refer to questions of fairness in data use. The General Data Protection Regulation (GDPR) occurs in 36 articles on Wired overall. A mere five articles specifically refer to data literacy, while data justice as a phrase does not occur at all. Both are academic terms that have not (yet) found their way into media discourses.

Conclusion

Concerning research question 1 ‘What frames do news articles on Big Data and AI convey in mainstream and special interest news outlets, and is data justice one of them?’: the findings of the distant reading show that big data and AI have an impact on diverse domains in society but that business-oriented angles seem to dominate most outlets’ reporting on data-driven technologies; these are related to economic potentials, user-centric products, and the commercialisation of innovation. All outlets address data risks and technology governance, which emerged as a noticeable frame across the sample. While the exact words “justice” or “data justice” are not part of news reporting on the digital transformation, the related challenges are visible in the media discourse. This can support critical thinking about emerging technologies among the public, which potentially contributes to the build-up of general data literacy in society. Before individuals can form opinions on data justice, they need to understand what datafication means and what the stakes are.

Data literacy includes a broader awareness for how digital technologies collect and process data. It includes a basic understanding of data-driven operations in public and private organisations and the risks this bears (Taylor 2017). The assumption is that news media discourses are at least to some degree influential in shaping this awareness. However, even though critical reporting appears on the rise, further research needs to explore how exactly data justice issues are “translated” for lay audiences (metaphors, responsibilities, social configurations etc.). A combination of framing analyses and critical discourse analysis can provide further insights into discursive practices.

Concerning research question 2 “What individual and collective risks (ethical challenges) do news articles on BD and AI point to?”: four recurring data risks emerged during the analysis: surveillance, data bias, cyberwar & cybercrime, and information disorder. Especially challenges related to privacy intrusion, control, and discrimination directly connect to Taylor’s (2017) proposal for operationalising data justice.

More critical analyses are needed of the media presentation of concrete cases of e.g. algorithmic discrimination and/or surveillance, to identify discursive practices and networks of stakeholders. It is important to consider that the automated distant reading does not allow for assessing the depth of critical reflection for most of the analysed articles. A clear limitation is that a dictionary-based approach is blind to latent meaning in text and only registers manifest signals. A qualitative discourse analysis can expand the close reading to explore how exactly each type of data risk is portrayed and given meaning. Insights gained from the qualitative content analysis in the early steps of the present study provide a basis for this.
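The limitation of dictionary-based counting noted above can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the dictionary `RISK_TERMS`, the function `count_articles_per_category`, and the three sample sentences are hypothetical and serve only to show that an article is registered when it contains a manifest term and is missed when the issue is only implied.

```python
import re

# Hypothetical mini-dictionary; the study's actual term lists are not reproduced here.
RISK_TERMS = {
    "surveillance": ["surveillance", "tracking"],
    "data bias": ["bias", "discrimination"],
}

def count_articles_per_category(articles, dictionary):
    """Count, per category, the articles containing at least one dictionary term."""
    counts = {category: 0 for category in dictionary}
    for text in articles:
        # Lower-case the text and reduce it to a set of word tokens.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for category, terms in dictionary.items():
            if any(term in tokens for term in terms):
                counts[category] += 1
    return counts

# Invented example sentences, not taken from the sample.
articles = [
    "Police tracking software raises surveillance concerns.",
    "The hiring tool kept turning away qualified women.",  # latent bias, no manifest term
    "Researchers found bias in a facial recognition system.",
]

result = count_articles_per_category(articles, RISK_TERMS)
print(result)
```

The second sentence describes a discriminatory outcome without using any dictionary term, so it is not counted under data bias. This is the kind of latent meaning that a qualitative close reading would have to capture.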

In addition, a network analysis of recurring organisations and individuals in the tech discourse can reveal opinion leaders and associated responsibilities. The data is easily retrievable with the ACA/NER method, but the respective analyses need to be addressed in a separate discussion that connects the networks to concepts of influence, responsibility, and power. Nevertheless, the findings from the ACA offer insights into the general visibility of normative challenges of the digital transformation in public discourses. The ACA charted the main interpretative frames and showed that, despite a diversity in topics, the overall framing is relatively similar across the different outlets and for both technology buzzwords.
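As a rough illustration of what such a network analysis could start from, the following Python sketch builds a co-occurrence network from named entities. It assumes that NER output is available as one set of organisation names per article; the variable `entities_per_article`, the helper functions, and the three example sets are hypothetical and do not reflect the study's data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: the set of organisations named in each article.
entities_per_article = [
    {"Facebook", "Cambridge Analytica"},
    {"Facebook", "Google"},
    {"Facebook", "Cambridge Analytica", "Google"},
]

def cooccurrence_edges(entity_sets):
    """Weight each pair of entities by the number of articles naming both."""
    edges = Counter()
    for entities in entity_sets:
        # Sorting gives every pair a single canonical key (a, b) with a < b.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges

def weighted_degree(edges):
    """Sum the edge weights attached to each entity."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree

edges = cooccurrence_edges(entities_per_article)
degree = weighted_degree(edges)
print(degree.most_common())
```

Weighted degree is only one simple indicator of visibility in a discourse. Interpreting it in terms of influence, responsibility, or power would still require the separate conceptual discussion mentioned above.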

List of References

The Atlantic (2016): ‘How Many Stories Do Newspapers Publish Per Day?’. Online accessible via https://www.theatlantic.com/technology/archive/2016/05/how-many-stories-do-newspapers-publish-per-day/483845/ (accessed 21/11/2020).

Backstrand, K., Meadowcroft, J. and Oppenheimer, M. (2011): ‘The politics and policy of carbon capture and storage: Framing an emergent technology’, in Global Environmental Change 21(2), pp. 275-281.

BBC Online (2019): ‘Tesla Model 3. Autopilot Engaged During Fatal Crash’. Online retrievable via https://www.bbc.co.uk/news/technology-48308852 (accessed: 05/10/2020).

Brossard, D., Scheufele, D. A., Kim, E. and Lewenstein, B. L. (2008): ‘Religiosity as a Perceptual Filter. Examining Processes of Opinion Formation about Nanotechnology’ in Public Understanding of Science 18 (5), pp. 546-558.

Burscher, B., Vliegenthart, R. and de Vreese, C. (2016): ‘Frames Beyond Words. Applying Cluster and Sentiment Analysis to News Coverage of the Nuclear Power Issue’, in Social Science Computer Review 34 (5), pp. 530-545.

Carmi, E. & Yates, S. J. & Lockley, E. & Pawluczuk, A. (2020): ‘Data citizenship: rethinking data literacy in the age of disinformation, misinformation, and malinformation’, Internet Policy Review, 9(2). DOI: 10.14763/2020.2.1481 

Cukier, K. and Mayer-Schönberger, V. (2013): Big Data. A Revolution That Will Transform How We Live, Work and Think. London: John Murray.

Cutcliffe, S. H., Pense, C. M., and Zvalaren, M. (2012): ‘Framing the Discussion: Nanotechnology and the Social Construction of Technology–What STS Scholars Are Saying’, in NanoEthics 2, pp. 81-99.

Darling, K. (2015): ‘“Who’s Johnny?” Anthropomorphic Framing in Human-Robot Interaction, Integration and Policy’, in Lin, P., Bekey, G., Abney, K., and Jenkins, R. (eds) (2017): Robot Ethics 2.0. Oxford: Oxford University Press.

Delshad, A. and Raymond, L. (2013): ‘Media Framing and Public Attitudes Towards Biofuels’, in Review of Policy Research 30 (2), pp. 190-210.

Dencik, L., Hintz, A., Redden, J., and Trere, E. (2019): ‘Exploring Data Justice. Conceptions, Applications and Directions’, in Information, Communication & Society 22 (7), pp. 873-881.

Du, Q., & Han, Z. (2019). The framing of nuclear energy in Chinese media discourse: A comparison between national and local newspapers. Journal of Cleaner Production, 118695. doi:10.1016/j.jclepro.2019.118695

Entman, R. M. (1993): ‘Framing. Toward Clarification of a Fractured Paradigm’, in Journal of Communication 43 (4), 51-58.

Evans, M. S. (2014). A computational approach to qualitative analysis in large textual datasets. PloS one, 9(2), e87908.

Forbes (2019): ‘Amazon Refuses to Quit Selling Flawed and Racially Biased Facial Recognition’. Online retrievable via https://www.forbes.com/sites/zakdoffman/2019/01/28/amazon-hits-out-at-attackers-and-claims-were-not-racist/#741a580f46e7 (accessed: 07/10/2020).

Garofalo, M., Botta, A., and Ventre, G. (2016): Astrophysics and Big Data: Challenges, Methods, and Tools. Proceedings of the International Astronomical Union, 12(S325), 345-348.

Gizmodo (2020): ‘Zoom Calls Are Less Boring With a Digital Twin of Yourself’. Online accessible via https://gizmodo.com/man-who-made-a-digital-ai-powered-twin-for-video-calls-1842705724 (accessed 10/08/2020).

Guardian (2020): ‘Fresh Cambridge Analytica Leak Shows Global Manipulation is Out of Control’. Online retrievable via: https://www.theguardian.com/uk-news/2020/jan/04/cambridge-analytica-data-leak-global-election-manipulation (accessed: 06/10/2020).

Guardian (2019): ‘Hey Siri! Stop recording and sharing my private conversations’. Online accessible via https://www.theguardian.com/commentisfree/2019/jul/30/apple-siri-voice-assistants-privacy (accessed 21/11/2020).

Guardian (2016a): ‘Robot staff and emoji menus: how hospitality went hi-tech’. Online accessible via https://www.theguardian.com/media-network/2016/jul/13/robots-hotel-technology-hospitality-emoji-menus (accessed 21/11/2020).

Guardian (2016b): ‘The Eight Technologies Every Entrepreneur Should Know About’. Online accessible via https://www.theguardian.com/small-business-network/2016/oct/11/technologies-entrepreneur-small-business-blockchain-virtual-reality-drones (accessed 21/11/2020).

Gunther, E. and Quandt, T. (2015): ‘Word Counts and Topic Models’, in Digital Journalism. DOI: 10.1080/21670811.2015.1093270

Guzman, A. L. and Jones, S. (2014): ‘Napster and the Press. Framing Music Technology’, in First Monday 19 (10).

Groves, T., Figuerola, C. G., and Groves, M. A. (2015): ‘Ten Years of Science News. A Longitudinal Analysis of Scientific Culture in the Spanish Digital Press’, in Public Understanding of Science 25 (6), pp. 691-705.

Hartmann, P. M., Zaki, M., Feldman, N., and Nely, A. (2016): ‘Capturing Value from Big Data. A Taxonomy of Data-Driven Business Models Used by Start-Up Firms’, in International Journal of Operations & Production Management 36 (10), pp. 1382-1406.

Hayes, A. F. and Krippendorff, K. (2007): ‘Answering the Call for a Standard Reliability Measure for Coding Data’, in Communication Methods and Measures 1, 77-89.

Heeks, R., & Renken, J. (2018). Data justice for development: What would it mean? Information Development, 34(1), 90–102. https://doi.org/10.1177/0266666916678282

Heeks, R. and Shekar, S. (2019): ‘Datafication, Development, and Marginalised Urban Communities. An Applied Data Justice Framework’, in Information, Communication & Society 22 (7), 992-1011.

Holliman, R. (2004): ‘Media Coverage of Cloning: A Study of Media Content, Production and Reception’, in Public Understanding of Science 13, pp. 107-130.

Lakoff, G. and Johnson, M. (2003): Metaphors We Live By. Chicago: University of Chicago Press.

Lindgren, S. (2020): Data Theory. Cambridge: Polity.

Maciejewski, M. (2017): ‘To Do More, Better, Faster and More Cheaply. Using Big Data in Public Administration’, in International Review of Administrative Sciences 83 (15), pp. 120-135.

Matthes, J. (2013): Framing. Baden-Baden: Nomos.

Matthes, J. and Kohring, M. (2008): ‘The Content Analysis of Media Frames. Toward Improving Reliability and Validity’, in Journal of Communication 58, 258-279.

McStay, A. (2018). Emotional AI. The Rise of Empathic Media. London: Sage.

McAfee, A. and Brynjolfsson, E. (2017): Machine, Platform, Crowd. Harnessing Our Digital Future. New York: Norton.

McQuail, D. and Deuze, M. (2020): McQuail’s Media & Mass Communication Theory. London: Sage.

Michael, M. and Lupton, D. (2015): ‘Toward a Manifesto for the “Public Understanding” of Big Data’, in Public Understanding of Science, pp. 104-116.

The New York Times (2020a): ‘Cambridge Analytica and Facebook: The Scandal and the Fallout So Far’. Online accessible via https://www.nytimes.com/2018/04/04/us/politics/cambridge-analytica-scandal-fallout.html (accessed 20/11/2020).

The New York Times (2020b): ‘Google and the University of Chicago Are Sued Over Data Sharing’. Online accessible via https://www.nytimes.com/2019/06/26/technology/google-university-chicago-data-sharing-lawsuit.html?searchResultPosition=2 (accessed 20/11/2020).

The New York Times (2019): ‘Who’s to Blame When Algorithms Discriminate?’. Online accessible via https://www.nytimes.com/2019/08/20/upshot/housing-discrimination-algorithms-hud.html (accessed 21/11/2020).

Ngo, J., Hwang, B. G. and Zhang, C. (2020): ‘Factor-based Big Data and Predictive Analytics Capability Assessment Tool for the Construction Industry’, in Automation in Construction 110.

Nguyen, D. (2017): Europe, the Crisis, and the Internet. A Web Sphere Analysis. London: Palgrave MacMillan.

Paganoni, M. C. (2019): Framing Big Data. A Linguistic and Discursive Approach. London: Palgrave Macmillan.

Pentzold, C., Brantner, C. and Fölsche, L. (2019): Imagining Big Data: Illustrations of ‘Big Data’ in US News Articles, 2010–2016. In: New Media & Society. 21(1), 139-167.

Pentzold, C., & Fischer, C. (2017). Framing Big Data: The discursive construction of a radio cell query in Germany. Big Data & Society. https://doi.org/10.1177/2053951717745897

Schutz, H. and Wiedermann, P. M. (2008): ‘Framing effects on risk perception of nanotechnology’, in Public Understanding of Science 17 (3), pp. 369-379.

Shaikh, A. R., Butte, A., Schully, S. D., Dalton, W. S., Khoury, M. J., and Hesse, B. W. (2014): ‘Collaborative Biomedicine in the Age of Big Data. The Case of Cancer’, in Journal of Medical Internet Research 16 (4).

Smyrnaios, N. (2018): Internet Oligopoly. The Corporate Takeover of Our Digital World. London: Emerald.

Srnicek, N. (2017): Platform Capitalism. Cambridge: Polity.

Taylor, L. (2017): ‘What is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally’, in Big Data & Society, 1-14.

The New York Times (2019): ‘Google and the University of Chicago Are Sued Over Data Sharing’. Online retrievable via https://www.nytimes.com/2019/06/26/technology/google-university-chicago-data-sharing-lawsuit.html?searchResultPosition=2 (accessed: 05/10/2020).

Tucker, C. (2012): ‘Using Social Network Analysis and Framing to Assess Collective Identity in the Genetic Engineering Resistance Movement of Aotearoa New Zealand’ in Social Movement Studies, 12 (1), pp. 81-95.


Turow J, Hennessy M and Draper N (2015): The tradeoff fallacy: How marketers are misrepresenting American consumers and opening them up to exploitation. Epub ahead of print 2015. DOI: 10.2139/ssrn.2820060.

Wahl-Jorgensen, K., Bennett, L., and Taylor, G. (2017): ‘The Normalization of Surveillance and the Invisibility of Digital Citizenship. Media Debates After the Snowden Revelations’, in International Journal of Communication 11, pp. 740-762.

Watanabe, K. and Zhou, Y. (2020): ‘Theory-Driven Analysis of Large Corpora. Semisupervised Topic Classification of the UN Speeches’, in Social Science Computer Review. DOI: 10.1177/0894439320907027

Williamson, B. (2018): Big Data in Education. The Digital Future of Learning, Policy, and Practice. London: Sage.

Wired (2020a): ‘AI Can Help Patients but Only If Doctors Understand It’. Online accessible via https://www.Wired.com/story/ai-help-patients-doctors-understand/ (accessed 20/11/2020).

Wired (2020b): ‘China’s Hacking Spree Will Have a Decades-Long Fallout’. Online accessible via https://www.wired.com/story/china-equifax-anthem-marriott-opm-hacks-data/ (accessed 21/11/2020).

Wired (2018): ‘The Self-Driving Startup Teaching Cars to Talk’. Online accessible via https://www.wired.com/story/driveai-self-driving-design-frisco-texas/ (accessed 21/11/2020).

Wired (2017): ‘Why AI Is Still Waiting For Its Ethics Transplant’. Online accessible via https://www.wired.com/story/why-ai-is-still-waiting-for-its-ethics-transplant/ (accessed 21/11/2020).

Wired (2016): ‘Facebook’s AI Is Now Automatically Writing Photo Captions’. Online accessible via https://www.wired.com/2016/04/facebook-using-ai-write-photo-captions-blind-users/ (accessed 21/11/2020).

VICE (2020): ‘The Netherlands Is Becoming a Predictive Policing Hot Spot’. Online accessible via https://www.vice.com/en/article/5dpmdd/the-netherlands-is-becoming-a-predictive-policing-hot-spot (accessed 20/11/2020).

Vishwanath, A. (2009): ‘From Belief-Importance to Intention: The Impact of Framing on Technology Adoption’, in Communication Monographs 76 (2), 177-206.

Zuboff, S. (2019): Surveillance Capitalism. The Fight for a Human Future at the New Frontier of Power. London: Profile Books.
