1. Introduction
In the last 20 years, scholarly research has benefited tremendously from the deluge of information available on the World Wide Web and other digital outlets. Nonetheless, the risk of unreliable or, worse, deliberately false sources has considerably increased, especially with the emergence of social networks as major channels of communication and as news feeds, as demonstrated by “Russiagate” and the related problems for the Trump Administration, or by the “infodemic” of (more or less trustworthy) news about COVID-19. Scholars are not immune to such perils because they themselves rely increasingly on publicly available information and online databases. Hence, greater care and scrutiny are necessary to ensure that the data and sources used to investigate social phenomena are valid, reliable and trustworthy. Data
triangulation may be a viable solution. Much as in orienteering, the idea is to determine the “location” of a third point from two known positions or, in this case, to compare two different, independent sources of information on the same issues and see whether their findings coincide, namely whether they “say the same thing”. Where possible, correlation coefficients, handled with all due caution, may give the researcher an actual measure of “how much” the two independent sources overlap, thereby increasing confidence in their reliability. Incidentally, using different sources of data, collected for diverse purposes, to find creative solutions to problems is one of the linchpins of Big Data (BD) analytical methods [
1,
2].
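A minimal sketch of such a check, assuming two hypothetical yearly strike-count series (the figures and names below are placeholders, not actual NAF or TBIJ data), could look as follows:

```python
# Minimal sketch: compare yearly strike counts from two independent sources
# and compute Pearson's r as a rough measure of overlap.
# The numbers and series names below are placeholders, not NAF/TBIJ figures.
import pandas as pd

source_a = pd.Series({2010: 120, 2011: 70, 2012: 48, 2013: 27}, name="source_a")
source_b = pd.Series({2010: 128, 2011: 75, 2012: 50, 2013: 27}, name="source_b")

# Align the two series on the years covered by both sources, then correlate.
both = pd.concat([source_a, source_b], axis=1).dropna()
r = both["source_a"].corr(both["source_b"])
print(f"Pearson correlation between the two sources: {r:.2f}")
```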
Although triangulation has been hailed as a highly valuable methodology for the social sciences, it may be plagued by data scarcity or poor data quality, and indeed, the debate on whether triangulation is a sustainable alternative for the social sciences has been a lengthy one [
3,
4]. Denzin gives four examples of ‘triangulation’: (1) data triangulation, (2) investigator triangulation, (3) theory triangulation and (4) methodological triangulation [
1,
5]. In this paper, we focus on ‘data triangulation’, which has three subtypes: (a) time, where observations are collected at different moments, (b) space, which requires different contexts, and (c) persons. We thus examine data triangulation across different times and locations specifically for two databases reporting on drone strikes, one created by the New America Foundation (NAF) and the other by The Bureau of Investigative Journalism (TBIJ). We used the most updated datasets available as of the end of 2021. Unfortunately, after 2017, the NAF stopped distinguishing between “air” and “drone” strikes for Yemen and Somalia, while keeping the distinction for Pakistan, thus making a comparative analysis of all the cases after 2017–2018 quite problematic. We nonetheless believe that the methods suggested and explored in this paper offer useful examples for social researchers who have to work with different datasets on the same problem.
Nowadays, unmanned combat aerial vehicles (UCAVs) such as the Predator have become a linchpin in the struggle against terrorist organizations, such that the number of studies focusing on drone strikes and warfare is particularly notable. Several such studies rely on databases generated by news reports [
6,
7,
8]. Furthermore, the ethical and political [
9] consequences of drone warfare for counterterrorism imply that scholarly research on such topics would have essential implications for public policy. Likewise, research questioning the dependability of sources and databases is also on the rise [
10,
11]. For these reasons, we decided to focus specifically on the two databases to see how ‘compatible’ they are with one another and whether they indeed ‘say the same thing’. If these conditions were satisfied, then the results would be consistent with Denzin’s data triangulation, which, in turn, would further strengthen the validity of papers relying on those databases.
As is well known, validity and reliability are the two most important issues when it comes to datasets. While reliability concerns ‘the extent to which (…) any measuring procedure yields the same results in repeated trials’, validity pertains to ‘the crucial relationship between concept and indicator’. Both are, nonetheless, a ‘matter of degree’ [
12]. These issues are critical in the drone debate, where think tanks, NGOs and the US government provide significantly different data about the casualties. We are fully aware that, when stretching the boundaries of probability sampling, authors need to be very careful with the generalizations they draw from these databases. On the other hand, these datasets are the only viable sources for investigating counterterrorism drone campaigns.
In this context, reliability is problematic, as the common methods for measuring it (e.g., test/retest, split-half) can hardly be applied. This is, nonetheless, a frequent obstacle in social science fields such as security studies or international relations, far less so in, for example, clinical psychology. For the two databases considered here, the ultimate sources of reliability would be the quality, ethics and professionalism of the witnesses, observers and journalists who collected information and data on the strikes and their casualties and damage. Measuring reliability at such levels, however, is far beyond the scope of this paper.
The validity of the two databases (that is, whether they indeed measure what they are intended to measure) can be assessed with a greater degree of confidence. Of the three basic types of validity—content validity, criterion-related validity and construct validity—the first two have ‘limited usefulness in assessing the quality of social science measures’. Construct validation, on the other hand, has ‘generalized applicability in the social sciences’. More specifically, ‘if the performance of the measure is consistent with theoretically derived expectations, then it is concluded that the measure is construct valid’ [
12]. In the case of the two databases, the relevant literature consists of studies using news coverage for scholarly analysis, which allows for a relatively positive answer to the question of whether the databases are valid sources for assessing the impact of drone strikes.
As observed by Trochim, there are four possible options when assessing reliability and validity [
13]. The data are (a) both reliable and valid, (b) neither reliable nor valid, (c) valid but not reliable, or (d) reliable but not valid. Of these, the first is clearly the most preferable, and the second is the worst case. Between the third and the fourth (albeit both not without problems), it is preferable that data be valid and only (relatively) reliable rather than fully reliable but not valid, and thus not measuring what the researcher is interested in. In our case, the datasets were valid, as they measured the casualties in the countries concerned, but were not entirely reliable due to the inherent difficulties of building a dataset on news reports and media outlets [
11,
14]. As Franzosi noted, newspapers have long been relied upon by historians and social scientists in general as informal sources of historical data. The author defends the use of news reports as data sources because they often constitute the only available source of information, not all events or items of information are equally liable to misrepresentation in the press, and the type of bias likely to occur in mass media consists more of silence and emphasis than of outright false information [
2].
The ultimate example of dataset building from collective news media is the GDELT Project, described as the largest, most comprehensive, and highest resolution open database of human society ever created [
15]. Another great example is the Global Terrorism Database, an open-source database including information on terrorist events around the world from 1970 through 2016 [
16].
4. Comparison and Discussion of Results
In this second part of the paper, we analyse and compare the data strike by strike to further assess the validity of the two organizations’ datasets. This process leads to the construction of a ‘strike by strike’ matrix for every country, which includes all the information contained in the original datasets. The matrix methodology and results are presented in the following sections.
4.1. Methodology and Results
The comparative matrix building process started with the conversion of the NAF and TBIJ datasets into a common format. To achieve this goal, we calculated the militants’ figures for the TBIJ dataset. The chosen format contained ten columns covering the strike date and location, the total figures, the militants’ figures, the civilians’ figures, the unknowns’ figures and the link to the strike web page. In the case of the TBIJ Pakistan and Yemen datasets, we kept the columns related to the ‘Area’ and ‘Province’ of the strike. Furthermore, we kept the columns referring to the organizations targeted in every strike included in the NAF datasets. The comparison process took into account, first of all, the date and location columns and classified the strikes into three categories: identical, similar and different. The identical strikes were those pairs of events for which date, location and numerical data coincided in both the NAF and TBIJ datasets. Similar strikes presented the same date and location but different casualty figures. Strikes registered as different were those that did not have a corresponding event in the dataset of the other organization. The date margin of tolerance was 1 day, and in the location comparison, a flexible approach was necessary due to the difficulties in locating different places and to linguistic issues with the geographical designation of various areas, villages and regions.
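To make the classification rule concrete, the following sketch implements the three categories under simplifying assumptions: each record is reduced to a parsed date, a normalized location string and a tuple of casualty figures, and the flexible location matching described above is reduced to exact equality of the normalized names. The field and function names are ours, not those of the original datasets.

```python
# Sketch of the strike-pair classification rule (identical / similar / different)
# with a 1-day date tolerance. Field names and values are illustrative; in the
# actual comparison, location matching was handled flexibly rather than by
# exact string equality.
from dataclasses import dataclass
from datetime import date

@dataclass
class Strike:
    day: date
    location: str        # normalized place name
    casualties: tuple     # numerical fields, e.g. (min_total, max_total)

def same_event(a: Strike, b: Strike, date_tolerance_days: int = 1) -> bool:
    """Dates within tolerance and (normalized) locations considered the same."""
    return (abs((a.day - b.day).days) <= date_tolerance_days
            and a.location == b.location)

def classify(a: Strike, b: Strike) -> str:
    """Return 'identical', 'similar' or 'different' for a pair of records."""
    if not same_event(a, b):
        return "different"   # no corresponding event in the other dataset
    if a.casualties == b.casualties:
        return "identical"   # date, location and numerical data all coincide
    return "similar"         # same event, different casualty figures

# Placeholder example: same day and place, diverging figures -> "similar".
record_a = Strike(date(2012, 3, 9), "place_a", (10, 12))
record_b = Strike(date(2012, 3, 9), "place_a", (12, 15))
print(classify(record_a, record_b))
```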
We treated places that were geographically close as the same location, taking into account the difficulties a researcher faces in precisely locating the strikes. In some cases, strikes registered by the same organization and occurring on the same date and in the same location were merged into a single strike. This operation was necessary only 13 times, and only under particular conditions, to facilitate the comparison process. Merging the strikes was justified by the different methodologies used by the two organizations to catalogue strikes occurring on the same day at different times. The final stage of the process consisted of building a matrix for every country that included the identical strikes, the strikes registered by only one of the two organizations (the different strikes) and the non-numerical information on the similar strikes. This matrix was the basis for the construction of many possible comparative databases, depending on the criteria chosen for processing the numerical data on similar strikes. Therefore, we could process these data in different ways according to the purpose of the analysis.
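A similarly minimal sketch of the merging step is given below, assuming illustrative column names and that merged entries simply sum their casualty ranges; the original procedure, by contrast, applied merging only in particular, individually checked conditions.

```python
# Sketch of merging same-organization entries that share a date and location
# (e.g., several strikes on the same day at different times) before comparison.
# Column names and the rule of summing the casualty ranges are our assumptions,
# not the original schema or procedure.
import pandas as pd

def merge_same_day_strikes(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse rows with the same date and location, summing casualty ranges."""
    return (df.groupby(["date", "location"], as_index=False)
              .agg({"min_total": "sum", "max_total": "sum"}))

# Placeholder example: two entries on the same day and place become one row.
strikes = pd.DataFrame({
    "date": ["2012-03-09", "2012-03-09", "2012-03-13"],
    "location": ["place_a", "place_a", "place_b"],
    "min_total": [4, 3, 1],
    "max_total": [6, 5, 2],
})
print(merge_same_day_strikes(strikes))
```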
4.2. Pakistan
The Pakistan comparative matrix (PAKO) presents the results shown in
Table 3. Regarding the methodology, for the strike reported on 19 June 2004 by the NAF and on 17 June 2004 by the TBIJ, we made an exception to the date margin of tolerance (the two strikes were the only ones reported by the two organizations in 2004). In 15 cases, two strikes with different dates were considered similar or identical, and the merger of multiple strikes was necessary eight times (see
Appendix C).
The comparison results report 72 identical strike pairs, 329 similar and 33 different events. Eleven of the different events derived from the NAF dataset and 22 from the TBIJ dataset. This fact reflects the higher Natt in the TBIJ dataset. Therefore, PAKO included 434 strikes and classified about 7.5% of them as different. The data show that the two original datasets agree on date and location in 92% of the cases but on the numerical data in only 17%. The aggregate data for only the identical strikes are interesting for two reasons: first, the unknowns’ data were null because the TBIJ did not include them, and second, the civilians’ data amount to three casualties. This aspect demonstrates the difficulty of finding agreement between the two organizations’ datasets in terms of the number of civilian casualties and is related to methodological and terminological differences. It must be stressed that the criteria for the “Identical” category were very strict. For example, if we considered only the total values, the number of identical strikes would increase by 28, and more than half of the pairs of similar strikes would present a difference of no more than 2 between the sums of their respective minimum and maximum total values.
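As a quick check, the rounded shares quoted above follow directly from the three counts; the short computation below is our own arithmetic, not part of the original matrices.

```python
# Our own arithmetic check of the PAKO shares quoted above; the three counts
# come from Table 3, and the percentages are simply the corresponding ratios.
identical, similar, different = 72, 329, 33
total = identical + similar + different                                # 434 strikes
print(f"different:              {different / total:.1%}")              # ~7.6%
print(f"same date and location: {(identical + similar) / total:.1%}")  # ~92.4%
print(f"identical:              {identical / total:.1%}")              # ~16.6%
```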
4.3. Yemen
The Yemen comparative matrix (YEMO) did not differ from PAKO in methodology or format. We considered pairs of strikes with different dates as similar or identical and merged multiple strikes into a single event only twice (see
Appendix C). In the TBIJ Yemen dataset, the location was missing in some cases while the province was always specified; in these cases, we considered only the province indicator for the comparison. The third line in
Table 3 shows the YEMO results. The main difference from the PAKO data was the number of different strikes, which represented more than a third of the total. Fifteen of these strikes derived from the TBIJ dataset and 76 from the NAF dataset.
These data only partly reflected differences in Natt; they also reflected variations in data collection methodology. In fact, both original datasets registered UCAV strikes along with strikes conducted with other military means. Moreover, the TBIJ Yemen dataset included strikes for which US responsibility was not confirmed. The inclusion of different types of strikes in the datasets caused a methodological issue, highlighted by the high number of different strikes observed. Nonetheless, it is remarkable that almost 60% of the strikes were identical or similar, and more than a fourth were identical. The aggregate data related only to the identical strikes confirmed the issues seen in the PAKO case. The unknowns’ data were null, and the two datasets agreed on a strike with civilian casualties only once, reaffirming the difficulty of finding agreement between the two organizations’ datasets in terms of civilian casualties.
4.4. Somalia
In the Somalia comparative matrix (SOMO), a date margin of tolerance was necessary in two cases, and we merged multiple strikes only three times (see
Appendix C). The comparison results, presented in
Table 3 (fourth line), diverged from those of YEMO and PAKO because more than half of the strikes were different, and only 17% were a perfect match. The strikes registered as different derived from the NAF in 13 cases and from the TBIJ in the other 12. The data on the identical strikes confirmed the issues with civilian casualty figures, which were null. In conclusion, SOMO differed from the other comparative matrices and reaffirmed the methodological issues previously highlighted.
5. Conclusions
Given the continuing attention to the drone debate [
20] and, in particular, to the casualty data and the number of articles published on the subject, we consider it an important contribution to the field that two of the most relevant databases have been thoroughly tested for coherence and concordance. In this respect, our analysis showed that the two databases were generally in agreement on the overall picture. In particular, the distribution over time of the number of strikes was highly consistent across the two databases, and their civilian/militant ratios also showed similar trends. Furthermore, the comparison results presented strong concordance between the datasets. The total comparison results, presented in
Table 3, show the number of strikes by type and country. The most relevant aspect was the small percentage of different strikes in PAKO. This indicates that PAKO was the matrix in which the NAF and TBIJ agreed most often on date and location. On the other hand, the share of identical strikes was higher in relative terms in YEMO, at 25%, against around 17% for SOMO and PAKO. The total results show that 20% of the strikes were identical, 59% were similar and 20% were different. In conclusion, the two think tanks agreed on date and location in 79% of the cases, while one strike in five was identical. The inclusion of all these strikes in a single matrix provides a more complete picture of the targeted killing campaign because strikes registered by one organization and not by the other have been included in the ‘Different’ category.
In the literature on triangulation, among the four types highlighted by Denzin [
1,
5], the one referring to the use of different databases has received less attention, perhaps due to the data scarcity prevailing at the time of his writing (the late 1970s). Today, with plenty of digital data available and a growing number of news sources, the creation of very large databases such as GDELT and the GTD is becoming more common and more user-friendly, as demonstrated by the field of Big Data analysis. It is indeed likely that triangulation across different data sources will become the most common approach. Avenues for future research include merging the two databases into a single one, built using the comparative databases as a starting point. Furthermore, the development of new studies correlating strike data with other variables, such as daily reactions in the media as captured by GDELT or the impact on the terrorist organizations, is also likely. Clearly, the obstacles are still quite substantial. For example, even though the two databases present similar trends, eliminating systematic error is quite challenging, as we can see from the differences in the total civilian casualty figures in
Table 1 and
Table 2.
As is already known, most empirical evidence of validity is correlational [
21]. Since the two databases show a significant number of identical and similar strikes, we can safely affirm that a satisfactory level of validity has been achieved. As for reliability, the final assessment is more problematic. While the literature supporting the reliance on news coverage for research is well established, we have to conclude that reliability is only partial, as highlighted by the 143 different strikes reported in
Table 3. Thus, more effort should be invested in this area by finding ways to (a) improve the level of detail in the databases and (b) compare news reports with other data sources that may support (or contradict) the conclusions reached. Using Trochim’s four alternatives for reliability and validity, we can assert that, for the two databases considered here, we have a ‘data are valid but only relatively reliable’ case, that is, a ‘second best’ (preferable to ‘reliable but not valid’) scenario. In more general terms, this is an achievement to strive for. Naturally, it would be preferable to have the ‘valid and reliable’ assessment, but, for the social sciences, the second best is often the most a researcher can hope for.
In the specific case considered here, the databases are likely to become even more central to any future research on drone strikes. In fact, moving away from the greater transparency that had begun to appear during the last phase of the Obama administration, the Trump administration, as shown by a recent report [
22], moved toward more secrecy and thus more opacity in the data made available for this type of research. As the alternative would be to do ‘less’ research on drone strikes, a second-best outcome of ‘valid even if only relatively reliable’ does not seem too ‘unpalatable’.
A final, general lesson from the cases examined in this paper is that the growing amount of data, of the most diverse types, available online (e.g., GDELT) may present a valuable opportunity for social scientists to improve the overall quality of their research. Perhaps these are excessive expectations, as has already happened in the past, or perhaps the social sciences are inexorably doomed to remain ‘imprecise’ by their very nature, but, as we mentioned in the introduction, (big) data could definitely be a game-changer by harnessing, for example, unrelated data to provide researchers with undiscovered insights into the problems at hand. Such a path will certainly increase the relevance of the validity/reliability issue, but it will also offer tremendous opportunities for new research avenues that would be too tempting to leave untried.