2. Literature Review
From an economic perspective, the creation of scholarly works can be viewed as a production process. Inputs include human capital, physical capital, financial capital, and labor, and the outputs are scholarly publications. Researchers want not only to produce publications, but to do so as efficiently as possible. This means minimizing the cost associated with each input (e.g., labor or financial capital). Collaborations allow researchers to increase available inputs and share the various costs of production, thereby limiting potential barriers to producing and publishing research. In fact, scholarly publications and citations thereof exhibit an economy of scale with respect to the number of different nations contributing to the work [
2]. This means that as the number of different nations contributing to a research project increases, the impact of that project, as measured by citations, increases.
Collaboration in research and development (R&D) is known to ease the financial constraints involved [
3]. In addition to the importance of funding, Katz and Martin [
4] provided an extensive list of other contributing factors which included increasing scientific recognition and popularity, increasing specialization, increasing desires to work in interdisciplinary environments, and increasing labor constraints. In the context of publishing research in open access journals, this would also include easing the financial burden of paying publication fees. Collaboration offers the potential to ease all of these constraints, from labor to intellectual to financial.
In addition to funding from governments and non-governmental organizations (NGOs), grants and other funding from the private sector are becoming increasingly common in academia, especially in Organisation for Economic Co-operation and Development (OECD) member nations [
5]. In addition to peer-reviewed publications in scholarly journals, funding from the private sector also leads to increases in patents and research that supports positions held by industry [
6,
7]. Even though there are a few cases of journals reverting from open access to subscription models [
8], the importance of the financial constraint is only magnified when considering the rapid growth of the open access publishing model in the current century [
9].
In the context of international collaboration, increased national R&D expenditures are known to lead to increases in international collaboration in scholarly publication [
10,
11]. However, increased national R&D expenditure is not the only important factor in forming international collaborations, especially in developing nations. While open access publications can be viewed freely, a sufficient level of infrastructure is necessary in order to reap this benefit. In an economic framework, this would be the physical capital component. Internet access in developing nations is critical to disseminating knowledge in general [
12], and could prove critical in acquiring, preserving, and disseminating indigenous knowledge specifically [
13]. This lack of infrastructure is partially responsible for the non-uniform collaboration tendencies across Africa [
14].
In addition to the financial and physical capital constraints, human capital is a crucial component motivating the formation of international collaborations. This includes the sharing of ideas, which is fundamental to research and a major motivation for forming collaborations, and, in many cases, the sharing of data and software. Open source software is a key component of open access and the reproducibility of research [
15,
16] and is widely viewed as a critical part of knowledge dissemination in developing nations [
17]. Overcoming the bias against non-native English speakers that exists in the peer-review process is another possible motivator for such researchers to form international collaborations [
18]. Another motivation for developing international collaborations comes from what is known as the reverse brain drain, where students from developing countries graduate with their PhDs from universities in developed nations and return to their native countries. The effect of the reverse brain drain has been an increase in the formation of international collaborations [
19]. Consequently, these effects have been generalized into the notion of brain circulation, in which ideas circulate with researchers as they maximize their own welfare as individuals [
20]. Throughout this paper, publications can be viewed as a proxy for the welfare of researchers, since career advancement is, in the current paradigm, inextricably linked to publications.
Studies discerning which nations form international collaborations have shown that international collaborations on scholarly publications are increasing rapidly [
21]. For example, China has experienced exponential growth in papers coauthored with G7 nations [
22]. Elsewhere in Asia, most other nations sought collaborations primarily with developed western nations, including the USA [
23,
24], with India in particular being no exception [
25]. Several network analyses have also concluded that the relatively more developed western nations form the core of the international collaboration network [
26,
27,
28]. There is evidence that the increase in international collaboration found its beginnings in the sciences [
29,
30,
31,
32], although more contemporary studies show that the patterns of growth in international collaborations are fairly similar across disciplines [
33].
Far fewer studies of this nature are restricted to open access journals. This could partially be because open access journals are still not widely popular in developed nations [
34]. A recent study which analyzed some 5208 open access publications found that publications were typified by having a single author [
35]. While this result is no doubt accurate for that sample, the small sample size (5208) and the fact that more than half of the included papers had a first author from the USA cast doubt on whether this sample was truly representative of open access publications in general.
3. Data Collection Process
The raw data we obtained consist of bibliographic data from papers published in MDPI journals. In particular, we wanted to collect data consisting of the national affiliation (of the research institution) of each author. As a result, the data in their rawest form consist of the year that the paper was published, the journal in which it was published, and the complete affiliation data for all authors of the paper. These data were then used to create a data set consisting of the number of papers containing a collaboration between two countries. As an example, if a given paper was authored by at least one researcher from each of Canada, Mexico, and the United States, then we would acknowledge a collaboration between Canada and Mexico, a collaboration between Canada and the United States, and a collaboration between Mexico and the United States. This data set is available for each year as well as for the entire time period from 1996 to 2018. This data set was then analyzed in further detail to answer specific questions for this paper.
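The pairwise counting rule in the example above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script; the function name `count_collaborations` and the sample papers are hypothetical, and each paper is assumed to be given as a list of author countries.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(papers):
    """Count, for each unordered pair of countries, the papers they share."""
    pair_counts = Counter()
    for countries in papers:
        # Deduplicate so several authors from one country count once,
        # and sort so each pair has a single canonical ordering.
        unique = sorted(set(countries))
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical input: one list of author countries per paper.
papers = [
    ["Canada", "Mexico", "United States", "United States"],
    ["Canada", "United States"],
    ["Mexico"],  # single-country paper: contributes no pairs
]
counts = count_collaborations(papers)
```

Under this rule, a paper with authors from k distinct countries contributes k(k-1)/2 pairwise collaborations, so the three-country paper above yields three pairs.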
The process of creating this data set consisted of four major steps: (1) locating all papers published in MDPI journals, (2) collecting the bibliographic data from each of these papers, (3) cleaning the raw data, and (4) organizing the data. Before describing the data collection process, we mention again that all of the software developed to create this data set, along with all of the data, is available under the “2018_Initial” directory of the repository located at
https://github.com/cat-astrophic/MDPI. All scripts are freely available for consumption and may be modified to fulfill different research requirements.
The first step in the process of creating this data set, locating all papers published in MDPI journals, was achieved by using the script MDPI.py. This script parses each page of search results from a search query for all papers published in MDPI journals through the end of 2018 and returns the complete set of urls for all papers as a list. The search query returned 194,049 results, i.e., 194,049 papers published in MDPI journals as of the end of 2018, and the script MDPI.py did indeed return a list containing all 194,049 urls.
Once all of the urls were obtained, the next step was to scrape the bibliographic data from each paper. This was done using the script MDPI_scraper.py. This script visits each url, identifies the year of publication, the journal in which the paper was published, and the affiliations of all authors, scrapes these data, and stores them in a single data frame. This data frame is then saved as a csv file (which is available in the aforementioned repository as papers_all.zip). MDPI_scraper.py returned consumable data for 190,602 of the 194,049 papers published in MDPI journals through the end of 2018 (a 98.22% success rate).
The third script, MDPI_annual_data.py, creates data frames for each year containing all data scraped by MDPI_scraper.py from that year. The purpose of doing this is to create smaller data files that are still easily used by personal computers that may have little RAM, especially considering that researchers may be interested in only a small subset of the years for which data were available (publications in MDPI journals date as far back as 1996). The data frames containing annual data are saved as csv files and are available in the repository collectively as papers_by_year.zip.
The fourth script, MDPI_interaction_matrix.py, creates a matrix indicating the number of collaborations between each pair of countries. In order to do this, the data from the 190,602 papers for which MDPI_scraper.py returned consumable data were cleaned and national affiliations were obtained. To ensure accuracy, this process relied heavily on manual inspection rather than using fuzzy matching methods or other matching techniques. Ultimately, bibliographic data for 190,186 papers were obtained. This reflects a 99.78% success rate for papers for which MDPI_scraper.py returned consumable data in step 2, and a 98.01% success rate for all papers published in MDPI journals as of the end of 2018.
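The success rates quoted for the scraping and cleaning steps follow directly from the three paper counts reported above. The short check below reproduces that arithmetic; the variable and function names are illustrative only.

```python
# Paper counts as reported in the text.
total_published = 194_049    # papers located by the search query
scraped = 190_602            # papers with consumable scraped data
with_country_data = 190_186  # papers with national affiliations after cleaning


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


scrape_rate = pct(scraped, total_published)             # step 2 success rate
cleaning_rate = pct(with_country_data, scraped)         # step 3 rate, relative to step 2
overall_rate = pct(with_country_data, total_published)  # end-to-end rate
```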
Again, we mention that all code developed for the creation of this data set as well as all data files are available in the “2018_Initial” directory within the repository at
https://github.com/cat-astrophic/MDPI.
4. Income and International Collaboration
In this section, we study how income affects the decision of authors to form international collaborations. To do this, we adopt the Analytical Classifications system used by the World Bank [
36,
37]. This classification scheme classifies nations into four income groups based on per capita gross national income (GNI), and thresholds increase over time. The income groups are: High Income, Upper Middle Income, Lower Middle Income, and Low Income. Since the data collected cover the time period from 1996 through 2018, and since the World Bank Analytical Classifications are very consistent over time for individual nations, the income group classifications we use for nations are from the World Bank Analytical Classification for the year 2008, a midpoint in the data set.
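To make the classification rule concrete, the sketch below maps a per capita GNI value to one of the four income groups. The cut-off values are an assumption: they are what we understand the World Bank to have published for 2008 GNI (in US dollars) and should be verified against the official historical classification tables before use. The function name `income_group` is hypothetical.

```python
# Assumed upper bounds (USD, GNI per capita) for the 2008 classification.
# These values are not stated in this paper; verify them against the
# World Bank's historical classification tables before relying on them.
THRESHOLDS_2008 = [
    (975, "Low Income"),
    (3855, "Lower Middle Income"),
    (11905, "Upper Middle Income"),
]


def income_group(gni_per_capita, thresholds=THRESHOLDS_2008):
    """Return the income group whose upper bound first covers the given GNI."""
    for upper_bound, label in thresholds:
        if gni_per_capita <= upper_bound:
            return label
    # Anything above the last bound is classified as high income.
    return "High Income"
```

Because the thresholds are passed as a parameter, the same function could be applied with the cut-offs of a different year, which is useful when checking how stable a nation's classification is over the 1996 to 2018 period.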
The primary benefit of studying how income affects the development of international collaboration in the context of open access publishing is to determine whether or not the behavior of researchers changes with respect to income as economics suggests it should. Specifically, given that both research and open access publication are costly, economics tells us that researchers from less wealthy nations would have the most to gain from forming international collaborations because, on average, we would expect that they would be the least likely to independently finance the costs of research or the publication fees associated with publishing in open access journals. In the context of journals using subscription-based publication models, previous studies have shown that lower-income nations are more likely to collaborate with higher-income nations [
38]. Since the open access publication model only adds additional costs to the process of producing a peer-reviewed publication, we would expect that these results would, if anything, be strengthened in the context of open access publishing. In order to provide evidence that researchers from low-income nations form international collaborations more often than researchers from wealthier nations, we first study the frequency of the formation of collaborations by researchers in each income group.
Of the 190,186 articles from which we acquired bibliographic data, 137,786 (72.45%) had an author from a high-income nation, 24,403 (12.83%) had an author from an upper-middle-income nation, 54,412 (28.61%) had an author from a lower-middle-income nation, and 2,573 (1.35%) had an author from a low-income nation. Because some international collaborations include nations from multiple income groups, these percentages sum to more than 100%. These observations are presented in
Figure 1 with annual counts provided in
Table 1.
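The shares above can be recomputed from the reported counts, which also makes the overlap explicit: a paper with authors from two income groups is counted once in each group, so the group counts add up to more than the number of papers. The dictionary keys below are labels chosen for illustration.

```python
total_papers = 190_186

# Papers with at least one author from each income group, as reported above.
papers_by_group = {
    "High Income": 137_786,
    "Upper Middle Income": 24_403,
    "Lower Middle Income": 54_412,
    "Low Income": 2_573,
}

# Share of all papers that include an author from each group (in percent).
shares = {
    group: round(100 * n / total_papers, 2)
    for group, n in papers_by_group.items()
}

# Group memberships exceed the number of papers because papers spanning
# several income groups are counted once per group.
total_memberships = sum(papers_by_group.values())
overlap = total_memberships - total_papers
combined_share = round(100 * total_memberships / total_papers, 2)
```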
When considering the financial constraint in open access publishing, we might expect that the upper-middle-income group would have contributed to more papers than the lower-middle-income group. This is not the case primarily due to the population difference between the two groups. In particular, China is labeled as a lower-middle-income nation by the World Bank Analytical Classifications system. This means that the collective population of the lower-middle-income group is much larger than that of the upper-middle-income group, as well as that the lower-middle-income group includes a nation (China) which has invested heavily in R&D. To help clarify the magnitude of contributions from each income group, we provide in
Table 2 the collective population, total number of papers published in MDPI, and a per capita measure of MDPI publications by income group using 2018 population data from [
39].
Table 3 provides a complete list of which countries belong to which income group as a reference for the reader. Once the massive population share of the lower-middle-income group is accounted for, we see that publications per capita decrease with decreases in income. However, largely due to the contributions by Chinese scholars, the gap in per capita publications between the high- and upper-middle-income groups is much greater than the per capita publications gap between the upper-middle- and lower-middle-income groups.
When analyzing the number of different nations represented in papers published in MDPI journals, we find that most papers (75.18%) have authors from only one country. This result confirms what was found with a small sample of papers published in open access by [
35]. This behavior is also consistent with what is observed in journals with subscription-based publishing models. This provides strong evidence that researchers view open access journals generally, and MDPI journals specifically, as substitutes for journals using subscription-based publishing models. Additionally, the long-term trend of increasing international collaboration observed in journals with subscription-based publishing models is observed in all four income groups in this study, as seen in
Figure 2.
The distribution of the number of different nations collaborating on papers is summarized in
Table 4 along with summary statistics in
Table 5 and histograms for all nations (
Figure 3), high-income nations (
Figure 4), upper-middle-income nations (
Figure 5), lower-middle-income nations (
Figure 6), and low-income nations (
Figure 7). Note that the percentage of papers with one nation represented is higher for the entire data set than it is for any individual income group. This is because there is no overlap between papers with only one nation represented across different income groups, while there is such an overlap across income groups for papers with two or more nations represented. This is a consequence of the existence of international collaborations with members from different income groups.
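A toy example may help illustrate this overlap effect. The seven papers below are entirely hypothetical, as are the nation codes and the function name; each paper is a list of (nation, income group) pairs. Single-nation papers appear in exactly one income group, while cross-group collaborations appear in the denominator of every group involved, which pulls each group's single-nation share below the overall share in this example.

```python
def single_nation_share(papers, group=None):
    """Share of papers with exactly one nation, optionally restricted to
    papers that include at least one author from the given income group."""
    if group is not None:
        papers = [p for p in papers if any(g == group for _, g in p)]
    single = sum(1 for p in papers if len({nation for nation, _ in p}) == 1)
    return single / len(papers)


# Hypothetical papers: lists of (nation, income group) pairs.
papers = [
    [("US", "High")],
    [("US", "High")],
    [("US", "High")],
    [("CN", "Lower Middle")],
    [("CN", "Lower Middle")],
    [("US", "High"), ("CN", "Lower Middle")],  # counted in two groups
    [("US", "High"), ("ML", "Low")],           # counted in two groups
]

overall = single_nation_share(papers)
by_group = {
    g: single_nation_share(papers, g)
    for g in ("High", "Lower Middle", "Low")
}
```

The inequality is not guaranteed for arbitrary data (a group with no international papers would have a share of 100%), but it holds whenever cross-group collaborations are spread across the groups, as in the data described here.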
As can be seen from the summary statistics and histograms, researchers from low-income nations form international collaborations in what appears to be a significantly different manner. Specifically, it appears as though researchers from low-income nations form international collaborations more frequently than researchers from relatively wealthier nations. This provides evidence in support of the main results presented in this paper, that income is an important factor in the formation of international collaborations, and specifically that researchers from low-income nations are more likely to form international collaborations.
Using the distribution of the number of different nations collaborating on papers which include at least one author from a nation in a given income group as a measure of how researchers in each income group form international collaborations when publishing in MDPI journals, we can statistically test the hypothesis that researchers from low-income nations form international collaborations more frequently than researchers from relatively wealthier nations.
To test this, we use a series of Kolmogorov–Smirnov tests to test the null hypothesis that the distribution of the number of collaborating nations per paper does not vary across income groups. The Kolmogorov–Smirnov test is a non-parametric test which tests the hypothesis that two distributions are equal by comparing their estimated cumulative distribution functions. We perform pairwise Kolmogorov–Smirnov tests for all possible pairs of income groups for each year from 2014 through 2018 and reject the null hypothesis if the test produces a significant difference at the 5% level. We choose 2014 through 2018 as our time frame because these are the only years in which low-income nations contributed to at least 100 papers published in MDPI journals.
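For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sample Kolmogorov–Smirnov test. It is not the code used for the analysis in this paper. It computes the statistic D as the largest gap between the two empirical cumulative distribution functions and approximates the p-value with the standard asymptotic Kolmogorov series, which is an assumption that works best for large samples; for discrete data with many ties, such as counts of nations per paper, this approximation tends to be conservative. Statistical libraries provide more careful implementations.

```python
import math


def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample KS test."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    d = 0.0
    i = j = 0
    # Evaluate both empirical CDFs just after each distinct pooled value.
    for v in sorted(set(x) | set(y)):
        while i < n and x[i] <= v:
            i += 1
        while j < m and y[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Scale D by the effective sample size.
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The asymptotic series converges slowly here; the p-value is ~1.
        return d, 1.0
    # Asymptotic Kolmogorov survival function (truncated alternating series).
    p = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical samples: number of nations per paper for two income groups.
group_a = [1] * 50
group_b = [3] * 50
d_stat, p_value = ks_two_sample(group_a, group_b)
reject_null = p_value < 0.05  # 5% significance level, as in the text
```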
The results from each year between 2014 and 2018 are consistent and indicate that researchers from low-income nations collaborate internationally more frequently than researchers from wealthier nations. Among the remaining three classifications (high income, upper middle income, and lower middle income), we find no significant difference in the distribution of the number of nations collaborating on papers. These results are summarized below in
Figure 8. Black squares represent rejections of the null hypothesis that the distribution of the number of collaborating nations per paper does not vary across a given pair of income groups in that year. The results are robust to the removal of papers with greater than 20 nations represented, with the lone exception of the distributions for upper-middle- and low-income nations in 2016. We do mention that the distributions for upper-middle- and low-income nations in 2016 were significantly different at the 10% level during the robustness checks.
To depict the collaboration tendencies exhibited by each income group, we present in
Figure 9 a graph which shows both the difference in the volume of publications by income group and the differences in the volume of collaborations involving nations from different income groups.
We have shown that researchers in low-income nations form international collaborations more frequently than researchers in wealthier nations, but are their collaborators from high-income, upper-middle-income, or lower-middle-income nations? To answer this question, we look at the income group of the most frequent collaborator of each nation. To make this analysis more manageable, we divide the world into six regions and study each individually. The regions are: Africa, Asia, Europe, North America (excluding Central America), Oceania, and South and Central America.
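The lookup of each nation's most frequent collaborator can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: pairwise counts are assumed to be stored per unordered pair of nations, the counts and income labels below are hypothetical, and ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import defaultdict


def top_collaborator_groups(pair_counts, income_group):
    """Map each nation to (most frequent partner, partner's income group)."""
    partners = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        # Each unordered pair counts toward both nations involved.
        partners[a][b] = partners[a].get(b, 0) + n
        partners[b][a] = partners[b].get(a, 0) + n
    result = {}
    for nation, counts in partners.items():
        # Highest count wins; ties are broken alphabetically (arbitrary).
        best = min(counts, key=lambda p: (-counts[p], p))
        result[nation] = (best, income_group[best])
    return result


# Hypothetical pairwise paper counts and illustrative income labels.
pair_counts = {
    ("Burkina Faso", "Niger"): 3,
    ("France", "Niger"): 2,
    ("Burkina Faso", "France"): 5,
    ("France", "United States"): 9,
}
income_group = {
    "Burkina Faso": "Low",
    "Niger": "Low",
    "France": "High",
    "United States": "High",
}
top = top_collaborator_groups(pair_counts, income_group)
```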
We begin with Africa. In
Figure 10, we show the income group of the most frequent collaborator of each nation. Blue represents nations which most frequently collaborate with a high-income nation, green represents nations which most frequently collaborate with an upper-middle-income nation, red represents nations which most frequently collaborate with a lower-middle-income nation, pink represents nations which most frequently collaborate with a low-income nation, and white represents nations for which there are no data. No nations collaborated most frequently with a low-income nation over the time period 1996 through 2018. Interestingly, in 2018, the most frequent collaborator of Niger was Burkina Faso, a low-income nation. However, Burkina Faso is a geographic neighbor of Niger and both have French as an official language, thereby making collaborations between these two nations relatively easy to develop. That nations, particularly African nations, form collaborations in part based on geographic proximity and shared language and cultural aspects confirms the existing literature on international collaborations generally [
40,
41,
42,
43,
44,
45] and for Africa specifically [
14].
As can be seen in
Figure 10, the overwhelming majority of nations in Africa collaborated most frequently with a high-income nation. Given that many African nations are categorized as low-income, this provides support that low-income nations not only form international collaborations more often, but that they are also doing so with high-income nations, i.e., with researchers from nations which are most likely to help alleviate the financial constraint in open access publishing.
Next, we turn our attention to Asia. As can be seen in
Figure 11, most nations collaborate most frequently with high-income nations. However, China, a lower-middle-income nation, has a strong collaborative presence throughout Asia. With the exception of Saudi Arabia (which most frequently collaborated with Egypt) and Iran and Iraq (which both most frequently collaborated with Malaysia), all Asian nations that did not most frequently collaborate with a high-income nation instead collaborated most frequently with China. This even includes the high-income nation of Japan. These results are consistent with previous research that looked at international collaborations within Asia [
23]. However, given the rapid expansion of Chinese research over the past several decades, this result is not unexpected and, in fact, still supports the thesis of this paper, since, apart from Japan, the nations which collaborate most frequently with China are low- and lower-middle-income nations, with the lone exception of Kazakhstan, which is categorized as an upper-middle-income nation.
Moving on to Europe (
Figure 12), we again see that most nations collaborate most frequently with high-income nations. The exceptions to this were the UK (China), Bosnia and Herzegovina (Serbia), Montenegro (Serbia), Moldova (Romania), Ukraine (Poland), and Belarus (Russia). With the exception of the UK, all of these examples represent collaboration with a geographic neighbor. Those European nations which most frequently collaborated with high-income nations collaborated most frequently with either other European nations or with the USA.
In North America, which here consists solely of Canada and the USA, we see a story similar to that of Europe. Canada most frequently collaborated with the USA, and the USA most frequently collaborated with China. This latter result fits the existing literature, which has also provided evidence that Chinese nationals in the USA contribute to the formation of a significant percentage of collaborations between the USA and China [
46]. The results are depicted in
Figure 13. While Mexico and the remainder of Central America are typically counted as North American countries, we include them in the conglomerate Central and South American region, as these countries are all relatively similar culturally, economically, and linguistically.
Oceanic nations (
Figure 14) mirror North American nations in that New Zealand most frequently collaborated with Australia, while Australia most frequently collaborated with China.
Lastly, with the exception of Paraguay, every nation in South and Central America most frequently collaborated with a high-income nation. Paraguay, a lower-middle-income nation, most frequently collaborated with its (almost) neighbor Uruguay, an upper-middle-income nation. Consequently, researchers in every nation in South and Central America collaborated most frequently with researchers from a nation that was at least as wealthy as their own. This can be seen in
Figure 15.
Now that we have provided evidence that researchers most frequently form collaborations with researchers from a nation at least as wealthy as their own, with the only exceptions being collaborations between geographic neighbors with cultural similarities, we provide three graphics which help visualize international collaborations in a geographic setting.
Figure 16 is a chart showing the number of papers published in MDPI journals which include at least one author from a given region.
Figure 17 is the regional analog of
Figure 9 superimposed on a map of the world with vertices weighted by the data presented in
Figure 16 and with edges weighted to show the relative frequencies of international collaborations that cross regional boundaries.
6. Conclusions and Future Directions
In this paper, we present the first analysis of a data set that contains data on international collaborations from all papers published in MDPI journals. Considering the various costs of publishing, e.g., financial and labor costs, and considering that MDPI journals are open access and charge publication fees, economics suggests that researchers from lower-income nations would have the most to gain from forming international collaborations.
Our analyses confirmed that this is indeed the case. By using a series of pairwise Kolmogorov–Smirnov tests, we confirmed that not only do national income levels factor into how researchers form international collaborations, but that researchers in low-income nations exhibit behavior that is significantly different from that of researchers in relatively wealthier nations. Specifically, researchers in low-income nations are far more likely to form international collaborations on papers that are published in MDPI journals.
Additionally, we saw that the overwhelming majority of nations most frequently form international collaborations with high-income nations. The few exceptions to this were either (1) nations which collaborated with geographic neighbors that shared immediately obvious cultural aspects such as language, (2) Asian nations which collaborated most frequently with China, or (3) high-income nations which collaborated most frequently with China. In all, researchers from 16 different nations chose to form international collaborations most frequently with China, the same frequency as for Germany and France. Only the USA was more often the most frequently sought nation for forming international collaborations, with 37 different nations collaborating most frequently with the USA.
We then used the six categories of research outlined in the Frascati Manual to determine whether or not this behavior is consistent across disciplines. We found that it is indeed the case that low-income nations form collaborations at significantly higher rates than wealthier nations, again by using a series of pairwise Kolmogorov–Smirnov tests. We also confirmed that publications in MDPI journals have many of the same characteristics as publications in other open access journals.
While these results are a fruitful beginning to studying the formation of international collaborations in open access publications, there are some limitations to this study that should be mentioned. First, we established that low-income nations are more likely than other nations to form international collaborations when publishing in open access journals. This, however, is not the same as saying that low-income nations collaborate more frequently than other nations. Since the topic of this paper was the formation of international collaborations, we did not consider the possibility of several authors or institutions from the same country appearing on a given paper. The raw data collected in this project do provide the opportunity to ask questions about overall rates of collaboration, international or not. We again remind the reader that the data and all associated code can be found in the “2018_Initial” directory within the repository
https://github.com/cat-astrophic/MDPI.
Building directly on this research, it would be interesting to explore the role of income group and region in determining how frequently nations collaborate within the same income group or region versus different income groups or regions. All else equal, do nations tend to form international collaborations with high(er)-income nations in their own region?
Another interesting question building on the research presented in this paper would be to determine whether nations publish in open access (specifically MDPI) journals in proportion to their total number of scholarly publications across all journals, or if certain nations are more likely to publish in open access than other nations. If the latter is true, what factors motivate this behavior?
Throughout this paper, we consistently mentioned the article publication fees associated with publishing in open access journals, such as the MDPI journals which comprise this new data set. While these are only one of many costs in the production of academic publications, determining what effect they have on the formation of collaborations, international or otherwise, would be a significant contribution to the literature.
In addition to these potential future research directions, the data collected in this project present a perfect opportunity for network analyses, as well as the use of spatial econometric/statistical models for studies seeking to determine what socioeconomic factors influence the formation of international collaborations.