Article
Peer-Review Record

Reanalyzing Variable Agreement with tu Using an Online Megacorpus of Brazilian Portuguese

Languages 2024, 9(6), 197; https://doi.org/10.3390/languages9060197
by Scott A. Schwenter *, Lauren Miranda, Ileana Pérez and Victoria Cataloni
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 4 November 2023 / Revised: 15 March 2024 / Accepted: 19 May 2024 / Published: 28 May 2024
(This article belongs to the Special Issue Investigating Language Variation and Change in Portuguese)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

My first concern with this paper is that the authors make a number of broad and sometimes sweeping claims that are not supported by the research cited in the article. Examples include comments like "datasets ... are not uniform in size or methods of collection", "individuals can heavily skew datasets", "informal conversation ... includes copious amounts of a highly frequent verb lexeme like ser but few others," and "patching process by which native speakers simply add an -s to the forms". No research is cited to back up these statements. The study would be much more convincing without them, or with facts and citations that support them. I have marked many of these areas in the manuscript.

A second concern I have is that the authors go to great lengths in the introduction to explain how 2SG agreement/non-agreement is region-specific ("from rather sporadic in dialects of Rio de Janeiro to nearly exclusive usage of tu in the dialect of Porto Alegre"); however, there is no consideration of region in their analysis. Where are the speakers from in the online mega corpus they have used? Presumably from all over Brazil? If 2SG agreement is region-specific, then why hasn't the speaker's region of origin been factored into the analysis, if not as a fixed effect, then as a random effect? I see this as a major flaw in the analysis that should be addressed. Perhaps the authors held region constant in their data selection from the mega corpus, but they don't say how they controlled for this. Either way, the authors need to address how the speakers' regions of origin may be impacting the results.

A third major concern is how the authors control for genre and topic. The mega corpus they used covers a wide variety of genres (spoken, written, blogs, discussion, etc.) and topics (economics, finance, games, hobbies, etc.). The authors state: "we do not claim that the data analyzed are necessarily representative of either spoken or written BP." So what then does their study represent? No mention is made of how they controlled for this or whether all genres and all topics are simply lumped together. If the latter is true, then this could be a problem, particularly based on some of the prior research they cite, which has shown differences in 2SG agreement based on genre (spoken versus written).

A final concern I have is that no information is provided on the speakers in this dataset. Are they all of the same age or different ages? Genders? Educational levels? By mixing all of these social factors together, you don't really know what you have. Are the "400 tokens (100 per finite form) for each of the 20 verbs" spread evenly across the major social categories that have been shown to influence linguistic variation? Or are the data somehow skewed to one age group or one educational level? The authors leave us wondering about this. I realize that the authors have done a linguistic study and not a sociolinguistic study; however, they are comparing their findings to those of other sociolinguistic studies, which is not an apples-to-apples comparison.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

The written English needs polishing. There are a number of awkward sentence constructions and some odd choices of words. These can easily be fixed up with a good editor.

Author Response

Our replies in BOLD

A second concern I have is that the authors go to great lengths in the introduction to explain how 2SG agreement/non-agreement is region-specific ("from rather sporadic in dialects of Rio de Janeiro to nearly exclusive usage of tu in the dialect of Porto Alegre"); however, there is no consideration of region in their analysis. Where are the speakers from in the online mega corpus they have used? Presumably from all over Brazil? If 2SG agreement is region-specific, then why hasn't the speaker's region of origin been factored into the analysis, if not as a fixed effect, then as a random effect? I see this as a major flaw in the analysis that should be addressed. Perhaps the authors held region constant in their data selection from the mega corpus, but they don't say how they controlled for this. Either way, the authors need to address how the speakers' regions of origin may be impacting the results.

We cannot control for region using the PtTenTen corpus. However, as we state, the results of other research by region are extremely variable, and there are no clear patterns beyond "Region X uses more agreement than Region Y", where "more agreement" can mean anywhere from 10% to 60% agreement, which is a huge range. Our point is that by not controlling for other internal linguistic factors (verb, finite form, construction type, frequency), these regional findings can be extremely misleading.

A third major concern is how the authors control for genre and topic. The mega corpus they used covers a wide variety of genres (spoken, written, blogs, discussion, etc.) and topics (economics, finance, games, hobbies, etc.). The authors state: "we do not claim that the data analyzed are necessarily representative of either spoken or written BP." So what then does their study represent? No mention is made of how they controlled for this or whether all genres and all topics are simply lumped together. If the latter is true, then this could be a problem, particularly based on some of the prior research they cite, which has shown differences in 2SG agreement based on genre (spoken versus written).

We did not control for region or topic, except in the case of 'crer', which we note came nearly exclusively from religious contexts. We state that this is a necessary control in future studies; however, it was difficult if not impossible to do in the 2018 corpus. The new 2020 corpus offers much more detail in the register/genre selections, so future researchers could apply this control. Our results represent web-based BP, as we have tried to clarify in the text, and we think that these results (and the factors analyzed) need to be taken into account in future studies of the phenomenon.

A final concern I have is that no information is provided on the speakers in this dataset. Are they all of the same age or different ages? Genders? Educational levels? By mixing all of these social factors together, you don't really know what you have. Are the "400 tokens (100 per finite form) for each of the 20 verbs" spread evenly across the major social categories that have been shown to influence linguistic variation? Or are the data somehow skewed to one age group or one educational level? The authors leave us wondering about this. I realize that the authors have done a linguistic study and not a sociolinguistic study; however, they are comparing their findings to those of other sociolinguistic studies, which is not an apples-to-apples comparison.

This cannot be controlled in the corpus we used, but we used random sampling of the extracted verb forms in order to mitigate this effect. Once again, however, the results in prior research by speaker characteristics have been inconsistent and it is not clear that these characteristics account for the variation any better than the linguistic constraints that we analyze. So while we agree that it's not an apples-to-apples comparison, we submit that the failure of prior studies to include the internal linguistic factors is a flaw that needs to be corrected in the future.
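As an illustration only, the kind of random sampling described in this reply might be sketched as follows. This is not the authors' actual procedure: the function name `sample_per_cell`, the `(verb, finite_form, text)` tuple layout, and the choice to keep a cell whole when it has fewer hits than the target are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_cell(tokens, per_cell=100, seed=0):
    """Draw up to `per_cell` random tokens for each (verb, finite form) cell.

    `tokens` is an iterable of (verb, finite_form, text) tuples; this layout
    is hypothetical and does not reflect the corpus export format.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the extracted hits by verb lexeme and finite form.
    cells = defaultdict(list)
    for verb, finite_form, text in tokens:
        cells[(verb, finite_form)].append(text)

    sample = {}
    for key in sorted(cells):  # sorted so the result does not depend on input order of cells
        hits = cells[key]
        # Assumption: a cell with fewer hits than the target is kept whole.
        k = min(per_cell, len(hits))
        sample[key] = rng.sample(hits, k)  # sampling without replacement
    return sample
```

Sampling within each cell keeps any single high-frequency verb or form from dominating the dataset, but it does not balance the sample across speakers, regions, or genres, which is the limitation the reviewer raises.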

Reviewer 2 Report

Comments and Suggestions for Authors

The paper titled "Reanalyzing Variable Agreement with tu in Brazilian Portuguese" presents a comprehensive study on the variable agreement patterns between the second-person singular pronoun 'tu' and verb forms in Brazilian Portuguese (BP). It diverges from previous research, which primarily relied on sociolinguistic interviews, by analyzing a vast array of data from an online megacorpus. This approach allows for a more detailed examination of internal linguistic factors influencing these agreement patterns.

The study analyzes 4,860 instances of 'tu' followed by a verb, revealing that non-agreement with the third-person singular (3SG) verb form is significantly more common, with second-person singular (2SG) agreement being relatively rare. The research highlights that individual verb lexemes exhibit markedly different rates of agreement and non-agreement. Factors like specific tense/aspect/mood forms, verb lexeme frequency, and the verb's role as a main or auxiliary verb are also significant in affecting variation.

The paper situates this study within the context of other Romance linguistics studies, demonstrating how individual verbs and constructional patterns can significantly influence morphosyntactic variation. The research underscores the importance of considering these internal linguistic factors in future studies on this phenomenon.

For improvements, the paper could benefit from:

  1. Introduction:

    • Include a brief overview of previous findings on variable agreement with 'tu' in other Romance languages for comparative context.
    • Clarify the research questions and objectives early in the section for better reader engagement.
  2. Methods:

    • More detailed explanation of the selection criteria for the online megacorpus to establish the representativeness of the data.
    • Elaborate on the limitations of the corpus analysis method, including potential biases in online data.
  3. Results:

    • Include more detailed discussion of outliers or unexpected findings in the data.
    • Enhance the visualization of data for clarity, perhaps through more comprehensive charts or graphs.
  4. Discussion and Conclusions:

    • Expand on the implications of the findings for understanding language variation and change in Brazilian Portuguese.
    • Discuss how these results could influence language teaching or prescriptive grammar approaches in Brazil.
    • Suggest specific areas for future research based on the gaps identified in this study.

General Suggestions:

  • Ensure consistent terminology and definitions throughout the paper.
  • Include a more critical analysis of how the study's findings align or contrast with existing literature.
  • Enhance the clarity and conciseness of the writing style, especially in the introduction and conclusion sections.
  • Consider adding a section or discussion on the socio-cultural implications of the findings, if relevant.
  • Include a more thorough examination of potential methodological limitations and how they might affect the interpretation of the results.

Author Response

We have revised the paper in accordance with the suggestions of the reviewer.
