Predicting the dative alternation in English:

A case study with logistic regression

Chiara Paolini

Outline

Preliminary steps
The case study

Preliminary steps

1. Know the linguistic phenomenon you want to analyse

…and how people have studied it before

Try to look for:

The pivotal study on the phenomenon you want to study
A couple of qualitative and quantitative investigations, to get a sense of how researchers addressed the phenomenon before you and of what you can add to their research: a new approach, a new perspective, new data analysed with an existing approach, etc.

2. Know your dataset (and your data)

…especially if you did not build the dataset yourself!

Linguistic information:
- where were the observations extracted from, and when?
- do they belong to a specific language register?
- did the authors apply any restrictions/limitations to the dataset?
Statistical information:
- how many observations does the dataset contain, originally and after restrictions?
- how many variables/predictors are annotated?
  - Always keep an eye on how much the dataset changes after you apply your own filters
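
One way to keep an eye on the effect of a filter is to count the observations before and after applying it. The sketch below is only an illustration: the toy rows, the column names and the helper functions `load_rows` and `apply_filter` are invented for the example and are not taken from any real dataset.

```python
import csv
import io

# Toy stand-in for a dataset file; the rows and column names are invented.
RAW = """Variety,Verb,Response.variable
US,give,ditransitive
US,give,prepositional
NZ,give,ditransitive
US,give,ditransitive
GB,give,prepositional
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))


def apply_filter(rows, column, value):
    """Keep only the observations whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


rows = load_rows(RAW)
us_rows = apply_filter(rows, "Variety", "US")

# Report the size of the dataset before and after the filter.
print(f"observations before filtering: {len(rows)}")
print(f"observations after filtering:  {len(us_rows)}")
print(f"share retained: {len(us_rows) / len(rows):.1%}")
```

Repeating this count after every filter makes it easy to spot the step at which most of the data is lost.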

Language internal/external predictors: identify and describe them.
- First things first: the response variable, the type of each variable, and the levels of each variable
- Filters added in previous studies (they can drastically change the results of your replication study!!)
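
A quick way to get to know the levels of each variable is to tabulate them. The following is a minimal sketch under stated assumptions: the three observations are invented, and `describe_levels` is a hypothetical helper, not part of any existing toolkit.

```python
from collections import Counter

# Invented toy observations; the column names only mimic the style of an
# annotated dative dataset.
observations = [
    {"Response.variable": "ditransitive", "Recipient.type": "personal pronoun"},
    {"Response.variable": "prepositional", "Recipient.type": "noun phrase"},
    {"Response.variable": "ditransitive", "Recipient.type": "personal pronoun"},
]


def describe_levels(rows, column):
    """Return a Counter mapping each level of `column` to its frequency."""
    return Counter(row[column] for row in rows)


# Print the levels and their frequencies for every annotated column.
for column in observations[0]:
    levels = describe_levels(observations, column)
    print(f"{column}: {len(levels)} levels -> {dict(levels)}")
```

Levels with very few observations are worth noting at this stage, because they are the ones most affected by later filters.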

3. Understand the choice of the analysis

…based on the research questions and data you have

Why did the authors choose this analysis (or these analyses)?
Which questions do these analyses answer? Do they address specific aspects of the linguistic phenomenon, or are they more general?
Did these analyses introduce a methodological innovation in the field, or had they already been used (see 1)?

The case study

Starting point: Szmrecsanyi et al. (2017)

This paper investigates two well-known alternations in English, the dative and the genitive alternation, in four varieties of spoken English. An ensemble of statistical analyses is employed to understand the extent to which the probabilistic grammar of genitive and dative variant choice differs across varieties.

Our mini-replication study will focus only on spoken American English, and only on the dative alternation.
The goal of my analysis is slightly different from the original one: to get familiar with the so-called traditional, top-down, manually annotated predictors for the dative alternation, and to see how well they predict the choice between the two variants in spoken American English.
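
To make "predicting the choice between the two variants" concrete: a logistic regression turns a weighted sum of predictors into a probability between 0 and 1. The sketch below is only an illustration. The intercept and coefficients are hypothetical numbers chosen for the example, not estimates from any study, and coding the prepositional dative as the outcome 1 is an assumption.

```python
import math


def predicted_probability(intercept, coefficients, predictors):
    """Probability of the outcome coded as 1 under a logistic regression.

    The log-odds are the intercept plus the sum of coefficient * value;
    the logistic function maps the log-odds onto a probability.
    """
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in predictors.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values for illustration only; NOT estimates from real data.
# Assumption: outcome 1 = prepositional dative, outcome 0 = ditransitive.
intercept = 0.0
coefficients = {"recipient_is_pronoun": -1.0, "length_difference": 1.0}

# One invented observation: pronominal recipient, equally long constituents.
observation = {"recipient_is_pronoun": 1, "length_difference": 0.0}

p = predicted_probability(intercept, coefficients, observation)
print(f"predicted probability of the prepositional dative: {p:.3f}")
```

In a real analysis the coefficients are estimated from the annotated data rather than set by hand; the function only shows how a fitted model converts predictor values into a predicted probability.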

The dative alternation

(1) a. Ditransitive dative variant

[The waiter]_subject [gave]_verb [my cousin]_recipient [some pizza]_theme

b. Prepositional dative variant

[The waiter]_subject [gave]_verb [some pizza]_theme [to my cousin]_recipient

What is an alternation? See Pijpops (2020) and Gries (2017).

Predicting the DA: the dataset

A core part of every variationist study is the dataset and its annotation: Szmrecsanyi et al. (2017) present two comprehensive, consistently and manually annotated datasets for the two alternations.

The paper offers a very good and detailed description of the dataset and its construction: read it for inspiration for your own study!

Predicting the DA: the dataset

Linguistic information

The dative tokens for American English were extracted from the Switchboard corpus of American English (Godfrey, Holliman & McDaniel 1992), as described in Bresnan et al. (2007). The Switchboard corpus covers telephone conversations collected at the beginning of the 1990s.
This dataset contains only observations in which give is the verb of the dative construction.
The collection follows the guidelines of Bresnan et al. (2007) for defining interchangeable ditransitive and prepositional dative variants: only instances of the verb give with two argument noun phrases were considered, and non-interchangeable constructions were excluded.

Predicting the DA: the dataset

Statistical information

The original dative dataset contains 4136 observations, 1190 of which belong to the American English section, and is manually annotated for 25 predictors.

Predicting the DA: the dataset

Language-external predictors

Variety: in our case, US - one level
Speaker.ID
Speaker sex (only for a subset of observations)
Speaker year of birth (only for a subset of observations)

Predicting the DA: the dataset

Language-internal predictors: the authors annotated for well-known determinants of dative variation.

Response.variable: Ditransitive dative versus prepositional dative.
Recipient/Theme.type: The annotation distinguishes between the following categories: (1) noun phrase; (2) personal pronoun; (3) demonstrative pronoun; (4) impersonal pronoun.
Recipient/Theme.definiteness: The annotation distinguishes between the following categories: (1) definite; (2) indefinite; (3) definite proper noun.
Recipient/Theme.animacy: The annotation distinguishes between the following categories: (1) human and animal; (2) collective; (3) temporal; (4) locative; (5) inanimate.
Recipient/Theme.length: Length of the recipient and theme phrases in orthographically transcribed words.
Semantics (of dative verb): (1) transfer; (2) communication; (3) abstract.
Recipient/Theme.head: Head lexeme of both the theme and the recipient.

Predicting the DA: the dataset

Language-internal predictors: further manipulation

Reducing the predictors to binary contrasts:
- Recipient/Theme.type were reduced to pronominal ([2], [3], [4]) versus non-pronominal ([1])
- Recipient/Theme.definiteness were reduced to definite ([1], [3]) versus indefinite ([2])
- Recipient/Theme.animacy were reduced to animate ([1]) versus inanimate ([2], [3], [4], [5])
Creating the new predictor Length.difference: the Recipient/Theme.length measures were combined into a relative measure of length, calculated as log(Recipient.length) - log(Theme.length).
Recipient/Theme.lemma: the annotated lemma of the heads.
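
The recoding steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the spelling of the category labels are assumptions, and the logarithm is assumed to be the natural logarithm, since the base is not stated.

```python
import math

# Original annotation levels that map onto the first member of each binary
# contrast (label spellings are assumed for this example).
PRONOMINAL_TYPES = {"personal pronoun", "demonstrative pronoun", "impersonal pronoun"}
DEFINITE_LEVELS = {"definite", "definite proper noun"}
ANIMATE_LEVELS = {"human and animal"}


def recode_type(level):
    """Collapse the four type categories into pronominal vs. non-pronominal."""
    return "pronominal" if level in PRONOMINAL_TYPES else "non-pronominal"


def recode_definiteness(level):
    """Collapse the three definiteness categories into definite vs. indefinite."""
    return "definite" if level in DEFINITE_LEVELS else "indefinite"


def recode_animacy(level):
    """Collapse the five animacy categories into animate vs. inanimate."""
    return "animate" if level in ANIMATE_LEVELS else "inanimate"


def length_difference(recipient_length, theme_length):
    """log(Recipient.length) - log(Theme.length), lengths counted in words."""
    return math.log(recipient_length) - math.log(theme_length)


# Invented example: a one-word recipient and a three-word theme.
print(recode_type("personal pronoun"))
print(recode_definiteness("definite proper noun"))
print(recode_animacy("collective"))
print(round(length_difference(1, 3), 3))
```

A negative Length.difference means the recipient is shorter than the theme, a positive value means it is longer, and zero means the two phrases have the same length.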