LSA 125: Psycholinguistics and syntactic corpora

T. Florian Jaeger | session 1 | TuTh 3:30 – 5:15, 370 Dwinelle Hall

This class focuses on the recently emerging use of syntactic corpora for research on language production (the methods taught should also be relevant to other research questions, e.g. in sociolinguistics or language comprehension). To communicate, speakers need to encode an intended message into linguistic structure and, ultimately, into a stream of motor commands for articulation. During this incremental process, there are often several ways to continue an utterance that are compatible with the intended message. By studying the variables that affect these choice points, we can gain insights into the representations and mechanisms underlying language production.

Syntactic corpora provide a great way to investigate structural choices, such as morphosyntactic reduction (e.g. is not vs. isn't; want to vs. wanna; etc.), word order alternations, and syntactic reduction (to-deletion: Can anyone help me (to) pack my suitcase?; object drop, etc.). The goal of this class is to provide students with the necessary tools to extract and analyze syntactic data from corpora (parsed and unparsed). The class begins with an overview of available syntactic corpora and search tools, and introduces students to one of these search tools (TGrep2). While corpora, particularly spoken corpora, can be a great source for naturally distributed data, the gain in ecological validity comes with challenges for statistical analysis. These will be addressed in class (multiple regression; multilevel models). The class will walk students step by step through examples of corpus-based research on language production. Students will be required to conduct their own small studies. Where possible, we will also use data from languages other than English (e.g. optional case marking in Japanese, or classifier substitution in Mandarin Chinese).

Note: Two extra evening sessions will provide students with the required background on regression modeling in the software package R (used in class).

Reading: Selected materials available online.

Prerequisites: Familiarity with basic statistical concepts (statistical inference, probability distributions) will be assumed. Scripting experience will be advantageous, especially for students' projects, but is not required. A short crash course in UNIX/Linux shell syntax, necessary to conduct TGrep2 syntactic searches, will be given in class, but students need to be aware that efficient corpus work involves scripting and shells. The class will assume familiarity with basic concepts of syntax (what is a VP, what is an NP, etc.).

Areas of linguistics: Language development and psycholinguistics; Syntax, semantics, and morphology

