University of Ottawa, Ottawa, Ontario, Canada, March 13-14, 1983
Sunday, March 13
Monday, March 14
The growing interest in oral proficiency during the past decade prompted Educational Testing Service to undertake the development of the Test of Spoken English (TSE), a standardized test of oral proficiency for non-native speakers of English. The test has been validated for the selection of non-native teaching assistants applying to American universities.
The objectives of this study were to
(a) validate the TSE for the selection and certification of non-native health professionals,
(b) establish a set of procedures for determining standards of proficiency on the TSE, and
(c) pilot test these procedures in selected health professions in which the test could be used.
Groups of judges who were practitioners in four health professions--nursing, pharmacy, medicine, and veterinary medicine--judged the acceptability of the speaking performance of examinees who had taken the TSE. Each group was asked to provide ratings, for its own profession, in three distinct work situations, such as communicating with professional colleagues, teaching in the field, or working in a hospital. Judges were asked to indicate whether each examinee was at least minimally acceptable to function in each situation. Judges' ratings were then related to the examinees' TSE scores. A larger group of consumers of medical care gave more global ratings to the same examinees for each of the four professions.
Alternative methods were applied to determine possible ranges of cutoff scores on the TSE. One of these methods involved a consideration of the consequences of different decision outcomes, e.g., whether the certification (or licensing) of an unacceptable speaker is a more serious or less serious error than failing to certify an acceptable speaker. A range of possible cutoff scores was computed for each situation within each profession.
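The idea of weighing the consequences of the two kinds of decision error can be illustrated with a minimal sketch. It is not the procedure used in the study, which the abstract does not specify: the function name, the loss weights, and the sample data below are all hypothetical. The sketch simply picks the cutoff score that minimizes a weighted count of wrongly passed and wrongly failed examinees.

```python
def best_cutoff(scores, acceptable, cost_false_pass=1.0, cost_false_fail=1.0):
    """Return (cutoff, loss) minimizing the weighted count of decision errors.

    scores     -- test scores, one per examinee (hypothetical data)
    acceptable -- parallel list of booleans: True if judges rated the
                  examinee at least minimally acceptable
    An examinee passes when score >= cutoff.
    """
    candidates = sorted(set(scores))
    # Also consider a cutoff above every score (nobody passes).
    candidates.append(max(scores) + 1)
    best = None
    for cutoff in candidates:
        loss = 0.0
        for score, ok in zip(scores, acceptable):
            passed = score >= cutoff
            if passed and not ok:
                loss += cost_false_pass   # certified an unacceptable speaker
            elif not passed and ok:
                loss += cost_false_fail   # failed an acceptable speaker
        # Strict comparison keeps the lowest cutoff among ties.
        if best is None or loss < best[1]:
            best = (cutoff, loss)
    return best


# Hypothetical scores and judge ratings, for illustration only.
scores = [150, 180, 200, 220, 250, 270]
acceptable = [False, False, True, False, True, True]

equal_costs = best_cutoff(scores, acceptable)
strict_on_false_pass = best_cutoff(scores, acceptable, cost_false_pass=3.0)
```

In this toy data set, treating the certification of an unacceptable speaker as the more serious error moves the selected cutoff upward. Running the function over a range of plausible weights yields a range of cutoffs rather than a single value.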
The results of the study support the interpretation of the TSE as a valid measure of oral language and provide some progress towards the setting of appropriate standards of oral language proficiency within the health-related professions.
Canale and Swain, in 1979, had proposed that grammatical competence and sociolinguistic competence would be two of the three areas, or "systems of knowledge", to be included in a theoretical framework of communicative competence. Two direct tests of oral skills, a simulated telephone conversation and a subsequent job interview, were designed for senior secondary EFL students in Hong Kong, in an effort to construct meaningful and valid measures of communicative ability. Various devices were used to elicit testees' performance with respect to specific grammar points, selected vocabulary items, and relevant "sociocultural rules of use", a component of Canale and Swain's sociolinguistic competence. It was felt that two types of rating scales were needed for assessment, graded scales of performance and a checklist of the sociocultural rules observed. The graded scales were used to assess testees' performance in areas such as pronunciation, fluency, grammatical structures, vocabulary, and to record the overall impression of the raters. Two markers were employed on each testing occasion, and a total of 50 students from five different schools and two grade levels were examined. To obtain sociolinguistic data on the testees' background, questionnaires were administered after the tests.
Statistical analysis of the data shows that the graded scales and the checklist recorded different dimensions of the testees' performance and suggests that in a communication test, both integrative and discrete-point marking are necessary. Furthermore, results show that the graded scales all seem to be measuring a single aspect of the testees' performance, and hence a global impressionistic judgment could replace detailed assessments in carefully defined categories of grammatical competence for ordinary classroom purposes at least. Inter-rater reliability has also been demonstrated for the tests.
The recent arrival in the U.S. of large numbers of non- or limited-English-speaking refugees from Southeast Asia and other areas has created the need for development of performance-based testing procedures to assess the refugees' ability to accomplish basic "survival" tasks in English, including, for example, giving basic autobiographical information orally, telling time, asking for and giving directions, shopping for food, reading printed caution signs and other similar "sight-word" materials, and filling out simple data forms.
This paper describes the development, trial administration, and final revision of two such test batteries, one for use in intensive pre-departure training and one for use in post-arrival training in the United States. The considerations involved in defining the content areas to be covered in the tests are described, as well as the procedures followed in developing testing formats and question types capable of assessing examinee performance in a face- and process-valid way within the stringent and less than optimum administration conditions under which the tests would need to be used. The testing approach finally adopted (booklet-mediated face-to-face interview) is described in detail, as well as the individual tasks represented in the test. Also described is the use of listening comprehension, fluency, communication, and reading/writing criteria against which individual item performance was evaluated for final item selection. The paper concludes with a discussion of the obtained internal test reliabilities, scoring reliability, concurrent validation data based on instructor judgment of overall language proficiency, and a description of plans for follow-up validation studies.
Researchers have noted that research in second language acquisition has been hampered by a lack of valid criterion measures of language proficiency (Larsen-Freeman and Strom, 1977). Indeed, there is little agreement on what constitutes language proficiency or what aspects of language proficiency various types of tests measure. A case in point is the frequently made distinction between discrete-point and integrative tests, along with the models of language acquisition and performance that have been developed on the assumption that this distinction is valid.
Bachman and Palmer (1982) examined several theoretically plausible models of communicative proficiency and found that the model which best explained their data included three factors: a general factor, a linguistic-pragmatic competence factor, and a sociolinguistic competence factor. In this study, we examine the relationship of a number of background variables and learner characteristics to the language competencies previously identified and to performance on different kinds of tests (communicative and non-communicative).
Background variables include length of stay in an English speaking country, amount of exposure to an acquisition intensive environment, amount of exposure to a formal learning intensive environment, and age of first exposure to English. Learner characteristics include age, degree of integrative motivation (desire to acculturate), degree of instrumental motivation, and extent of perceived use of conscious rules.
Specific hypotheses regarding the relationship between type of motivation and different language competencies and between environment of exposure to English and performance on different types of tests are tested.
In the autumn of 1982 a new listening performance test was developed for use in standardized testing programs at the English Language Institute. The test involves the interpretation of stressed elements in English sentences. For example, subjects hear sentences such as the following:
A. I wanted _sugar_ in my tea.
B. Does the movie start at nine _A.M._?
C. No, Jane's _brother's_ a novelist...
The underlined words are spoken with emphatic stress, and the subjects are asked to choose a response from among three alternatives. Subjects must decide, in successive sections, what the speaker probably meant (as in example A, the speaker didn't want cream), what the speaker wanted to know (as in example B, did the movie start in the morning or in the evening), and what the speaker would probably say next (as in example C, ...not her husband).
This paper will be a discussion of the theoretical background to the test with reference to pragmatic theory, and a report on the statistical analysis of the results of some 500 test candidates. Suggestions will be made for the improvement of the instrument, especially in the area of pragmatic validity.
Finding effective procedures for testing reading comprehension (RC) is important for teachers, administrators, test developers and researchers. Most RC tests in use today rely on the multiple choice (MC) procedure, which requires test-takers to select an answer from several alternatives. This gives the tester an indication of the extent to which the test-taker can comprehend a given text. Other testing procedures for RC, such as open-ended questions, summaries and translations, are used as well, although not as commonly as MC because the scoring is not as efficient. MC tests have been criticized on the grounds of a strong method effect over the trait of RC, the difficulty of writing good questions, and the need to pre-test items before they can be used efficiently. The language that should be used on RC tests is also a controversial issue. In most RC tests the questions are presented in the second language. Thus a student's wrong answer may result not from failing to understand the text but from failing to understand the questions. Some testers suggest giving the questions in the first language of the test-taker. These are only some of the questions that arise in the attempt to find effective ways of testing RC.
The paper reports on a large-scale study which examined and compared six common procedures for testing RC in order to find out to what extent the method used for testing affected the students' RC scores. The six procedures are: 1) MC - questions in L-2 (English); 2) MC - questions in L-1 (Hebrew); 3) open-ended questions in L-2; 4) open-ended questions in L-1; 5) summary in L-2; 6) summary in L-1. The six methods were all based on the same texts - fifteen short passages and two longer ones, and were administered to a sample of 1900 randomly distributed high school students studying EFL in Israel. The results of the analysis provide information on the effect of the methods of testing RC on the trait of reading comprehension.
Recent studies in second language testing have emphasized the unitary nature of adult ESL students' underlying linguistic competence in the second language (Bachman and Palmer, 1981; Oller and Hinofotis, 1980; Oller, 1979). However, ESL test practitioners and instructors typically observe differences in student test performance among the major skills of reading, speaking, listening, and composition. The study to be reported involves 593 international students who attend a major American university, including January, May, and August 1982 test groups subsequently divided into four ESL proficiency groups from low to high. The study investigates the following questions.
1. Does a factor analysis of six ESL subtests (oral interview, composition, listening, grammar, vocabulary and reading) result in a single underlying factor which may be inferred to reflect global second language competence?
2. Across four ESL proficiency level groups, do analyses of variance reveal significant differences among ESL skills, e.g., listening and composition? Within each proficiency group, do t-tests for non-independent groups reveal significant differences among major skills? Which ESL skills are most closely associated; which diverge most from one another, according to correlation patterns and t-test results?
3. Given a spectrum of test types, from the integrative and subjectively-rated oral interview and composition to the discrete point grammar and vocabulary measures, what do linguistic error analyses reveal about the competence versus performance of six randomly-selected students (representatives of the three lowest ESL proficiency groups)? What do such analyses reveal about the relationships between type of test and individual performance?
An oblique factor analysis employing reliabilities on the diagonal and an alpha factor analysis with varimax rotation were conducted with raw scores, with January, May and August test groups treated separately. Both procedures reveal a single factor for all test groups. However, analyses of variance reveal significant differences between proficiency groups, and t-tests reveal significant differences among skills within proficiency groups, as great as 50 percentage points between subtest scores. Although the subtest types differ, speaking and listening are the most closely associated skills consistently across January, May and August groups; speaking and reading consistently diverge the most. Error analyses suggest (i) some specific improvements needed in both integrative and discrete point subtests and (ii) a narrowly-defined grammatical competence superseded by a larger communicative competence in relation to these major test types.
This paper updates the research we presented at the 4th Annual Colloquium on Language Testing Research. There we reported on two pilot studies bearing on Jim Cummins' (OISE) Language Interdependence Hypothesis. This widely influential hypothesis asserts that literacy-related aspects of language proficiency--such as those involved in a formal writing task--are interdependent across an individual's first and second (or other) languages, i.e. manifestations of a common underlying proficiency. Our pilot revealed a significant but modest overall correlation (r = .33, p < .01; N = 106) between holistic scores on narratives in French (L1) and English (L2) written by the same Grade 9 and 10 students in advanced language arts classes in four French-language secondary schools in Ontario.
The present paper re-examines Cummins' hypothesis in light of results of our major study of students writing in French and English. Several improvements have been incorporated into this study. For example, the overall sample includes 1722 students enrolled in general and advanced Grade 9 and 10 classes in nine French-language secondary schools throughout Ontario; in both French and English, students wrote in two different genres (first-person narrative and exposition of a procedure); finally, the same four trained scorers have independently evaluated each of the four texts (two genres X two languages) for a randomly selected subsample of 50 students. As in the pilot study, each test has been scored in two different ways on two different occasions: first, holistically in terms of overall quality and second, analytically with reference to a five-level scale for each of five general criteria (language usage, norms for written text, individual expression, unity of forms and ideas, and effectiveness as an act of communication).
Three types of results are to be reported. First, inter- and intrarater reliability estimates will be examined for each scoring procedure. Second, two validity analyses will be discussed: one concerned with the distinctions drawn among the five general analytic criteria, the other with the relationship between the five analytic scores and the holistic score. Finally, we will examine the relationship between students' L1 and L2 writing with respect to genre, holistic scores and analytic scores.
Slobin and Welsh (1973) have demonstrated the usefulness of the task of elicited imitation (EI) for assessing children's linguistic competence. Use of this procedure was based on the premise that sentence imitation is filtered through the child's linguistic system. They argued that since EI is performed without communicative intent or contextual reference, the task strains a learner's abilities and provides a conservative estimate of linguistic competence.
Slobin and Welsh's study dealt with very young children. As children become older, their short term memory span may enable them to repeat a sentence by rote, without full comprehension. When a delay is introduced, however, children apparently are less able to repeat sentences they cannot comprehend (McDade, Simpson, & Lamb, 1982). Spitze and Fischer (1981) demonstrated the usefulness of EI in assessing global linguistic skills of adults learning English as a second language. Since adults appear to have a greater short-term memory capacity than children, a delay should probably be introduced in the EI task. Further, if the goal is to create a diagnostic test of specific syntactic structures, the stimuli need to be refined.
The Syntax Specific Test, which is designed to test specific syntactic structures of English, is a visually presented EI task which incorporates a delay in its administration since the response is in written form. The test was administered to 53 non-native speakers (enrolled in three different college-level intensive ESL programs), 47 American college students and 116 hearing-impaired American college students. Results of this test administration indicate that non-native and hearing-impaired subjects reconstructed the stimuli during the EI task. This paper will argue 1) that having subjects write their responses provides delay and thus adequate control for rote memorization and 2) that the errors produced are a result of grammatical processing and thus reflect grammatical competence.
References
McDade, H., Simpson, M., and Lamb, D. The use of elicited imitation as a measure of expressive grammar: A question of validity. Journal of Speech and Hearing Disorders. 1982, 47, 19-24.
Slobin, D. and Welsh, C. Elicited imitation as a research tool in developmental psycholinguistics. In C. Ferguson and D. Slobin (Eds.) Studies of child language development. New York: Holt, Rinehart and Winston, 1973.
Spitze, K. and Fischer, S. Short term memory as a test of language proficiency. TESL Talk, 1981, 12, 32-41.
A new kind of exercise requiring students to read an entire text in which sentences became progressively blurred was designed to measure their overall English proficiency as well as to help students develop good reading skills. But is such an exercise a valid and reliable measure of a student's English proficiency?
Over two hundred students in a university intensive English program took a BLUR exercise, a composition test, and several standardized tests measuring listening, structure, vocabulary, reading and writing skills. The students also answered a questionnaire on two pragmatic tests and the BLUR exercise to determine whether they 1) found each measure interesting, 2) tried to do their best, and 3) considered the measure to be a valid measure of their English proficiency.
Pearson product-moment correlations and factor analyses using principal factor with iterations provided the following information:
1) Correlations of the BLUR exercise with specific skills tests were positive and moderately high, ranging from .50 to .70 (p < .01). The BLUR also correlated positively and highly (.70, p < .01) with the exit test battery average, indicating that the BLUR was a good measure of overall proficiency.
2) The BLUR loaded positively on the principal factor, loading more highly than the vocabulary and reading subscores of one standardized test and the listening subscore of another.
Cross-tabulations of questionnaire results with class and with native language revealed that the students found the BLUR exercise and the pragmatic tests interesting, tried to do their best on them, and felt that each measured their proficiency in English.
Error analyses showed that the students make many of the same errors when doing the BLUR exercise as when completing other tasks which require writing and speaking in English.
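For readers unfamiliar with the statistic, the Pearson product-moment correlation reported above can be computed as in the following minimal sketch. The function name is illustrative and no data from the study are reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

The function would be called with two parallel lists of scores, for example one list of BLUR scores and one list of scores on another test for the same students.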
Rasch Model latent trait analysis has recently been applied successfully to foreign language test development. Among the advantages of this procedure have been (1) elimination of boundary effects, (2) identification of invalid respondents, (3) sample-free item measures, and (4) test-free person measures. Cloze testing has long been useful as a measure of global language proficiency and as a means of observing developmental error trends.
The present study aims to Rasch-calibrate cloze item difficulty for native speakers of English and for Egyptian EFL learners at two proficiency levels -- about 100 university students in the total sample. One hundred random cloze deletions will enable comparative difficulty analysis by language background and proficiency level. It is hoped that this procedure will introduce a response-probabilistic paradigm for the investigation of developmental trends in language acquisition.
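As background, the dichotomous Rasch model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both measured in logits. The short sketch below illustrates only that response function; it does not perform calibration, and the function names are illustrative rather than taken from the study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, difficulties):
    """Expected number of correct responses over a set of items."""
    return sum(rasch_probability(ability, b) for b in difficulties)
```

When ability equals difficulty the probability is one half, and it rises toward one as ability exceeds difficulty. Because only the difference enters the formula, item difficulties and person abilities share a single scale.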
This paper reviews twenty-five years of investigation in first and second language acquisition and addresses the implications of this research for the construction of second language proficiency measures. The main issues of language acquisition center on three topics: (1) the psychological processes involved in the development of both competence and performance, (2) the description of developmental stages depending on discrete syntactic/morphological features, and (3) the ways in which psychological processes interact with language input in the acquisition of the target language.
An exploration of these issues leads to an integrative theory of second language acquisition which provides the following implications for the construction of language proficiency measures suitable for both research and adequate placement of students in instructional programs:
(1) Comprehension is probably a more valid measure of competence than either imitation or elicited production.
(2) The sampling of syntactic/morphological features must be large enough to account for the integrative nature of the acquisition process.
(3) Multiple choice items for secondary school students and adults can be written to reflect documented stages of target language development through a careful selection of appropriate distractors.
In trying to identify relevant dimensions of oral proficiency (in English L2 as much as in any other foreign/second language) it is important to go beyond the speech act level and also investigate the interactional structure of the learner/native speaker discourse. In my paper I will suggest that the ability to give and elicit feedback in its broadest possible sense is one such dimension which contributes to the development of individual differences among learners. This specific ability has hardly been studied in any detail so far (see, for example, Caies 1981, or programmatically, Perdue 1982). Theoretically it is considered to be a valid concept within the broader framework of social interaction as well as within a modified input hypothesis. Empirically this construct seems to be rather easily accessible insofar as feedback processes (being the verbal or non-verbal indicators of the underlying feedback ability) are to be found in any natural or elicited interactional language data like conversations, interviews, role-plays etc. It is one of our basic assumptions that these feedback processes change in number, type, and quality as the learner makes progress towards the L2 (linguistic and social) norms and becomes more proficient interactionally. (The influence of the L1-based prior knowledge of a learner, of his/her linguistic, cognitive, and social background experiences on mastering communicative tasks in L2 cannot be dealt with in this paper; cf., however, Vollmer 1982.)
Having given a brief outline of the study and its theoretical perspective I will then focus on the description of different types of feedback processes and their linguistic means of realization by advanced German learners of English as L2. It will be shown that both self-initiated and other-initiated feedback processes can be ordered in rather systematic ways along criteria like implicit/explicit, perception of role relationship and interactional setting, degree of certainty about one's own knowledge (linguistic, factual, psychological), level of aspiration of comprehension/understanding, degree of awareness/consciousness and evaluation of communicative flow etc. Thus feedback processes will be defined as certain moves in oral interaction which are discoursal and structural in nature, yet can be either metalinguistic or metacommunicative or just as well illocutionary and/or propositional in form. (Repair and face-work are only part of the notion of feedback as introduced here.) Each of the types mentioned will be illustrated by at least one example taken from the Osnabrück corpus of elicited interview/role-playing data.
The last section of my paper will be devoted to discussing questions of how feedback processes could be used in evaluating pragmatic competence in L2, of how progress in this area could be defined in operational terms and "measured" quantitatively as well as qualitatively, of how feedback ability in second language learners is acquired and could be further developed (e.g. by instruction).
This paper will present two research proposals for conducting future L2 reading comprehension research. The first proposal involves eye-voice span (EVS); the second, text processing.
This paper will suggest the use of eye-marker and line-of-sight approaches for data collection. Some interesting research questions for L2 reading comprehension include: Does EVS increase with more highly/less highly constrained syntactic structures? Across different language groups? Across proficiency strata? Does EVS extend to natural intrasentential boundary points? If so, to what extent is EVS related to language proficiency? To short-term memory? What language groups exhibit significantly different fixation durations? On what kinds of structures? At what proficiency levels?
Studying the processes and products of text structure generation by patients who have impaired language abilities due to damage sustained to the central nervous system could give some insight into the neurological bases/factors that determine how text structure is created. Patients having transcortical motor aphasia or Broca's aphasia would be needed for such research. Patients with transcortical motor aphasia (also called frontal dynamic aphasia) exhibit no phonological impairments and no salient linguistic deficits with respect to lexical or grammatical formatives; their auditory and visual comprehension are relatively intact. Ideally, one should be able to obtain data bases from L1 and L2 readers who have no language impairments for comparison with the aphasia patients' data base.
* (Paper to be presented by Grant Henning)