Monday, March 5, and Tuesday, March 6, 1984
Sheraton Houston, Abilene Room
8:30 a.m. - 4:45 p.m.
How good are our present tests of language proficiency? How can they be improved? These are the abiding questions which underlie the work of researchers in language testing. And what of the revolution at hand with computerized adaptive language testing now possible owing to modern statistical theory and the increasing abundance of computers and computer knowledge in language-learning settings? These are the major interests as researchers meet for the sixth annual session.
A major emphasis is on oral testing, theory and procedures, in this Symposium. Concern with validity, scoring, and acceptable performance standards characterizes studies to be presented. Considerable interest in cloze procedure continues. Researchers work on the frontiers of language testing as they pose new questions about several varieties of cloze, including oral and written. Topics also include evaluation of compositions and of reading comprehension, with measures of these skills subjected to up-to-date statistical techniques. Attitudes of test-takers are also considered in studies to be presented.
Perhaps most compelling is the researchers' concern with the future which is so clearly upon us: the pairing of innovative testing procedures with computers. The possibilities are exciting. This new work also merits the best foresight that can be brought to bear. A panel discussion from the vantage points of modern statistical theory, modern language testing theory, practical considerations, current experience, and research questions will consider computerized adaptive language testing (CALT), not a moment too soon.
See you in Houston!
Chair: Virginia Streiff, Measurement and Research Services, Texas A&M University
Presentations: Monday, March 5
Presentations: Tuesday, March 6
ABSTRACTS
Monday
This paper reports the findings of a study designed to investigate the performance validity of oral proficiency tests which employ interview procedures. Building on previous studies which have supported the content and construct validity of oral interview tests, this study explores the link between ratings of oral interview tests as judged by ESL specialists and estimates of communication ability as judged by non-ESL-specialists. In order to explore further the need to differentiate among the non-specialists, three groups of reactors were chosen. In an academic setting a wide range of non-native speakers of English was administered the oral interview tests. A sample of interviews representing the entire range of those evaluated by the ESL specialists was presented to faculty and student peers, university students at random, and non-academic personnel for their evaluation. The results of the study help to support the need to distinguish among audiences and settings in making estimates of oral communication ability. Moreover, evidence is provided to encourage expansion of verbal descriptions usually accompanying reports and ratings of oral proficiency to account for such differences.
The Test of Spoken English (TSE) was recently developed by the Educational Testing Service to provide a measure of oral English proficiency. The criterion validity of the TSE has been established (Clark and Swinton, 1980) using the FSI Oral Interview as the criterion measure and TOEFL scores as general proficiency ratings. However, actual use of TSE scores to set cut-off scores is largely a matter of determining local standards of oral proficiency. This paper reports on research conducted at a language-oriented graduate school that now requires TSE scores for all non-native English speaking applicants for admission. Following procedures similar to those used by Livingston (1978) in setting standards for bilingual teachers, students and faculty members rated non-native speech samples at various TSE levels as acceptable or unacceptable for graduate students enrolled in the school. Research questions addressed in the study include the following: 1. What is an acceptable TSE score for non-native English speaking students at this school? 2. Is there a statistically significant difference between the level of acceptability set by faculty versus student raters? 3. Do the levels of acceptability vary significantly across the five graduate divisions at the school? 4. How do levels of acceptability preferred by faculty and students at this language-oriented school compare to those set at other institutions using the TSE? The results and the process of standard-setting will be discussed in relation to the published literature on determining cut-off scores as well as literature on native speakers' tolerance of second language learners' errors.
The purpose of this study was to investigate the efficacy of administering a "spoken" cloze test. Two versions of a 50-word, every-7th deletion cloze passage were developed: one open-ended and the other multiple-choice. Forty-five ESL students were asked to successively read these two versions aloud on tape. They were required to supply or choose the answers for each blank as they proceeded. The results were scored in a variety of ways including multiple-choice, exact-answer and acceptable-answer scoring methods. The results were analyzed for relative reliability across scoring methods. Then they were compared to scores on the three subtests of TOEFL, a writing sample and a written cloze passage in order to explore the relationship of spoken cloze to other types of language tests. The results indicate that spoken cloze is a reasonably reliable alternative test format. The possibility of using this format for exploring discourse strategies is also discussed.
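For readers unfamiliar with the format, every-7th deletion and the exact-answer and acceptable-answer scoring methods named above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the starting position of the first blank, and the case-insensitive matching are hypothetical choices, not details reported by the study.

```python
def make_cloze(words, every=7, offset=6):
    """Blank every `every`-th word, starting at 0-based index `offset`.

    Assumption: the first blank falls on the 7th word; the study does not
    say where deletion began.
    """
    passage, answers = [], []
    for i, word in enumerate(words):
        if i >= offset and (i - offset) % every == 0:
            answers.append(word)      # keep the deleted word as the key
            passage.append("____")
        else:
            passage.append(word)
    return passage, answers


def score_exact(responses, answers):
    """Exact-answer scoring: credit only the originally deleted word."""
    return sum(r.strip().lower() == a.lower()
               for r, a in zip(responses, answers))


def score_acceptable(responses, acceptable):
    """Acceptable-answer scoring: `acceptable` holds one set of
    lower-case words per blank, each judged contextually acceptable."""
    return sum(r.strip().lower() in ok
               for r, ok in zip(responses, acceptable))
```

Acceptable-answer scoring needs a judged list of alternatives for each blank, which is why it is usually more laborious than exact-answer scoring.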
This paper examines gestalt theory's possible contributions to the ILR threshold rating and to the writing of the ILR foreign language proficiency definitions and their academic counterpart, the ACTFL/ETS Guidelines. The paper presumes a thorough knowledge of ILR rating procedures and of the general theory underlying criterion-referenced foreign language testing. Following gestalt theory, the author presents a rationale for definitions describing 'wholes' (definitional gestalts) instead of 'parts' (performance gestalts). As a consequence, definitions must be written so as to clearly and concisely transmit each level's definitional gestalt to maintain the gestalt's integrity, while simultaneously indicating the permissible extent of variability in performance gestalts at ILR base levels, at ILR plus levels and in ACTFL/ETS subranges. Moreover, the author suggests that in applying these definitions and guidelines a performance should be matched first to a base level definitional gestalt and later, to a suitable plus level or subrange definition. Finally, he stresses that no matter how well written, verbal descriptions will ultimately not suffice for some rating applications and must be supplemented by the 'living tradition' that has resulted from 30 years of extensive government testing experience.
The new placement test TC-CS developed co-operatively between test specialists and teachers for the evening course program of the Language Training Program Branch takes account of candidates' general language competence, their knowledge of the specific language content basic to the evening course and, most importantly, their ability to use this knowledge in performance. The distinction between knowledge and performance was emphasized by teachers who noted that students differed markedly in their ability to use the language knowledge they had acquired. Furthermore, the training program has been revised and now aims at a more communicative approach focusing on language performance. The cloze test designed to measure language competence (along with a self-evaluation measure to identify true beginners) was the preliminary screening device placing students into three categories: beginner, intermediate, and advanced. The students were then administered a multiple choice test corresponding to these three categories and covering the respective part of the course content. This part of the test assigned a specific lesson to the student. The short performance interview classified students into low, medium or high performers. Thus it is possible to have three classes of students at any one lesson but differing in their ability to use language. This paper describes the tests' development and administration to over 600 anglophone public servants learning French. Test analyses are presented for discussion. Of special interest is the validation study after placement, particularly with respect to performance groupings.
In a previous paper presented at the Fifth Annual Colloquium on Research in Language Testing, I started out to describe the theoretical construct "ability to give and elicit feedback" within native speaker/learner oral interactions. This assumed ability was considered to be an integral part of a learner's conversational competence, yet of central importance for successful communication and, from a research point of view, relatively easily assessable on the performance level. Thus, I have tried to identify a number of different types of feedback processes and illustrate their linguistic means of realization. However, problems of how to evaluate feedback ability could not be dealt with adequately. In this paper I will focus on problems of measuring communicative performance, namely feedback ability, in our project at the University of Osnabruck. I shall discuss how to define variables, how to identify "items" out of a large corpus of spontaneous speech, how to construct scales and give weights, how to make frequency counts, build ratios or form combined measures. Above all, the relationship between qualitative measures of feedback ability (yielding an interval scale) and quantitative ones (yielding a ratio scale) is being investigated: how to make comparisons between the two and what they mean. The second portion of my paper will reconsider the notion of conversational competence and the role of giving and eliciting feedback as an indicator of cooperativeness and metadiscoursal ability. In the third and last part the attention is shifted away from the learner to the native speaker as interviewer or interactant in general.
This research project, conducted in The Netherlands during Fall 1983, centers on the premise that the native language exerts an influence in the area of grammatical acceptability of target language sentences. To assess this influence, two grammatical acceptability measures were developed on the basis of an error corpus containing sentences with errors that were produced by university students of English as a Foreign Language in The Netherlands. These measures were administered to approximately 125 first-year students and approximately 125 third-year students from the Department of English at the University of Utrecht. On one test the subjects were to locate and correct the error while the second test was speeded and the subjects needed only to react to the acceptability of the sentence. In both measures there were sentences with no errors. Both measures were also administered to native speakers of British English, the target language for the Department of English. Expectations based on the Acquired versus Learned Knowledge Theory or the Competence Knowledge and the Performance Knowledge Theory would predict that performance on the speeded version would be lower than performance on the slowed version. Results indicate that this is true for both groups of subjects, but that the distribution of difference scores is bimodal; approximately one-third of the subjects have difference scores of almost zero, while two-thirds of the subjects have difference scores that are significantly larger. The distinction between these two groups of subjects is not related to any language proficiency measure that was concurrently administered to the subjects. The paper discusses the possible effects of cognitive styles, language ego permeability, and permeability of competences on the distinction between those subjects who do demonstrate a difference between speeded and slowed performance and those who do not demonstrate this difference.
Cloze procedure holds promise as a valuable tool in computerized adaptive language testing (CALT), with multiple choice (MC) cloze suggested as an alternative to open-ended cloze. MC cloze is potentially useful in circumstances where machine-scoring is needed or preferred owing to speed, objectivity, and, in the case of CALT, increased accuracy of skill estimation. One of a number of questions surrounding MC cloze is that of how best to develop distractors, that is, response choices which meet traditional standards of plausibility, discrimination, and difficulty. A further, perhaps more pertinent, question is that of how distractors developed from responses function in terms of Rasch-calibrated difficulty across alternate forms of the measure. This presentation will report on traditional item discriminations and difficulties as well as on Rasch-calibrated difficulties. The results may be of use in further development of MC cloze for large-scale language testing, computerized adaptive and otherwise.
This paper will present a comparative analysis of ESL reading comprehension test data which will be submitted to statistical procedures from classical test theory and latent trait measurement. The purpose of the study is to determine which approach will yield more usable information for an ESL testing situation. The principal testing instrument will be a series of reading comprehension questions based on short texts. For one aspect of the study a multiple-choice, sentence-completion grammar test will be used as a criterion measure. For each test item the following classical test theory indices will be computed: item difficulty, item discriminability with sample separation, point-biserial correlation coefficients, and item variance. Following Henning (in press), an internal construct validation procedure will be conducted on the reading test in the following manner. If the reading comprehension items have construct validity, the point-biserial correlation between each reading item and the total scores for reading should be higher than the point-biserial correlations of the same items with the total scores of the sentence-completion grammar test. Mathematically this relationship would be expressed as follows:
$$r_{RC_1\,RC_T} > r_{RC_1\,GT}.$$
The reading comprehension data will also be analyzed using the Rasch One-Parameter Model, which is probabilistic in nature because persons and items are graded for ability and difficulty and are judged according to the probability of their response patterns given the observed person ability and item difficulty. The information derived from the person-separability index will be compared with the classical test theory indices cited above.
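As a rough illustration of the quantities this abstract names, the sketch below computes item difficulty as proportion correct, the point-biserial correlation between a dichotomous item and a total score, and the Rasch one-parameter probability of a correct response. The function names are hypothetical, and whether the item is counted in its own total is a choice the abstract leaves open.

```python
import math


def item_difficulty(item):
    """Classical item difficulty: proportion of testees answering correctly."""
    return sum(item) / len(item)


def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and total scores.

    Uses r = (M1 - M0) / s * sqrt(p * q), with s the population
    standard deviation of the totals.
    """
    n = len(item)
    ones = [t for x, t in zip(item, totals) if x == 1]
    zeros = [t for x, t in zip(item, totals) if x == 0]
    if not ones or not zeros:
        raise ValueError("item must have both correct and incorrect responses")
    mean_all = sum(totals) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    if sd == 0:
        raise ValueError("total scores have zero variance")
    p = len(ones) / n
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    return (m1 - m0) / sd * math.sqrt(p * (1 - p))


def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = exp(a - d) / (1 + exp(a - d)), in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

Under the validation procedure described above, a reading item would support construct validity when `point_biserial(item, reading_totals)` exceeds `point_biserial(item, grammar_totals)`; both total-score lists are placeholders for data the study would supply.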
Written samples of English as a foreign language have been evaluated according to a host of methods ranging from holistic impressions to tallies of grammatical and mechanical errors. The present research posits a five-point nativeness continuum for usage of selected syntactic class items. Thirty-eight compositions of controlled topic and 124-word mean length written by graduate level Egyptian EFL learners are then located on the nativeness continuum with respect to usage of articles and prepositions. Results are then compared with more conventional evaluation criteria, including holistic ratings and mechanical error frequencies. Rasch Model latent-trait analysis is employed to position function words on a difficulty/nativeness continuum and to test the goodness of fit of persons and items to the predictions of the model.
The purpose of this research was to complement contemporary approaches to test evaluation in ESL. In addition to classical uses of concurrent validity and reliability, recent quantitative experimental research has ranged from construct validation (Bachman and Palmer 1983) to test affect (Shohamy 1980, Madsen 1982). As valuable as these findings are, alternate methodologies in second-language research (Schumann and Schumann 1977, Ochsner 1979, Long 1980, Cohen and Hosenfeld 1981) hold the promise not only of corroborating earlier results but also of providing additional insights. The present study sought to evaluate the midterm tests in two college ESL classes by utilizing personal interviews to generate retrospective data. The pilot phase involved 9 students enrolled concurrently in 2 ESL courses taught by the same instructor. The main study involved 13 students enrolled in the same 2 ESL courses. In the pilot, students were interviewed by their instructor, who took notes on their responses. In the main study, an outside interviewer was used, and responses were taped. Both groups were administered the Alpert-Haber Achievement Anxiety Test. Following an informal conversational interview, a "general interview guide approach" (Patton 1980) was followed. Responses included emotive reactions to test types, evaluations of procedures, and discussion of strategies employed in preparing for and taking the tests. Findings revealed that students were affected by differing test formats, differing anxiety ratings and test settings, practice, course subjects, and cultural background.
In contrast to a traditional test in which each testee is essentially required to respond to the same set of questions regardless of his or her individual ability to do so, an adaptive test is one that can be tailored during its administration to the level of performance of each testee. The purpose of this paper is to provide an overview of research related to this very promising trend as it has emerged in the area of language testing. This overview will be presented in three steps. (1) The main characteristics of the development, administration and interpretation of an adaptive test will be briefly contrasted with those of a traditional one. (2) Exemplary work on adaptive language tests will be reviewed. Two categories of adaptive language tests will be examined: oral interaction tests, perhaps best exemplified by the Oral Interview of the U.S. Government; and computerized adaptive tests as being researched at Educational Testing Service, the U.S. Department of Defense and elsewhere. As well, we will outline our own research activities in the areas of microcomputer-assisted teacher training in adaptive oral interviewing and of development of authoring systems to permit construction of adaptive tests of reading and listening comprehension. (3) Finally, some of the major advantages and disadvantages of adaptive language tests will be examined. Particular attention will be devoted to three considerations: the effects on the motivation and test-taking experience of the testee; the accuracy of measurement and validity of interpretation; and the practicality of test development and administration.
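The tailoring idea described in this abstract can be sketched in a few lines. The sketch assumes that every item carries a difficulty value on the same scale as the testee's ability estimate (for instance from a Rasch calibration), picks the unused item closest to the current estimate, and moves the estimate up or down by a shrinking step. The step-halving rule and all names are hypothetical simplifications for illustration, not the procedure of any system mentioned above.

```python
def next_item(theta, difficulties, used):
    """Pick the unused item whose difficulty is closest to the estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta))


def run_adaptive_test(difficulties, answer_fn, n_items=5, theta=0.0, step=1.0):
    """Administer up to `n_items` items, adapting to each response.

    `answer_fn(i)` returns True if the testee answers item `i` correctly.
    Assumption: a simple step-halving update stands in for a proper
    maximum-likelihood ability estimate.
    """
    used = []
    for _ in range(min(n_items, len(difficulties))):
        i = next_item(theta, difficulties, used)
        used.append(i)
        if answer_fn(i):
            theta += step   # correct: try harder items next
        else:
            theta -= step   # incorrect: try easier items next
        step /= 2           # narrow the search as evidence accumulates
    return theta, used
```

A traditional test would instead present every item in `difficulties` to every testee in a fixed order, regardless of the responses given along the way.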