February 27 to March 1, 1992
Asian Studies Centre
University of British Columbia
Vancouver, British Columbia
Special Acknowledgements
1992 Colloquium Organizing Committee
John de Jong, Grant Henning and Marjorie Wesche (Chair)
Local Arrangements
Rick Berwick, Alister Cumming and Helen Tegenfeldt
LTRC Archives
Fred Davidson
Abstract Readers
Charles Alderson
Lyle Bachman
Caroline Clapham
Fred Davidson
John de Jong
Dan Douglas
Liz Hamp-Lyons
Grant Henning
Thom Hudson
Michel Laurier
Bonnie Norton-Pierce
Doreen Ready
Elana Shohamy
Charles Stansfield
John Upshur
Marjorie Wesche
Program Booklet
Doreen Ready and Marjorie Wesche
Preparation of Documents
and Mailing List Maintenance
Beatrice Magyar
Registration
Helen Tegenfeldt, Beryl Tonkin and Marjorie Wesche
Wine and Cheese Reception
and Colloquium Refreshments
University of British Columbia,
Language Education Department
Buses to Vancouver Harbour
Vancouver Community College
Administrative Support
Second Language Institute,
University of Ottawa
Language Education Department,
University of British Columbia
Vancouver Community College
Publication of Selected Papers
Rick Berwick and Alister Cumming
And a special thanks to all presenters, chairpersons and all volunteers
Contents
Conference Schedule
Abstracts for the Paper Sessions (arranged alphabetically)
Abstracts for the Poster Session (arranged alphabetically)
Abstracts for Software Applications/Demonstrations (arranged alphabetically)
Symposia: Overview and Abstracts (in the order of presentation)
Addresses of Program Participants
LANGUAGE TESTING RESEARCH COLLOQUIUM 1992
THURSDAY, FEBRUARY 27
4:00 - 6:00 p.m. Registration
7:30 - 9:30 p.m. Wine & Cheese Reception
(hosted by UBC Language Education Department)
FRIDAY, FEBRUARY 28
7:30 - 8:30 a.m. Registration; coffee, juice and muffins
8:30 - 8:45 a.m. Opening and Introduction (M. Wesche)
8:45 - 9:30 a.m. Plenary: "Task-Based Testing" P. Skehan
PAUSE
9:45 - 12:00 noon Paper Session I. Chair: J. de Jong
C. Clapham: "What Makes an ESP Reading Test Appropriate for its Candidates?"
C. Elder: "How do Subject-Specialists Construe Second Language Proficiency?"
J. Hulstijn: "Individual Differences in L2 Proficiency as a Function of Individual Differences in L1 Proficiency"
K. Perkins, G. Henning: "Differential Person Functioning on Language Proficiency Tests"
12:00 - 1:00 p.m. LUNCH (buffet tickets available)*
1:00 - 2:15 p.m. Paper Session II. Chair: B. Spolsky
C. Stansfield, D. Kenyon: "Comparing the Scaling of Speaking Tasks by Language Teachers and by the ACTFL Guidelines"
C. Taylor, G. DeMauro, B. Voltmer, B. Henderson: "The Impact of TWE Reader Qualifying Criteria on Reader Performance"
PAUSE
2:30 - 4:15 p.m. Symposium: "Educational and Social Impact of Language Tests".
Chair: B. Spolsky
E. Shohamy**, E. Cascallar**, C. Alderson, D. Wall. Discussants: G. Henning, B. Spolsky.
PAUSE
4:30 - 6:00 p.m. Test Software Applications/Demonstrations
Chair: G. Henning
M. Pienemann, I. Thornton, A. Mackey: "On-line Diagnostic Screening of Interlanguage Acquisition by Expert System"
M. Des Brisay: "Applications of TESTGRAF in Setting Cut-off Points on ESL Tests"
M. Laurier: "Using the Information Curve to Assess Language CAT Efficiency"
* lunch tickets must be purchased at registration
** symposium organizers
SATURDAY, FEBRUARY 29
8:00 a.m. Coffee, juice, muffins
8:30 - 10:15 a.m. Paper Session III. Chair: D. Porter
A. Brown: "The Role of Test-Taker Feedback in the Test Development Process"
C. Gordon: "The Effects of Affective Variables and Testing Methods on Test Performance in EFL"
I. Wijgh: "A Communicative Test in Analysis"
PAUSE
10:30 - 12:15 p.m. Paper Session IV. Chair: L. Hamp-Lyons
S. Ross: "Accommodative Questions in Oral Proficiency Interviews"
E. Shohamy: "Discourse Validation of a Direct Versus a Semi-Direct Oral Test"
R. Young, M. Milanovic: "Discourse Variation in Oral Proficiency Interviews"
LUNCH (box lunches available)*
12:30 - 1:45 p.m. LTRC Business Meeting "Brown Bag"
1:45 - 2:00 p.m. Report on International Language Testing Association (C. Stansfield)
2:00 - 2:15 p.m. Presentation on Language Testing (A. Davies, J. Upshur, eds.)
2:30 - 4:45 p.m. Paper Session V. Chair: J. Upshur
G. Buck: "The Analysis of Multidimensional Data Sets"
J. de Jong: "Methodology for Formally Combining Dichotomously and Polychotomously Scored Items"
G. Henning: "Dimensionality and Construct Validity of Language Tests"
T. McNamara, A. Brown, C. Elder, T. Lumley: "Mapping Abilities and Skill Levels Using Rasch Techniques"
5:30 p.m. Buses to Vancouver Harbour from Gage Court and Asian Studies Centre
3 - 4 hours Banquet and Harbour Cruise
* lunch tickets must be purchased at registration
SUNDAY, MARCH 1
8:00 a.m. Coffee, juice and muffins
8:30 - 10:45 a.m. Paper Session VI. Chair: A. Davies
J.D. Brown, E. Detmer, T. Hudson: "Developing and Validating Tests of Cross-Cultural Pragmatics"
J. Fox, B. Zumbo, T. Pychyl: "Psychometric Properties of the CAEL Assessment: An Examination of the Dependability/Reliability of Placement Decisions"
D. Lussier: "A Systemic Approach to Evaluation of Second Language Programs in Quebec"
C. Turner, J. Upshur: "Validation of Measures of Grammatical Knowledge and Communicative Ability"
PAUSE
11:00 - 12:15 p.m. Poster Presentations. Chair: G. Buck
H. Tegenfeldt, V. Monk: "The A-LINC Assessment"
M. Wesche, D. Ready: "The Ontario Test of ESL: OTESL"
M. Des Brisay, J. St.John: "Canadian Test of English for Scholars and Trainees: CanTEST"
J. Ross, S. Chyn: "TOEFL 2000: Future Directions"
E. Cascallar, M. Walker: "TSE Validation in Federal Government Language Programs"
S. Cushing, B. Lynch: "Hypothesis Testing in Construct Validation"
A. Cumming, D. Mellow: "Validity of Written Indicators of Second Language Proficiency"
D. Douglas, L. Selinker: "CHEMSPEAK: An Update"
A. Huhta, E. Randell: "Multiple-Choice Summary: A Measure of Text Comprehension"
J. Hulstijn: "Implementation Issues: National Dutch Language Certification Examinations"
12:15 - 2:00 p.m. LUNCH and Poster displays (box lunches available)*
2:00 - 3:15 p.m. Paper Session VII. Chair: E. Shohamy
K. Bailey, J.D. Brown: "Language Testing Courses: What are They?"
A. Davies: "The Role of the Segmental Dictionary in Professional Validation"
3:30 - 5:15 p.m. Symposium: "Development and Use of Rating Scales in Language Testing". Chair: G. Henning
D. Kenyon**, M. Milanovic, N. Saville, A. Cook, T. McNamara, J. Hamilton, E. Sheridan, F. Davidson, L. Hamp-Lyons, B. Tyndall
5:15 - 5:30 p.m. ANNOUNCEMENTS
5:30 p.m. CLOSURE
* lunch tickets must be purchased at registration
** symposium organizer
Papers
Language Testing Courses: What are They?
Kathleen Bailey, James Brown
There is an increasing trend for language testing courses to be offered as part of the curriculum in ESL teacher training programs around the world. The purpose of this project was to investigate the structure and content of introductory language testing courses, as well as student attitudes toward them. To those ends, a questionnaire was designed to cover a variety of topics including the instructor's background, the topics covered in the course, the types of students in the classes, and the students' attitudes toward language testing before and after the course. While the questions were predominantly Likert-scale in format, a number of open-ended questions were also included.
The questionnaire was sent to all the "active members" on the mailing list for the Language Testing Research Colloquium during Fall semester 1990. Two months later, a second mailing was sent out to those active members who had not responded to the first request for information.
There were a total of 150 respondents. Of these, more than half (n = 76) indicated that they had never taught a testing course (or had not taught one for a number of years). The remaining 74 respondents indicated that they had taught such a course; these 74 all completed the questionnaire. The results indicate considerable variation in the training of the instructors as well as in the shape that such testing courses take. The data are described in terms of what the typical testing course looks like, and are then cross-tabulated to show differences and similarities in teacher background, content, and student attitudes among the various types of institutions.
The Role of Test-Taker Feedback in the Test Development Process
Anne Brown
Recent research into performance test development (Kenyon and Stansfield 1991; Alderson 1988) raises the issue of the role of test-taker feedback in the test development process.
The value of such feedback is investigated in the context of a project to develop an LSP oral/aural proficiency test in Japanese for Tourism and Hospitality. Fifty-three subjects drawn from a range of Japanese language courses undertook the trial version of the test and completed a post-test questionnaire in which they provided reactions to the test as a whole as well as to task types and individual test items. Reactions consisted of a mixture of open comments and ratings on a 5-point scale.
Analysis of the data delineated the relationships between the characteristics of test-takers (such as relevant occupational experience, amount of language study completed, familiarity with the test format, etc.) and their reactions to the test.
Examination of the comments provided by the test-takers on individual items together with the statistical analysis of the test proved to be of value in the item revision process. Other comments enabled substantial improvements to be made to the overall design of the test and the test handbook.
The paper concludes with a discussion of the implications of the research findings for test validity.
Developing and Validating Tests of Cross-Cultural Pragmatics
J.D. Brown, Emily Detmer, Thom Hudson
There are numerous existing methods for measuring grammatical and textual competence, but no generally accepted valid measures of communicative ability components such as pragmatic competence. This paper presents results of a project to develop and validate tests of cross-cultural pragmatic ability.
The design is a multitrait-multimethod study using the traits of power, social distance and imposition in the speech acts of requests, refusals and apologies. Three different classifications of test methods are used: indirect measures of pragmatic knowledge (open-ended and multiple-choice), direct measures of pragmatic ability (role play and structured interview), and self-assessment of pragmatic skill (self-rating and evaluation of performance on video).
Subjects of this study are native speakers of Japanese in the U.S. and native speakers of English in Japan. In Spring 1991, pilot versions were administered in the U.S. and are being analyzed. In Fall 1991, students in Japan and the U.S. will be administered the indirect paper and pencil tests. From those results a smaller sample will be selected for further testing using the simulation and oral interview. All students who take the paper and pencil tests will be administered the cued response self-assessment instruments. The results of the tests will be analyzed: 1) through a content analysis comparing native with non-native speaker responses; 2) statistically with a focus on the effectiveness of individual items; 3) for trait validity and method effect.
Implications for the valid testing of cross-cultural pragmatic ability will be discussed.
The Analysis of Multidimensional Data Sets
Gary Buck
Most of the statistical procedures used in language testing require the assumption that the data is unidimensional. This assumption underlies the use of both IRT models and classical procedures such as internal consistency estimates and item/total correlations. Indeed, adding together item scores is only meaningful if like is being added to like. However, theoretical and empirical work in cognition, linguistics and language testing suggests overwhelmingly that language performance depends on a complex interaction of a multitude of variables (indeed, anything within human experience is a potential influence), and hence each individual item response in language test data is a unique, multidimensional composite of numerous variables.
Clearly, statistics used in language testing are robust to some violation of their assumptions, but the question of importance is whether the multidimensional nature of language test data violates these assumptions sufficiently to invalidate their use. This is an empirical question, which is addressed in this paper by the analysis of a number of error-free, 60-person-by-60-item data sets, created to progressively and systematically violate the unidimensional assumption. Using a model of multidimensional data from Coombs (1964), dimensionality will be operationalized within items, not between items. Bi-dimensional, tri-dimensional and multidimensional data sets were created with traits combined in conjunctive, disjunctive, compensatory and mixed composition. The 18 data sets will be analyzed using classical test analysis and procedures based on IRT models. They will also be subjected to the more common tests of dimensionality. Results will be presented, and implications for language testing will be discussed.
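The conjunctive, disjunctive and compensatory compositions named above can be illustrated with a minimal sketch. It assumes that each person and each item is described by one value per trait and that responses are deterministic (error-free); the function names, the decision thresholds and the two-trait example values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def respond(abilities: Sequence[float], difficulties: Sequence[float], rule: str) -> int:
    """Return 1 (correct) or 0 (incorrect) for one person on one item."""
    # Margin on each trait: how far the person's ability exceeds the item's demand.
    margins = [a - d for a, d in zip(abilities, difficulties)]
    if rule == "conjunctive":      # every trait must be sufficient
        return int(min(margins) >= 0)
    if rule == "disjunctive":      # any one sufficient trait is enough
        return int(max(margins) >= 0)
    if rule == "compensatory":     # a surplus on one trait offsets a deficit on another
        return int(sum(margins) >= 0)
    raise ValueError(f"unknown rule: {rule}")


def build_data_set(persons, items, rule: str) -> List[List[int]]:
    """Build an error-free person-by-item matrix of 0/1 responses."""
    return [[respond(p, i, rule) for i in items] for p in persons]


# Hypothetical two-trait example: two persons, one item.
persons = [(1.0, -0.5), (-1.0, 0.5)]
items = [(0.0, 0.0)]
for rule in ("conjunctive", "disjunctive", "compensatory"):
    print(rule, build_data_set(persons, items, rule))
```

With these made-up values the same two persons receive different response patterns under each rule, which is the sense in which dimensionality is located within items rather than between them.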
What Makes an ESP Reading Test Appropriate for its Candidates?
Caroline Clapham
Researchers into the effect of background knowledge on the reading test performance of university students have tended to presume that a reading text is specific for students in a given subject area if it is based on a topic within that subject area. However, this is not necessarily the case. In a study of student reading performance in three different academic fields, Clapham (1990) showed that not all the supposedly subject-specific reading passages were as specific as intended. Some were appropriate for students in more than one of the three designated subject areas, one was inappropriate for the targeted students, and one appeared to be more appropriate for students outside the relevant academic area. In an attempt to find out why this might be, a group of judges (some were language testers, some academic subject specialists) used a modification of Bachman's Test Method Characteristics and Communicative Language Ability rating instruments (Bachman 1991) to categorise the content of 10 nominally subject-specific reading tests. Texts and tasks were analyzed according to the criteria of source text, subject area, topic, genre, propositional content, degree of contextualisation, and organisational and grammatical complexity. The results were compared to measures of the tests' appropriacy which had been gained from students' test scores and questionnaire responses, and the effect of each of the criteria on test appropriacy was considered.
This paper reports on these findings and uses them to draw up proposals for the practical selection of suitable passages for ESP reading tests.
The Role of the Segmental Dictionary in Professional Validation
Alan Davies
The paper argues that a major aspect of language test validation is the deliberate construction of a profession of language testers through education and training, including the formation of an association, the establishment of journals etc., thereby fostering agreement on goals, procedures, terminology, norms and methods of evaluating innovations. An important tool of professionalising is the writing of a specialist dictionary (Crystal 1985; Richards, Platt and Weber 1985), also called a segmental dictionary (Opitz 1983), fulfilling a similar role to dictionaries in language standardisation.
A collaborative project in writing a language testing dictionary is described, the aim being to "try and introduce a measure of normalisation in the use of specialist terms and thus facilitate the exchange of information" (Moulin 1983, 146).
Distinctions between a dictionary, an encyclopedia and a glossary are considered. Feedback suggests that what is needed is an "encyclopedic dictionary" (McArthur 1986) since the project defines and avoids essays (and so is not a glossary, Abrams 1981) but has no information on pronunciation or on historical derivation.
Problems of coverage and of entry style are discussed and readers' feedback on sample entries compared. Further offers by LTRC participants to trial and critically evaluate entries are invited.
Methodology for Formally Combining Dichotomously and Polychotomously Scored Items
John de Jong
The method is illustrated with an empirical example based on data on a semi-direct oral proficiency test. The test consisted of 5 subtests, but only two of these will be used for the illustration: the first containing 20 dichotomously scored items on pronunciation, the second consisting of 10 items on a simple dialogue task where answers are rated in three ordered categories.
While the mathematical extension of the simple logistic model (the Rasch model) to the extended logistic model is quite straightforward, there is a potential cost as well as a benefit in incorporating polychotomously scored items. The potential benefit is that for the same stimulus, a more complex response and a finer discrimination can be obtained; the potential cost, and one that can be disclosed by the methodology, is that the ordering of the categories is not working as intended.
In the analysis it is shown that for some items the intended ordering of the categories was not as expected. Post hoc explanations of this malfunctioning are offered as well as suggestions for item revision resulting from the analysis. Results of a second administration of the test after revision will permit an evaluation of the effect of the revisions and therefore also of the appropriateness of the interpretation of the analyses from the first administration.
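The step from dichotomous to ordered-category items can be sketched as follows. The sketch assumes the threshold (partial credit) parameterization commonly used for polytomous Rasch models, which may differ in detail from the formulation used in the paper; the function names and threshold values are hypothetical.

```python
import math
from typing import List, Sequence


def category_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Probabilities of scoring 0..m on an item with m thresholds."""
    # Category k has log-weight sum_{j<=k} (theta - tau_j); category 0 has weight 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - tau))
    peak = max(logits)                      # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds: Sequence[float]) -> bool:
    """True when each threshold is higher than the one before it."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


# One threshold reduces to the simple logistic (Rasch) model.
print(category_probabilities(0.0, [0.0]))
# Three ordered categories, hypothetical thresholds: ordered vs. disordered.
print(category_probabilities(0.0, [-1.0, 1.0]), thresholds_ordered([-1.0, 1.0]))
print(category_probabilities(0.0, [1.0, -1.0]), thresholds_ordered([1.0, -1.0]))
```

When estimated thresholds come out in reversed order, the middle category is never the most likely response at any ability level, which is one way a category ordering can fail to work as intended.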
How do Subject-Specialists Construe Second Language Proficiency?
Catherine Elder
Recent research on rater variance (e.g., Barnwell 1990; Douglas and Selinker 1990; Hadden 1991) raises the question of whether "linguistically naive" subject-specialists may be better equipped than language experts to judge the effectiveness of particular areas of non-native speaker communication.
This question is investigated in the context of a project on the development of a classroom-based observation schedule to assess the English language proficiency of non-native speaker graduates training as secondary mathematics and science teachers. The paper examines aspects of rater behaviour as evidenced in recent trials.
The schedule was applied to observations of actual performance in the maths and science classroom as well as to the viewing of a number of videoed segments of classroom interaction. Ratings were elicited from two groups of assessors: 12 ESL teachers and 8 subject specialists (maths/science teachers/teacher-trainers). This allowed estimation of the intra- and inter-group reliability of the procedure and ultimately of the validity of using non-language experts as judges of language proficiency.
While there are significant correlations between subject specialists' and language teachers' overall judgements of communicative effectiveness, the application of t-tests and factor analyses to the data reveals differences between the two groups with respect to their ratings of particular dimensions of language use and the weighting of these dimensions in relation to global proficiency assessments. The paper concludes with a discussion of the theoretical and practical implications of these findings as they relate to the larger issue of the validity and reliability of occupation-specific performance tests.
Psychometric Properties of the CAEL Assessment: An Examination of the Dependability/Reliability of Placement Decisions
Janna Fox, Bruno Zumbo, Tim Pychyl
In this paper we will explore some of the psychometric properties of the Carleton Academic English Language (CAEL) Assessment. As this criterion-referenced assessment is used to make placement decisions (including eligibility for university admission) for non-native speakers of English at Carleton University, key issues are the reliability and validity of these decisions.
Using H. Huynh's domain-referenced reliability methods, we will examine the reliability of the placements made based on CAEL Assessments between 1988 and 1991 (n = 1000). Data based on the qualitative assessment of student placement by EAP teachers and students' course performance will also be presented. The results will be discussed in relation to the validity of this type of criterion-referenced test for making placement decisions in EAP programs.
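The idea of the dependability of placement decisions can be illustrated with a simplified sketch. It is not Huynh's procedure, which estimates decision consistency from a single test administration; instead it assumes that two sets of placement decisions are available for the same candidates and computes the raw agreement and the kappa coefficient directly. The function name, the labels and the data are hypothetical.

```python
from typing import Sequence, Tuple


def decision_consistency(first: Sequence[str], second: Sequence[str]) -> Tuple[float, float]:
    """Return (p0, kappa) for two sets of placement decisions on the same candidates."""
    n = len(first)
    # p0: proportion of candidates given the same placement both times.
    p0 = sum(a == b for a, b in zip(first, second)) / n
    # pc: agreement expected by chance from the marginal proportions.
    labels = set(first) | set(second)
    pc = sum((list(first).count(l) / n) * (list(second).count(l) / n) for l in labels)
    if pc == 1.0:
        raise ValueError("kappa is undefined when every decision is identical")
    kappa = (p0 - pc) / (1.0 - pc)
    return p0, kappa


# Made-up decisions for four candidates, for illustration only.
first = ["admit", "admit", "EAP", "EAP"]
second = ["admit", "EAP", "EAP", "EAP"]
print(decision_consistency(first, second))
```

Kappa discounts the agreement that would arise by chance alone, so it is usually lower than the raw proportion of consistent decisions.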
The Effects of Affective Variables and Testing Methods on Test Performance in EFL
Claire Gordon
The purpose of this study is to investigate the relationships between affective variables - attribution of success and failure in EFL tests, test anxiety and attitudes toward learning English as a foreign language - and performance on multiple-choice and open-ended EFL reading comprehension tests. The literature on the attributional theory of motivation suggests that causal ascriptions of success and failure in school performance give rise to affects and future expectancies, and these in turn are related to achievement outcomes. Research on the relationship between test anxiety and performance on cognitive tasks has produced inconsistent results with reports of both beneficial and detrimental effects of anxiety on achievement. In general, studies which have investigated the relationship of these psychological factors to school performance have focused mainly on the area of mathematics. The present study aims to test the theory for performance in EFL reading comprehension tests with particular attention to possibilities of different patterns of relationships for different test formats - multiple-choice (MC) and open-ended (OE).
Research Questions
1) Does receiving the test form of one's preference affect test performance?
2) Do students of different ability levels prefer different test formats or perform differently on different test forms?
3) How are affective variables related to performance on the two different test forms?
Subjects
The sample consists of 150 university students studying reading comprehension in EFL courses at the intermediate level.
Instruments
1) Two forms of a reading comprehension test (one open ended and one multiple choice) were administered randomly. Each test contained the same three short reading texts with a total of 14 items.
2) A questionnaire measuring attribution and attitude variables was administered prior to the test.
3) The Hebrew version of the Sarason RTT test anxiety questionnaire was administered prior to the test.
Dimensionality and Construct Validity of Language Tests
Grant Henning
Methods for establishing test dimensionality have received considerable research study in recent years (e.g., Boldt 1989; Goldstein 1980; Hambleton and Swaminathan 1985; Hattie 1985; Henning 1988; Oltman and Stricker 1990). This attention is appropriate given that the assumption of unidimensionality underlies the application of item response theory (IRT) and classical true score theory to test construction, test equating, and validity of ability inferences drawn from test scores. However, some researchers have lamented the constraints imposed by unidimensionality assumptions on language test content variation (e.g., Bachman 1989). Still others have asserted that IRT methodology with its unidimensionality assumption may be inappropriate for the study of complex multidimensional domains such as communicative language ability.
The present study will offer simulation evidence that "psychometric" dimensionality as measured by a variety of techniques may be somewhat independent of the number of psychological domains or constructs measured. An analysis of 10 simulated data sets, each involving 30 items with 100 examinees, will be presented to demonstrate that psychometric unidimensionality assumptions may be satisfied in situations where multidimensional constructs are measured simultaneously by the same test. It will also be shown that some techniques for testing psychometric dimensionality may suggest psychometric multidimensionality when there is clearly only one construct or dimension measured by the test. The conclusion is offered that psychometric dimensionality is sample dependent and that use of IRT methodology may not be inappropriate for tests of communicative language ability even though complex constructs underlie performance on the test.
Individual Differences in L2 Proficiency as a Function of Individual Differences in L1 Proficiency
Jan Hulstijn
In this presentation I will argue that individual differences in L2 proficiency should be conceived of as a partial function of individual differences in L1 proficiency, and that the relationship between L2 and L1 proficiency has for too long been neglected by L2 learning researchers.
To illustrate this claim, I will give a brief review of the relevant empirical literature. In particular, I will describe two studies which were conducted at our university. The first study dealt with hesitation phenomena in L2 and L1 speech. Four groups of Dutch learners of English (n = 65) read an English (L2) and a Dutch (L1) text aloud. The independent variables were English proficiency (high/low), and Grade (9/11). The dependent variables were Reading Speed, Prediction of 12 words omitted from the text, Repeats, Self Corrections and Deviations from the text.
The second study dealt with reading comprehension in L2 and L1. Subjects in this study were 60 adult, educated Turkish learners of Dutch, living in the Netherlands. They performed reading comprehension tests in Dutch (L2) and Turkish (L1), as well as an L2 vocabulary test and an L2 grammar test. Regression analyses were conducted to predict L2 reading comprehension performance on the basis of L1 reading comprehension, L2 vocabulary knowledge and L2 grammar knowledge.
The results of these studies suggest that differences between L2 learners on L2 tasks should be adjusted for differences on corresponding L1 tasks, before drawing conclusions pertaining to second language proficiency as a theoretical construct. Such a perspective could be useful for language teachers and language testers alike, highlighting the role of non-L2-specific factors in L2 proficiency.
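One simple way to adjust L2 scores for L1 performance is to regress the L2 score on the corresponding L1 score and keep the residuals. The sketch below assumes a single L1 predictor fitted by ordinary least squares, which is simpler than the multiple regression described above; the function names and the four score pairs are made up for illustration and are not data from the studies.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for y predicted from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def adjust_for_l1(l1: Sequence[float], l2: Sequence[float]) -> List[float]:
    """L2 scores with the part predictable from L1 removed (regression residuals)."""
    slope, intercept = fit_line(l1, l2)
    return [yi - (intercept + slope * xi) for xi, yi in zip(l1, l2)]


# Hypothetical reading scores for four learners.
l1_reading = [10.0, 20.0, 30.0, 40.0]
l2_reading = [12.0, 18.0, 33.0, 37.0]
print(adjust_for_l1(l1_reading, l2_reading))
```

A positive residual marks a learner who does better in the L2 than the L1 score alone would predict, and a negative residual marks the reverse.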
A Systemic Approach to Evaluation of Second Language Programs in Quebec
Denise Lussier
This presentation will be an overview of a systemic approach to second language program evaluation. This approach was seen as an integral part of a curriculum revision process. It was initiated by the Quebec Ministry of Education to make informed decisions with respect to retaining or revising the program in French as a Second Language. This project started in April 1989 and involved all those concerned with the teaching of FSL (administrators, curriculum advisors, parents, teachers and students).
The development and implementation of instruments for the project will be discussed. The focus is on the second language criterion-referenced test administered to measure the learning outcomes based on program objectives. The subjects were 600 junior high school students.
Reliability and validation measures will be discussed taking into consideration the parameters identified in the Definition of Domain of the instrument. Discussion will also include procedures for data collection concerning language skill mastery levels, identification of performance standards in a communicative-based program, and the establishment of a threshold level.
As such, this systemic process sought to assess the validity of the program learning objectives. It also investigated the program effects on teaching conditions and practices. Finally, it identified to what extent the students were able to meet the learning outcomes of the study program.
Mapping Abilities and Skill Levels Using Rasch Techniques
Tim McNamara, Anne Brown, Catherine Elder, Tom Lumley
This paper considers the use of the results of Rasch IRT analysis of data from tests of reading and listening skills in certification, diagnosis and placement.
A central concept in Rasch IRT analysis is the mapping of item difficulty and person ability onto a single scale. This allows the possibility of empirical definition and validation of levels of achievement or proficiency in the skill being assessed, represented by the clustering of items testing similar sub-skills at points along the ability/difficulty continuum defined by the analysis. It further allows description of the ability of individuals or groups in terms of the skills so identified. The mapping process can thus be used to generate criterion statements of skill levels for purposes of certification, selection or placement. Individual Student Profiles can also be generated to provide diagnostic information on individual candidates. The potential of Rasch analysis in this regard is illustrated with examples from two recent test development projects: one a scheme to promote the learning of foreign languages in the junior secondary school, the other a test of English for academic purposes.
Crucial to the above uses of this mapping procedure is the association of sets of items of given difficulty with specific kinds of sub-skills. How straightforward or problematic is this? Research on the relationship of item types to difficulty levels on a 55-item subtest of reading comprehension in a new EAP test is reported, and the difficulties encountered are discussed. A proposal for a program of further research in this area is outlined.
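The mapping of persons and items onto one scale can be sketched with the dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty. The sketch assumes that item difficulties have already been estimated and that each item has been assigned to one sub-skill; the function names, sub-skill labels and logit values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def skill_locations(item_bank: List[Tuple[str, float]]) -> Dict[str, float]:
    """Mean difficulty (in logits) of the items testing each sub-skill."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for skill, difficulty in item_bank:
        grouped[skill].append(difficulty)
    return {skill: sum(vals) / len(vals) for skill, vals in grouped.items()}


def profile(ability: float, item_bank: List[Tuple[str, float]]) -> Dict[str, float]:
    """Expected success on each sub-skill cluster for one person."""
    return {skill: rasch_probability(ability, loc)
            for skill, loc in skill_locations(item_bank).items()}


# Hypothetical item bank: (sub-skill label, difficulty in logits).
bank = [("literal detail", -1.0), ("literal detail", -0.5),
        ("inference", 1.0), ("inference", 1.5)]
print(profile(0.0, bank))
```

Such a profile is only as meaningful as the link between item clusters and sub-skills, which is the question the paragraph above raises.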
Differential Person Functioning on Language Proficiency Tests
Kyle Perkins, Grant Henning
Considerable attention has been given to recent studies of differential item functioning (DIF) on language proficiency tests (e.g., Alderman and Holland 1981; Chen and Henning 1985; Kunnan 1990). These and other studies employ a variety of methods to examine whether persons of matched ability drawn from different language, gender, or ethnic groups perform differently on the same test items. In this way, such studies seek to identify systematic item performance differences that may inform the study of person groups. The presence and direction of such item performance differences may also impact upon the validity of inferences drawn about person ability on the basis of test scores.
Less attention has been given to the related phenomenon of differential person functioning (DPF) on language proficiency tests (e.g., Henning 1990). DPF studies imply DIF methodology; but unlike DIF studies, DPF studies ask whether items of matched difficulty drawn from different format or content domains function differently for the same persons. The presence of systematic DPF may signify test multidimensionality and may have implications regarding construct validity.
The proposed study will examine differential person functioning for ESL/EFL students responding to TOEFL test items of matched difficulty drawn from different content/format domains of the TOEFL test. Suggestions will be offered about alternative DPF methodologies and about item difficulty regions and content domains where DPF is most evident. This study is also intended to contribute to a better understanding of domain theory.
Accommodative Questions in Oral Proficiency Interviews
Steven Ross
The processes involved in defining oral proficiency in second languages have to date not involved detailed analysis of the discourse characteristic of oral proficiency interviews. The present study considers the phenomenon of variation in questions posed by interviewers at key junctures in the interview process. Based on variable rule analyses of sixteen full-length oral proficiency interviews, it is argued that perceptions of oral proficiency are reflected in the extent of accommodation in interviewer questioning, and that the extent of accommodation may provide a powerful factor in determining oral proficiency as well as a criterion for interviewer training.
Discourse Validation of a Direct Versus a Semi-Direct Oral Test
Elana Shohamy
Establishing validity is an on-going process in which questions of different types are asked about a test. This paper illustrates the different phases needed in validation studies and the importance of studying tests from a variety of perspectives using the validation of two oral tests as an example.
Major attention has been given recently to the development of semi-direct oral tests (SOPI) as substitutes for more direct tests such as the Oral Proficiency Interview (OPI). In a number of studies the two tests correlated highly with one another (Stansfield and Kenyon 1989; Shohamy and Stansfield 1990). In a later study (Shohamy et al. 1991) the two tests were compared with regard to their discourse strategies and grammar, showing that while the grammatical features of the two tests were the same, there were differences in a number of discourse strategies: paraphrasing (higher in SOPI); switch to L1 (higher in OPI); error correction (higher in SOPI).
This study is the third step in the validation study, which examined whether the two tests, or the language samples obtained from them, were actually the same. Features known to exist in oral discourse were examined as to their emergence in the two tests. The specific discourse genre and a variety of discourse elements such as elaboration, expansion of ideas, switch of topics, and a variety of prosodic features were examined and compared. Results showed that the two tests are not the same; they represent different discourse and contain different discourse elements. The language on the SOPI contains features of literate language and is similar in some ways to written language, as the interlocutor is not present during the test. The OPI exemplifies conversational discourse features, typical of oral language that is contextualized, expanded, elaborated, etc. However, neither test includes a large number of the discourse features which are typical of oral language. The extent to which each of the tests is more construct valid (i.e., better represents the construct of oral proficiency), the implications of the results for oral testing in general, and the steps which should be included in validation studies will be addressed.
Comparing the Scaling of Speaking Tasks by Language Teachers and by the ACTFL Guidelines
Charles Stansfield, Dorry Kenyon
Do classroom language teachers perceive the level of ability required to perform speaking tasks of various complexity in a manner compatible with the level of ability required for those tasks as outlined in the Guidelines for Speaking Proficiency developed by ACTFL? This paper compares the scaling of 38 speaking tasks in terms of the level of ability required to perform each one as perceived by language teachers in public schools with the scaling of the same speaking tasks according to the ACTFL Guidelines.
Seven hundred language teachers in Texas were randomly surveyed and requested to rate 38 foreign language speaking tasks, ranging from "Introduce Yourself" to "Discuss a Professional Topic", on a scale of 1 to 5 in answer to the question "Is the level of ability required to perform this task needed by language teachers in Texas public schools?" Responses were received from 62 French teachers, 121 Spanish teachers, and 240 bilingual education teachers. Data were scaled using many-faceted Rasch analysis.
This paper presents the results of this survey by teacher group, and compares those results with the a priori scaling of the tasks according to the ACTFL Guidelines. Similarities in scaling of the tasks support the conclusion that the ACTFL Guidelines represent a proficiency hierarchy perceived even by language teachers untrained in the proficiency movement.
The Impact of TWE Reader Qualifying Criteria on Reader Performance
Carol Taylor, Gerald DeMauro, Barbara Voltmer, Bruce Henderson
The Test of Written English (TWE) program requires that all potential TWE readers demonstrate their ability to apply established scoring criteria accurately and consistently to TWE essays, either by qualifying through a one-day reader training session or by reading successfully for other ETS essay reading programs.
The purpose of the present study is to document current TWE reader training procedures and evaluate reader performance data from TWE reader qualifying sessions and subsequent TWE readings to determine:
- how qualifying performance relates to later reader performance
- which variables examined from qualifying performance are the best predictors of successful reader performance
- whether or not decision points for qualifying and evaluating readers need to be adjusted.
The current measures of TWE reader performance include the following:
- number of essays read as first and second reader
- correlation of scores with those of other assigned readers
- discrepancy with other assigned readers
- ratio of standard deviations of scores
- mean score differences.
The measures of trainee qualifying performance include comparative statistics of the agreement between trainees and expert readers.
Multiple correlations and discriminant analyses, respectively, will be used to evaluate the relationships among the reader training and performance variables and to determine training variable values that best identify qualified readers. Samples included 619 reader trainees and 275 readers. Preliminary analyses suggest that current reader evaluation criteria enable the program to identify competent readers. Further descriptive analyses will be used to explain the outcomes of the study.
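The reader performance measures listed above can be illustrated with a minimal sketch. It assumes two parallel lists of scores for the same essays (one reader against another assigned reader) and counts a pair as discrepant when the scores differ by more than one point; the function names, the threshold and the sample scores are all hypothetical and are not taken from the TWE program.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def reader_agreement(reader, other, max_gap=1):
    """Summarize one reader's scores against another reader's scores.

    max_gap is an assumed threshold: score pairs further apart than
    this are counted as discrepant.
    """
    gaps = [abs(a - b) for a, b in zip(reader, other)]
    return {
        "n_essays": len(reader),
        "correlation": pearson(reader, other),
        "discrepancy_rate": sum(g > max_gap for g in gaps) / len(gaps),
        "sd_ratio": stdev(reader) / stdev(other),
        "mean_difference": mean(reader) - mean(other),
    }


# Hypothetical scores for six essays read by both readers.
reader = [4, 3, 5, 2, 6, 4]
other = [4, 4, 5, 2, 4, 3]
stats = reader_agreement(reader, other)
for name, value in stats.items():
    print(name, round(value, 3))
```

A standard deviation ratio near 1 and a mean difference near 0 would suggest that the two readers use the scale in a similar way; how such values map onto qualifying decisions is a matter for the study itself.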
Validation of Measures of Grammatical Knowledge and Communicative Ability
Carolyn Turner, Jack Upshur
This paper will report the results of a construct validation study of 14 tests of grammatical knowledge (GK) and communicative ability (CA). This is the second phase of an investigation of two competing theories of second language learning. The two theories are: 1) attainment of grammatical knowledge in a second language precedes the development of communicative ability; 2) communication with limited linguistic resources provides the necessary condition for the development of grammatical knowledge. The study is being conducted in three phases. The final phase will be a cross-lagged time series study of GK and CA.
The first phase involved construction of tests of GK and CA utilizing eight different test methods. Fourteen of the 16 tests were tried out for feasibility, and appropriate modifications were made as a result of that trial. The rationale for the project and the results of the first phase were reported at LTRC 1991.
The second phase of the study involves the construct validation of seven tests each of two hypothesized second language traits, namely GK and CA. The 14 tests will be administered to a sample (n approximately 120) of learners from the same population which will be sampled for the final phase of the project. These are French speaking grade 5 students in an intensive ESL program in the greater Montreal area. Two different organizations of the test results will be analyzed. A multitrait-multimethod matrix of two traits and seven methods will be analyzed according to the original Campbell-Fiske criteria and by means of LISREL. Assumptions required for identification will be noted. The tests' results will be organized also according to traits and facets of method as latent variables. These data will also be analyzed by means of LISREL.
We will report the MTMM matrix and summarize the findings about validity. We will also report the LISREL results and the structural models tested. Any differences between MTMM and facets-of-method results will be discussed. We will be able to report whether we have convergently and divergently valid tests of GK and CA which will allow continuation into the third phase of the study.
A Communicative Test in Analysis
Ingrid Wijgh
Final examinations for second language reading comprehension in the Netherlands consist of four long texts with about 40 multiple choice questions, and of a so-called "communicative part". In the communicative part there is a large variety of types of texts. In general the texts are short, authentic, printed in the original lay-out, and questions are based on the communicative approach.
A research project has been conducted to validate the communicative part, that is to answer the following questions:
1. In authentic texts learners will have to deal with a large number of unknown words. Does this affect their performance in a negative way?
2. Are the strategies used in answering the questions task-based or learner-based?
3. Are learners able to choose the most efficient strategies?
4. Do learners use information provided by the context?
Protocol analysis (an adapted version of Elshout's 7-steps method) was used to gather data on learners' behaviour. Thirteen items of reading comprehension were given to 13 learners coming from two different schools. Data were collected on tape and analyzed afterwards. The main findings were that learners use one basic strategy, which is not always the most efficient one; that they do use the information provided by the context, but only as a second choice; and that unknown words do not bother them.
Protocol analysis also provided a lot of information about the test items. This method could be very useful as part of test development construction.
Discourse Variation in Oral Proficiency Interviews
Richard Young, Michael Milanovic
In this paper, a theoretical model of dyadic native speaker/non-native speaker (NS/NNS) discourse is described in terms of three features: interactional contingency, goal orientation of participants, and dominance. The model is then used to study the discourse of 30 dyadic oral interviews of the Cambridge First Certificate in English (FCE) examination.
The results of the study demonstrate the effectiveness of the model in abstracting the structure of oral interview discourse. They show that the discourse of oral proficiency interviews is characterized by greater reactiveness by NNS candidates and greater orientation toward goals by NS examiners. Variation in the structure of the discourse is also investigated in this study. This is shown to be related to the examiner, the theme of the interview, the task in which the participants are engaged and the gender of the examiner and candidate.
Posters
TSE Validation in Federal Government Language Programs
Eduardo Cascallar, Marijke Walker
This is a validation study of the TSE in a government setting. This validation was carried out against ratings in the current version of the Federal Interagency Language Roundtable (FILR) Oral Proficiency Interview, and against supervisors' evaluations of the participating government employees, as regards their speaking proficiency on the job. Reliability of the ratings between the government raters, and between these raters and ETS scores of the same tapes was also analyzed. The analyses also take into account several other background variables. In particular, it was important to study the validity of the TSE to evaluate the functional foreign language speaking ability in this new context, and to obtain an accurate and immediate indication of language proficiency, as well as to suggest what type of further training might be required.
A total of 120 non-native English speaking adult subjects were evaluated with the OPI and the TSE. Reliability of the ratings between the trained government raters, and between the government raters and ETS ratings of the same tapes, was obtained. The TSE tapes were double-scored in each site, using a counterbalanced design so that each rater was paired with each other rater on the same number of tapes. Validity was examined using the OPI scores obtained from each examinee. As a second component of the validation study, supervisors of the examinees were asked to evaluate the participating government employees as regards their speaking proficiency on the job. This latter measure was also analyzed together with the TSE and the OPI scores in order to determine the relationship between all of these measures and additional background information including demographic, linguistic, and attitudinal data.
Evidence of a strong positive correlation (r = .79) between TSE (Overall Comprehensibility) scores and OPI ratings has been found. This was achieved without the personnel and logistical costs of the traditional OPI testing. In addition, diagnostic scores for pronunciation, grammar, and fluency were also included in the assessment. Overall, the TSE appears to be a good indicator of the adequacy of the examinees' spoken English proficiency for those situations that have been examined. Implications for further test development, "special purpose oral proficiency testing", and training, will be discussed, as well as detailed results and conclusions.
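The counterbalanced double-scoring design described above, in which each rater is paired with each other rater on the same number of tapes, can be sketched as follows. This is only an illustration under stated assumptions: the rater labels, the tape identifiers and the function name are hypothetical, and the study's actual assignment procedure is not described in the abstract.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_rater_pairs(raters, tapes):
    """Assign one pair of raters to each tape, cycling through all pairs.

    The number of tapes must be a multiple of the number of distinct
    rater pairs so that every pair scores the same number of tapes.
    """
    pairs = list(combinations(raters, 2))
    if len(tapes) % len(pairs) != 0:
        raise ValueError("tape count must be a multiple of the pair count")
    return {tape: pair for tape, pair in zip(tapes, cycle(pairs))}


# Hypothetical site with four raters and twelve tapes.
raters = ["R1", "R2", "R3", "R4"]
tapes = [f"tape{i:02d}" for i in range(1, 13)]
schedule = assign_rater_pairs(raters, tapes)

# Each of the six possible pairs appears on exactly two tapes.
pair_counts = Counter(schedule.values())
print(pair_counts)
```

With four raters there are six distinct pairs, so twelve tapes give each pair two tapes and each rater six tapes in total.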
Validity of Written Indicators of Second Language Proficiency
Alister Cumming, Dean Mellow
This cross-sectional study sought to identify linguistic features which validly indicate second language development, considering factors which often threaten the validity of measures of language proficiency, including the limitations of holistic or general measures, of accuracy-based measures, and of variations according to modality (written vs. oral), genre, and first language. The study analyzed 113 compositions of various genres (letter, argument, summary, cause/effect, description) written by adult students (24 Canadian Francophones and 45 Japanese visiting Canada) in intensive ESL programs. Specific measures such as TOEFL, oral interviews, class placements, and self-reports were used to categorize subjects within each first language group. Texts were analyzed for four features: lexical articles, plural -s nouns, 3rd person -s on verbs, and a type/token ratio of different words. Preliminary multivariate analyses indicate that the construct of proficiency, as operationalized, accounts for variance in accuracy of article use for both language groups (French, F = 9.6, p < .006; Japanese, F = 4.6, p < .03), but not for the other 3 dependent variables. Further sub-analyses of article and plural -s use are being conducted for hypothesized developmental stages (Pienemann and Johnston 1987; Master 1987; Peyton 1990). Genre differences account for some variation, interacting with article use, plural -s markers, and type-token ratios.
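Of the four features, the type/token ratio is the most mechanical to compute, and a minimal sketch may help readers unfamiliar with it. The tokenization rule used here (lowercased runs of letters and apostrophes) and the function name are assumptions made for illustration; the abstract does not say how the authors segmented or counted words.

```python
import re


def type_token_ratio(text):
    """Return distinct word forms (types) divided by running words (tokens)."""
    # Assumed tokenization: lowercase, letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 11 tokens, 5 distinct types.
sample = "The cat saw the dog and the dog saw the cat"
ratio = type_token_ratio(sample)
print(round(ratio, 3))
```

The ratio tends to fall as texts get longer, so comparisons across compositions of different lengths usually need some control for length.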
Hypothesis Testing in Construct Validation
Sara Cushing, Brian Lynch
Construct validation in language testing has frequently been investigated through correlational techniques such as factor analysis and multidimensional scaling of test data to discover underlying traits of test takers (e.g., Davidson 1988; Oltman and Stricker 1990). Another approach to construct validation is to use test data to test hypotheses about test taker attributes and their relationship to test scores. For example, one might hypothesize that students whose experience with English comes largely from a foreign language classroom might score better on a discrete-point test of grammar but worse on a test of extended listening than immigrant students who have learned English as a second language in an English-speaking country.
UCLA's recently revised English as a Second Language Placement Examination (ESLPE) provides a rich source of data for such hypothesis-testing. The test consists of three subtests - Listening/Notetaking, Reading/Vocabulary, and Composition - which attempt to mirror the academic language skills required of university students. This paper describes efforts taken to validate the construct of "Academic Language Proficiency" (ALP) through hypothesis testing. The test, along with a discrete-point grammar test from an earlier version of the ESLPE, will be given to approximately 600 UCLA students in Fall 1991, who fall roughly into two categories: undergraduate students who were immigrants, and graduate students who are in the United States on student visas. Data from these tests will be used to investigate hypotheses such as the one stated above in an attempt to come to a better understanding of ALP and how it can be tapped in a university placement examination.
Canadian Test of English for Scholars and Trainees: CanTEST
Margaret Des Brisay, Jennifer St.John
The Canadian Test of English for Scholars and Trainees (CanTEST), in its English version and the French version, the Test pour étudiants et stagiaires au Canada (TESTCan), is a sophisticated test of language ability originally developed for Chinese candidates entering a Canadian academic and/or work environment. The CanTEST item bank has been used to compile tests for use in Canadian-funded projects in Indonesia and for use as an admissions test at several Canadian universities. A new version for short-term trainees has also been developed.
To date, the CanTEST has been successfully used with over 5,000 candidates. The CanTEST includes measures of four language skills (reading, writing, speaking and listening). Scores are reported in each of the four skills areas using a "band system" that relates test scores to a descriptive statement about the candidate's ability. Special features of the CanTEST include a test of skimming and scanning skills.
The CanTEST has been specifically designed to be responsive to the training objectives of Canadian-funded overseas language centres, and to the requirements of individual aid programs. As well, material is chosen for its compatibility with the educational background and world knowledge of those taking the test.
The poster session will focus primarily on two sub-tests which illustrate the special features described above. Presenters will also deal with questions concerning the suitability and availability of the CanTEST versions for other programs.
CHEMSPEAK: An Update
Dan Douglas, Larry Selinker
CHEMSPEAK, a specific purpose test of oral English ability for teaching assistants in chemistry, was reported on at the 1991 LTRC. It was suggested, based on somewhat limited data, that there may be a measurement advantage to using a field-specific language test, when the purpose is to make field-specific judgements. Since then, more data have been collected and a clearer picture of test performance has emerged.
In the poster session we will report on the latest analysis of results and elaborate on a procedure for constructing such field-specific tests, which primarily involves the manipulation of method facets, and a validation procedure which includes an interlanguage analysis of subjects' test performance.
Multiple-Choice Summary: A Measure of Test Comprehension
Ari Huhta, Elina Randell
The study aimed at finding an easily administrable alternative to summaries, open-ended questions, and other test types which purport to measure the comprehension of the main points in a text. Finnish LSP teachers, who would like to test the understanding of main ideas, need such a method because often they have to test several hundred students at a time and give out the results within only a few days. Valencia and Pearson (1987) suggest that the multiple-choice summary (a text accompanied by alternative summaries, one of which is the best one) might be a good reading test. However, there appear to be no studies on the matter.
About 300 students (humanists and social scientists) at the University of Jyväskylä took a reading test which consisted of conventional multiple-choice and open-ended questions, a summarization task, and a multiple-choice (MC) summary. The students also filled in a questionnaire with questions about, for example, perceived sources of difficulty in reading and the face validity of the tests and texts.
The design of distractors in the MC summaries was based on features characteristic of poor (conventional) summaries.
The hypothesis was that those who do well on the tasks and questions aimed at understanding the main ideas should be the same ones who choose the right alternative in the MC summaries. Also, those who claim to be good at finding the main ideas in a text should be the same ones who make the right MC choice. An analysis of variance suggests that these hypotheses were partly correct.
The main problem with the new method is reliability. We tried to tackle the problem by asking the testees to justify their decisions in order to check whether they were just guessing. These justifications, however, seem difficult to interpret.
Implementation Issues: National Dutch Language Certification Examinations
Jan Hulstijn
In 1991, the Dutch Minister of Education appointed a committee to advise him on the establishment of a system which would make it possible for adult non-native speakers of Dutch to acquire language certificates, and provide information on the anticipated civil effect (the degree of acceptance) of such certificates in Dutch society. The committee report, published in November 1991, was well received. The poster will give details about the recommendations and the implementation process currently underway.
TOEFL 2000: Future Directions
Jacqueline Ross, Susan Chyn
Since 1976 the Test of English as a Foreign Language (TOEFL) has been a three-section, multiple-choice test consisting of test items designed to assess listening comprehension, structure and written expression, and vocabulary and reading comprehension.
While there is a notable body of research supporting the validity and reliability of the TOEFL exam as it exists today, there is also a concomitant recognition that the exam might be improved. Program staff and external committee members are exploring ways to refine the TOEFL within certain constraints. This poster session will present program needs and test development activities directed toward the design of the TOEFL as it might be presented in the future.
The presenters will actively invite audience participation and encourage attendees to offer suggestions as to how English proficiency might best be evaluated in the future to be of most benefit to all of TOEFL's constituents, examinees, instructors and score users.
The A-LINC Assessment
Helen Tegenfeldt, Virginia Monk
Immigration Canada is starting long-range planning for a future nationwide program to provide language training for immigrants, which they have termed Language Instruction for Newcomers to Canada (LINC). Part of the planning process includes developing an assessment to determine who needs the language training, and how much of the sponsored language training is appropriate for that person. This assessment will be used in Employment and Immigration Centres (EIC) across Canada, and since it will be used to access the LINC program, it is being called the A-LINC.
The A-LINC targets only lower levels of English ability, from pre-literacy to pre-intermediate; it is intended to be given by non-ESL professionals; it takes from 10 to 20 minutes depending on the level of the client; it is more of a decision-making process than a formal test, and comprises a series of tasks that an EIC client is asked to do. At various points throughout the interview the counsellor may decide the client has reached the limit of his or her English ability and stop the interview, assigning the client to the appropriate level of language training associated with that exit point.
An earlier research project developed the basic form of this assessment, and tested the bottom and top exit points specifically. The current research project has added more exit points, based on EIC requirements, which are being tested on EIC clients, using both ESL and non-ESL interviewers; results will also be correlated with a language training program's assessment of these clients, just prior to their entering class, and then following up with teachers' assessments after a month of class.
The Ontario Test of ESL: OTESL
Marjorie Wesche, Doreen Ready
The purpose of this poster session is to make the OTESL materials available for inspection by interested persons in the field. This battery of performance-based EAP instruments includes a 45-minute Placement Test with listening, reading, writing and speaking components, specifically designed for use in intensive pre-admission ESL programs, as well as a 2½-hour Post-Admission Test which measures performance in reading, listening and writing, and an Oral Interaction Test. The latter instruments are designed for students already admitted to academic programs who may still need further English instruction. They measure global proficiency levels according to assessment scales based on functional descriptions, and provide diagnostic feedback on speaking and writing. Alternative science and non-science versions of these two tests are available. The Oral Interaction Test is also noteworthy for the task difficulty hierarchy and adaptive format.
The OTESL instruments were developed from detailed specifications using authentic source materials to reflect academic discourse types, topics and tasks which simulate the language use needs identified for ESL speakers in North American academic situations. The development process and the instruments themselves reflect the considerable advantages as well as the logistical drawbacks of a theme-based performance approach to EAP testing. The test manual provides information on validity and reliability. The OTESL, developed by a team of researchers for the Ontario Ministry of Colleges and Universities from 1983-88, was published by the Joint Language Testing Service of the Universities of Ottawa and Toronto in 1990.
Software Applications/Demonstrations
Applications of TESTGRAF in Setting Cut-off Points on ESL Tests
Margaret Des Brisay
The presenter will show both colour monitor displays and hard copies of the output from the TESTGRAF analysis of a multiple choice ESL test battery.
Developed by Dr. Jim Ramsay of McGill University, TESTGRAF provides all the usual indicators of multiple choice item performance, including 3-parameter IRT estimates. There are two special features of this program which make it an attractive addition to the present set of software options for language testers. First of all, all output from the analysis, including option characteristic curves, test information functions, and a principal components analysis of correct option item characteristics, can be graphically displayed on a colour monitor (or printed out on a PostScript printer), making the information easier to assimilate. Secondly, and more importantly, TESTGRAF makes use of the information to be obtained from the various wrong options which were chosen for incorrectly answered items. The output is displayed in "examinee credibility plots" so that, for each examinee, the likelihood or relative credibility of the true proficiency being at value θ is shown.
Deciding whether an examinee scores beyond a certain cut point makes heavy demands on precision estimates. Using TESTGRAF, it may be possible to ensure that only people who are statistically rather than numerically below a cut point are rejected.
Using the Information Curve to Assess Language CAT Efficiency
Michel Laurier
Since a placement test is designed to assess students' general proficiency on a broad ability range, adaptive testing is particularly appropriate for placement purposes. In our study, we developed a computerized adaptive placement test in French at the post-secondary level. This test seeks to measure abilities underlying various uses of the second language and is composed of three sub-tests. The whole test is administered using a micro-computer and is available with two different item selection algorithms, both based on IRT procedures. The item banks were also used to create two conventional "paper and pencil" parallel versions.
As reliability indices are meaningless in an IRT context, we used the information curve to assess the CAT efficiency versus the conventional versions. Efficiency is defined as the ratio between the information that is obtained and the number of items that have been presented. The study shows that the adaptive form uses far fewer items than the "paper and pencil" form to achieve the same reliability. However, data obtained from post-secondary students using one form as a pre-test and the other one as a post-test show that the test scores cannot be compared because the marking procedures are different.
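The efficiency ratio defined above can be illustrated with a short sketch. It assumes a two-parameter logistic (2PL) model, under which an item with discrimination a and difficulty b contributes information a²P(1 − P) at ability θ; the abstract does not state which IRT model was used, and the item parameters and function names below are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Fisher information of one 2PL item at ability theta (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def efficiency(theta, items):
    """Information obtained at theta divided by the number of items presented."""
    total = sum(item_information(theta, a, b) for a, b in items)
    return total / len(items)


# Hypothetical forms: (discrimination, difficulty) pairs.
fixed_form = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
adaptive_form = [(1.0, b) for b in (-0.5, 0.0, 0.5)]  # items near theta

theta = 0.0
print(round(efficiency(theta, fixed_form), 3))
print(round(efficiency(theta, adaptive_form), 3))
```

Because the adaptive form in this toy example presents only items whose difficulty is close to the examinee's ability, each item carries more information and the ratio is higher than for the fixed form.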
CAT efficiency is often obtained at the expense of validity because of the constraints of the IRT framework and the technology limitations. Therefore one must be very careful using an adaptive procedure for language testing.
Rapid Profile: On-line Diagnostic Screening of Interlanguage Acquisition by Expert System
Manfred Pienemann, Ian Thornton, Alison Mackey
Rapid Profile is based on psychologically plausible and objectively measurable reference criteria obtained from second language acquisition research. In our paper we will show that the close relationship between SLA research and our screening procedures ensures a high degree of construct validity. We will also demonstrate that the procedure has a high degree of reliability.
Rapid Profile is a second language screening procedure which allows the analyst to assess a learner's level of development against standard patterns in the acquisition of the target language. The procedure is based on a sample of natural speech which is rapidly analyzed with the help of a computer program. The procedure has evolved from detailed empirical studies of English as a second language which established fixed and psychologically plausible reference criteria for the screening process.
We will demonstrate the construct validity of Rapid Profile and deal with two major reliability aspects of the procedures: 1) the efficient elicitation of relevant data and 2) the level of accuracy obtained in the observation task through interactive training.
Firstly, we will demonstrate that developmental interlanguage features can be selectively triggered by specific communicative tasks. In an experiment we analyzed the spontaneous speech produced by ESL learners and native speakers in response to a set of communicative tasks. The morpho-syntactic structures produced by the subjects varied predictably according to task. This predictability can be used to increase the speed and reliability of our procedure.
Secondly, we will demonstrate that the accuracy of Rapid Profile analysis is comparable to full linguistic profiling. We trained 10 analysts using an interactive audio CD technique which realistically simulates the observation task. The training software recorded and analyzed the performance of the analysts.
Symposia
The Educational and Social Impact of Language Tests
As validity is the main theme of the 1992 LTRC Colloquium, the symposium will address the expanded view of validity adopted by Messick (1988, 1989) and others, whereby construct validation also includes evidence of tests' use: specifically, the utilization and consequences of test results by decision makers, and the impact that tests have on test takers, teachers, and the educational and social system in which they operate. Accordingly, the role of the tester does not end in the development phase; the tester also needs to examine issues of test use, relevance, ethics and the interpretation of scores by decision makers and students. Thus, "testers cannot be viewed any more as technicians who complete their work when they reach satisfactory reliability coefficients, instead they must also consider the consequences, social, psychological, ethical, curricular, and educational of the tests which they produce."
Introduction
Elana Shohamy, Eduardo Cascallar
The introduction provides the framework through which to view the three papers and outlines the ethical issues raised by language testing.
Does Washback Exist?
Charles Alderson and Diane Wall
The concept of backwash, or washback as it has been named in the language testing literature, i.e., the influence of a test on teaching, is commonly encountered in textbooks on language testing, and its existence, importance and influence are frequently asserted, and very occasionally exemplified by anecdotes. However, a review of the use of the term washback reveals that the concept is rarely defined in the literature, and even less commonly investigated empirically. If washback is a powerful phenomenon as is asserted, then it would seem important to begin to understand what exactly it consists of, what effects it brings about and how. If we are to attempt to engineer beneficial washback with our tests, or even if we wish to use our tests as "levers for change" (Pearson 1988), then we need to define what sort of influence tests in general and tests specifically might and do have on teaching and the curriculum.
This paper is intended to initiate a debate about what washback might consist of, what research has been conducted to date into the "phenomenon", and how we as language testers might devise a series of studies to examine the alleged washback of our tests.
The Power of a Test: A Study on the Effect of a Reading Comprehension Test on Language Learning
Elana Shohamy
The issue of test use has gained major attention recently with the increased use and introduction of new tests as devices aimed at solving major educational problems. There is a belief by educators, politicians and administrators that the power and authority of tests, per se, will be instrumental in saving and upgrading declining academic achievements. Although a number of studies have been conducted recently to investigate the impact that tests have on learning, instruction and the school social structure (Smith 1991; Shepard 1991), not much is known about the impact and use of language tests.
This paper reports the results of a study that examines the impact of newly introduced language tests on language learning. The main focus will be on a reading comprehension test administered nationally to 5th and 6th grade students, but reference will be made to other language tests. Data was collected through observations of classes, teaching material and other documents produced before and after the test administration (e.g., report cards, lesson plans, teacher notebooks, Ministry of Education notes and reports), and through questionnaires and interviews with students, teachers and parents.
Results indicate that tests have a strong impact on learning. This impact is mostly instrumental and not conceptual, and varies according to a number of variables such as the purpose of the test, whether the group failed or passed, the status of the principals, etc. The nature of the impact will lead to conclusions regarding possible solutions for integrating tests with instruction, for increased responsibility of testers as to the use of the tests, for sharing testing discourse with the users, and for less use of tests for power and control.
Examining Washback: The Sri Lankan Impact Study
Dianne Wall, Charles Alderson
This paper is intended to contribute to the investigation of washback and its alleged effects by reporting on the results of a four-year project looking at a particular public examination in a Third World context, and examining the impact of innovations in language tests on teachers' and pupils' behaviours in the classroom. The paper reports on the use of a variety of classroom observation instruments and questionnaires and on the content analysis of a number of teacher-made tests in an attempt to establish the degree to which new tests can be said to have influenced teaching.
The Development and Use of Rating Scales in Language Testing
The four papers in this symposium discuss research issues in the development and application of scales to rate linguistic performance. The papers highlight theoretical issues and illustrate empirical techniques for research use. The objective of the symposium is not only to transmit information, but also to provide a forum for the discussion of issues involved in using raters to judge linguistic performance. As measurement of language becomes more holistic and the use of performance assessments in the measurement field as a whole increases, we believe that scale development will become increasingly important in language testing research.
Introduction
Dorry Kenyon
The introduction sets the stage for the presentation of the papers. Issues common to all the papers are highlighted and a definition of terms is provided.
Direct Scalar Rating vs. Multiple Binary Rating of ESL Learner Writing
Fred Davidson, Liz Hamp-Lyons, Dorry Kenyon
How can raters be encouraged to consider (and use) an entire rating scale rather than "favourite" scores? Multiple Binary Rating (MultiBin) has been suggested as an alternative to Direct Scalar Rating (DSR) as a means to avoid discrepancies in scale-step probabilities. This paper reports on the analysis of data sets from two operational settings in which binary procedures have been attempted. The analysis of qualitative data collected from raters using both DSR and MultiBin is also presented. Discussion focuses on the advantages and disadvantages of MultiBin over DSR.
Validation of a New Holistic Rating Scale Using Rasch Multi-Faceted Analysis
Belle Tyndall, Dorry Kenyon
What happens to the accuracy of placing students within a college EFL program when an analytical scale is replaced by a holistic essay rating scale in which the scale points reflect the program levels? This paper discusses the development and analysis of such a scale. A multi-faceted Rasch analysis is used to investigate how the scale was applied by 11 raters scoring over 200 papers. Correlational analysis and an analysis of misplacements provide further evidence of the new scale's validity. Feedback from raters on how they applied the scale is also discussed.
Developing Rating Scales for the CASE: Theoretical Concerns and Analyses
Michael Milanovic, Nick Saville, Anne Cook
What are the relationships between the development of a rating scale and models of communicative competence, practical considerations of testing procedures, test tasks, and the pool of test raters and test takers? What are the appropriate methods to use to determine appropriate scale points? This paper uses the development of the Cambridge Assessment of Spoken English (CASE, presented at a poster session at LTRC 1991) to illustrate 1) the need to treat the development/validation of rating scales as an integral part of the overall test development process and 2) the analysis of rating scales. Analyses of ratings obtained from "live" pilot administrations of the CASE using Rasch techniques are reported.
Rating Scales and Native Speaker Performance on a Communicatively Oriented EAP Test
Tim McNamara, Jan Hamilton, Eileen Sheridan
What are the implications for the use of native speaker performance as a reference point (explicit or implicit) in scalar descriptions of second language speaker performance? What are we actually testing in communicatively oriented tests of English for academic purposes? This paper examines the construct validity of such tests in the light of a discussion of the concept of "performance testing" in ESP contexts. A distinction is proposed between a strong and a weak sense of the term "performance test". The extent to which EAP tests are simultaneously tests of language proficiency and other skills (e.g., study skills) is considered. The results of two studies of the performance of native speakers on the IELTS test are examined in the light of the above distinction.
Summary
Dorry Kenyon
Addresses of Program Participants
Charles Alderson
Dept. of Linguistics & Modern English Language
Bowland College
University of Lancaster
Lancaster LA1 4YT
United Kingdom
Kathleen Bailey
Monterey Institute of International Studies
426 Van Buren Street
Monterey, CA 93940
Anne Brown
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville Victoria 3052
Australia
James Dean Brown
Department of ESL
University of Hawaii
1890 East-West Road
Honolulu, HI 96822
Gary Buck
Monterey Institute of International Studies
425 Van Buren Street
Monterey, CA 93940
Eduardo Cascallar
Educational Testing Service
Rosedale Road, Mail Stop 10P
Princeton, NJ 08541-0001
Susan Chyn
Educational Testing Service
P.O. Box 6155
Princeton, NJ 08541
Caroline Clapham
Dept. of Linguistics
Lancaster University
Bailrigg
Lancaster
LA1 4YT England
Annette Cook
University of Cambridge
Syndicate Buildings
1 Hills Road
Cambridge CB1 2EU
ENGLAND
Alister Cumming
Ontario Institute for Studies in Education
Modern Language Centre
252 Bloor Street W.
Toronto, Ontario M5S 1V6
Sara Cushing
TESL Centre & Applied Linguistics
U.C.L.A.
3300 Rolfe Hall
Los Angeles, CA 90024
Fred Davidson
DEIL, 3070 FLB, UIUC
707 S. Mathews
Urbana, IL 61801
Alan Davies
University of Edinburgh
Department of Applied Linguistics
14 Buccleuch Place
Edinburgh EH8 9LN
Scotland, U.K.
John de Jong
CITO
P.O. Box 1034
6801 MG Arnhem
The Netherlands
Gerald DeMauro
Educational Testing Service
Mail Stop 30-P
Rosedale Road
Princeton NJ 08541
Margaret Des Brisay
Second Language Institute
University of Ottawa
600 King Edward
Ottawa, Ontario
K1N 6N5
Emily Detmer
Dept. of English as a Second Language
University of Hawaii
1890 East-West Road
Honolulu, Hawaii 96822
Dan Douglas
1105 Brookridge
Ames, IA 50010
Catherine Elder
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville Victoria 3052
Australia
Janna Fox
Centre for Applied Language Studies
215 Paterson Hall
Carleton University
Ottawa, Ontario
K1S 5B6
Claire Gordon
Department of English
The Open University of Israel
16 Klausner Street
Ramat Aviv, Tel-Aviv 61392
Israel
Jan Hamilton
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville, Victoria 3052
Australia
Liz Hamp-Lyons
Department of English and Applied Linguistics
Campus Box 175, P.O. Box 173364
University of Colorado at Denver
Denver, CO 80217-3364
Grant Henning
Educational Testing Service
Mail Stop 10-P
Rosedale Road
Princeton NJ 08541
Thom Hudson
Department of ESL
University of Hawaii
1890 East-West Road
Honolulu, Hawaii 96822
Ari Huhta
University of Jyväskylä
Language Centre for Finnish Universities
P.O. Box 35
SF-40351 Jyväskylä
Finland
Jan H. Hulstijn
Applied Linguistics Department
University of Amsterdam
De Boelelaan 1105
1081 HV Amsterdam
Netherlands
Dorry Kenyon
Center for Applied Linguistics
1118 22nd Street N.W.
Washington, D.C. 20037
Michel Laurier
Université de Montréal
Faculté des sciences de l'éducation
C.P. 6128, succursale A
Montréal (Québec)
H3C 3J7
Tom Lumley
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville, Victoria 3052
AUSTRALIA
Denise Lussier
Faculty of Education
McGill University
3700 McTavish
Montreal, Quebec H3A 1Y2
Brian Lynch
P.O. Box 1926
Beverly Hills, CA
90213 U.S.A.
Alison Mackey
L.A.R.C.
University of Sydney
Sydney, Australia
NSW 2006
Tim McNamara
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville, Victoria 3052
Australia
Dean Mellow
University of British Columbia
Centre for the Study of Curriculum and Instruction
2125 Main Mall
Vancouver, B.C. V6T 1Z4
Michael Milanovic
University of Cambridge
Syndicate Buildings
1 Hills Road
Cambridge CB1 2EU
ENGLAND
Virginia Monk
Vancouver Community College
Box N 24620 Station "C"
Vancouver, B.C.
V5T 4N3
Kyle Perkins
Linguistics Department
Southern Illinois University
Carbondale IL 62901
Manfred Pienemann
L.A.R.C.
University of Sydney
Sydney, Australia
NSW 2006
Don Porter
Welwyn Garden City
Herts, England
AL8 7PW
Timothy Pychyl
Department of Psychology
Carleton University
Ottawa, Ontario K1S 5B6
Elina Randell
University of Jyväskylä
Language Centre for Finnish Universities
P.O. Box 35
40351 Jyväskylä
Finland
Doreen Ready
University of Ottawa
Second Language Institute
600 King Edward Avenue
Ottawa, Ontario
K1N 6N5
Jacqueline Ross
Educational Testing Service
P.O. Box 6155
Princeton, NJ 08541
Steven Ross
Department of English as a Second Language
University of Hawaii at Manoa
1890 East-West Road
Honolulu, HI 96822
Nick Saville
University of Cambridge
Local Examinations Syndicate
Syndicate Buildings, 1 Hills Road
Cambridge CB1 2EU
England
Larry Selinker
ELI, University of Michigan
Ann Arbor, MI
48109 U.S.A.
Eileen Sheridan
Language Testing Centre
Dept. of Linguistics and Language Studies
University of Melbourne
Parkville, Victoria 3052
Australia
Bruce Shields Henderson
Santa Clara University
Room 223, St. Joseph's Hall
Santa Clara, CA 95033
Elana Shohamy
Tel Aviv University
School of Education
Tel Aviv, Israel
69978
Peter Skehan
53 The Avenue
Muswell Hill
London WC1H
ENGLAND
Bernard Spolsky
Department of English
Bar Ilan University
52-100 Ramat-Gan
Israel
Jennifer St.John
Second Language Institute
University of Ottawa
600 King Edward
Ottawa, Ontario
K1N 6N5
Charles W. Stansfield
Center for Applied Linguistics
1118 22nd Street, N.W.
Washington D.C. 20037
Carol Taylor
International Testing and Training Programs
Educational Testing Service
Princeton, N.J. 08541
Helen K. Tegenfeldt
Vancouver Community College
1155 E. Broadway
Box 24620 Station C
Vancouver B.C. V5T 4N3
Ian Thornton
L.A.R.C.
University of Sydney
Sydney, Australia
NSW 2006
Carolyn Turner
Faculty of Education
McGill University
3700 McTavish
Montreal, Quebec
H3A 1Y2
Belle Tyndall
Department of English as a Foreign Language
Academic Center, T607
The George Washington University
Washington, D.C. 20052
John Upshur
TESOL Centre
Concordia University
1455 de Maisonneuve Blvd. West
Montreal Quebec H3G 1M8
Barbara Voltmer
Bay Area Essay Reading Office
Educational Testing Service
Emeryville, CA
Marijke Walker
Federal Bureau of Investigation
Language Services Unit, Room GRB 210
Washington, D.C. 20535
Dianne Wall
I.E.L.E.
University of Lancaster
Lancaster LA1 4YT
ENGLAND
Marjorie Wesche
Second Language Institute
University of Ottawa
600 King Edward Avenue
Ottawa, Ontario
K1N 6N5
Ingrid F. Wijgh
Numankade 8
Utrecht, Holland
3572 KZ
Richard Young
Department of Linguistics
Southern Illinois University at Carbondale
Carbondale, Illinois 62901-4517
Bruno D. Zumbo
Faculty of Education
University of Ottawa
Ottawa, Ontario K1N 6N5