LANGUAGE TESTING RESEARCH COLLOQUIUM
Cambridge & Arnhem, 1993

Steering Committee

- Mari Wesche

- John de Jong

- Dorry Kenyon

Vetting Committee

- Ludo Verhoeven

- John de Jong

- Mike Milanovic

- Mari Wesche

- Dorry Kenyon

Local Organisers Cambridge:

- Mike Milanovic

- Nick Saville

- Helen Goring

- Chris Banks

- Mary Darkin

- Adam Hooper

Local Organisers Arnhem:

- John de Jong

- Inge Hermsen

- Beleza Joséphy

- Marion Feddema

- José Noijons

- Ludo Verhoeven

LANGUAGE TESTING RESEARCH COLLOQUIUM

Part 1

2nd - 5th August, 1993

"Performance Testing"

Queens' College, Cambridge

Monday 2nd August

1.00 - 6.00 pm Registration at the Registration Office

2.00 - 4.00 pm Walking tour of Cambridge city and colleges

(meet at the main gate)

6.30 - 6.45 pm Welcome from John Reddaway, Secretary General of UCLES, in the Fitzpatrick Hall

6.45 - 7.30 pm Presentation by Bernard Spolsky in the Fitzpatrick Hall

Testing the English of foreign students in 1930. (42)

7.30 - 8.00 pm Welcoming drinks reception in the Old Hall

8.00 - 9.00 pm Buffet in the Old Hall


Publishers' Exhibition
Publishers' stands will be located in the Angevin Room throughout the conference.


Tuesday 3rd August

7.30 - 9.00 am Breakfast in the New Hall

9.00 - 9.15 am Opening and Introduction in the Fitzpatrick Hall

9.15 - 11.00 am Paper session 1 in the Fitzpatrick Hall

Chair: Alan Davies

Elana Shohamy, Smadar Donitsa-Schmidt, and Ronit Waizer:

The effect of the elicitation modes on the oral language obtained on language tests. (41)

Richard Berwick and Steven Ross:

Cross-cultural pragmatics in oral proficiency interview strategies. (3)

Sarah Briggs:

Using 'real life' academic challenges for evaluating communicative skills in English. (5)

11.00 - 11.30 am Coffee/tea break in Cripps Court

11.30 - 12.40 pm Paper session 2 in the Fitzpatrick Hall

Chair: Andrew Cohen

Annie Brown:

The effect of rater variables in the development of an occupation-specific language performance test. (6)

Michael Milanovic, Nick Saville, and Shen Shuhong:

A study of the decision-making behaviour of composition markers. (32)

12.40 - 1.45 pm Lunch in the New Hall

1.45 - 3.30 pm Paper session 3 in the Fitzpatrick Hall

Chair: Caroline Clapham

Charles Stansfield and Dorry Mann Kenyon:

Evaluating the efficacy of rater self-training. (44)

Tom Lumley and Tim McNamara:

Rater characteristics and rater bias: implications for training. (28)

Anne Lazaraton:

A qualitative approach to monitoring examiner conduct in the Cambridge Assessment of Spoken English (CASE). (25)

3.30 - 4.00 pm Tea/coffee break in Cripps Court

4.00 - 5.10 pm Paper session 4 in the Fitzpatrick Hall

Chair: Dan Douglas

Kieran O'Loughlin:

The assessment of writing by English and ESL teachers. (34)

Carol Taylor:

A study of writing tasks assigned in academic degree programs. (45)

5.10 - 5.40 pm Presentation of posters 1-5 in the Fitzpatrick Hall

5.40 - 6.10 pm Poster session (1-5) in the Bowett Room

Poster Presentations

1 Robert Courchene and Doreen Ready:

Summary Cloze: What is it? What does it measure?

2 Cathie Elder:

Are raters' judgements of language teacher effectiveness wholly language-based?

3 Liz Hamp-Lyons:

Applying ethical standards to portfolio assessment in ESL.

4 Yasmeen Lukmani:

Linguistic accuracy versus coherence in assessing examination answers in content subjects.

5 Paul McCann and A. Teasdale:

A pilot study into task difficulty factors in a test of English for specific purposes.

6.10 - 7.30 pm Tour of Cambridge Botanic Gardens (meet at the main gate)

7.30 - 9.00 pm Dinner in the New Hall

Wednesday 4th August

7.30 - 9.00 am Breakfast in the New Hall

9.00 - 9.15 am Opening in the Fitzpatrick Hall

9.15 - 11.00 am Paper session 5 in the Fitzpatrick Hall

Chair: Charles Alderson

Lyle Bachman, Brian Lynch, and Maureen Mason:

Investigating variability in tasks and rater judgements in a performance test of foreign language ability. (1)

Tim McNamara and Tom Lumley:

The effects of interlocutor and assessment mode variables in offshore assessment of speaking skills in occupational settings. (31)

Alastair Pollitt and Neil Murray:

What oral raters really pay attention to. (36)

11.00 - 11.30 am Coffee/tea break in Cripps Court

11.30 - 12.40 pm Paper session 6 in the Fitzpatrick Hall

Chair: Elana Shohamy

Gillian Wigglesworth and Kieran O'Loughlin:

An investigation into the comparability of direct and semi direct versions of an oral interaction test. (49)

John de Jong, Jan Mets, and Fellyanka Stoyanova:

Large scale testing of oral proficiency. (14)

12.40 - 1.45 pm Lunch in the New Hall

1.45 - 3.30 pm Paper session 7 in the Fitzpatrick Hall

Chair: John Upshur

Alex Teasdale:

Content validity in tests for well-defined LSP domains: an approach to defining what is being tested. (46)

Eduardo Cascallar and Marijke Cascallar-Walker:

New technologies for the assessment of oral proficiency in second language learners. (9)

Elaine Wylie:

An aspect of the reliability of the speaking test of the International English Language Testing System (IELTS). (50)

3.30 - 4.00 pm Tea/coffee break in Cripps Court

4.00 - 4.30 pm Summary Presentation

4.30 - 4.55 pm Presentation of posters 6-9 in the Fitzpatrick Hall

4.55 - 5.25 pm Poster session (6-9) in the Bowett Room

Poster Presentations

6 Don Porter:

Performance conditions: stability of effects across cultural groups and task types in the assessment of spoken language.

7 Daniel Reed:

Procedures for improving the predictive value of indirect tests, based on scores from the OPI and TWE, and an item-by-item analysis of a multiple-choice examination.

8 Marcia Reeves:

The impact of rater quality control and training/retraining procedures on rater reliability in scoring Test of Spoken English response tapes.

9 Marie-Christine Sprengers:

The feasibility of the guided summary as a test of reading comprehension.

5.30 - 7.00 pm Open-topped double-decker bus tour of Cambridge area

(meet at the main gate)

7.15 - 7.45 pm Drinks reception in the Old Hall

7.45 - 8.00 pm Conference photograph

8.00 - 10.00 pm Gala dinner in the New Hall

Thursday 5th August

Departure and transfer to Arnhem

7.30 - 9.00 am Breakfast in the New Hall

9.30 am Delegates vacate rooms

LANGUAGE TESTING RESEARCH COLLOQUIUM

Part 2

5th - 8th August, 1993

"Communication, Cognition & Assessment"

Hotel Haarhuis/CITO, Arnhem

Thursday August 5th (continued), Arrival & registration

4.00 pm Registration desk open at Haarhuis Hotel

6.00 pm Get Together Party, offered by CITO

8.00 pm Dinner

Friday August 6th

9.00 - 9.10 am Reopening of LTRC'93 by Ton van den Hout, General Director of CITO

9.15 - 10.15 am Modelling and assessment:

Chair: Douglas Stevenson

9.15 - 9.30 am James Dean Brown and Jacqueline Ross:

Decision Dependability of Item Types, Sections, Tests, and the Overall TOEFL Test Battery. (7)

9.35 - 9.50 am Eduardo Cascallar, Marijke Cascallar-Walker, Pardee Lowe, and James Child:

Development of New Proficiency and Performance Based Skill Level Descriptors for Translation: Theory and Practice. (10)

9.55 - 10.15 am Discussion

10.20 - 10.50 am Coffee/Tea break

10.55 - 11.55 am Assessing Multi-Faceted Skills:

Chair: Huub van den Bergh

10.55 - 11.10 am Tony Lee:

Taking a multi-faceted view of the uni-dimensional measurement from Rasch analysis in language tests. (26)

11.15 - 11.40 am Micheline Chalhoub:

Performance Assessment and the Components of the Oral Construct Across Different Tasks and Rater Groups. (11)

11.45 - 11.55 am Discussion

12.00 - 1.30 pm Lunch break

1.35 - 2.35 pm Summative Assessment Procedures:

Chair: Kees de Glopper

1.35 - 1.50 pm Sari Luoma:

Validating a general purpose foreign language test for adults. (29)

1.55 - 2.10 pm Marjorie Wesche, T. Sima Paribakht, and Doreen Ready:

A Comparative Study of Four Placement Instruments. (48)

2.15 - 2.35 pm Discussion

2.40 - 3.10 pm Tea/Coffee break

3.15 - 4.15 pm Assessing Lexical Skills:

Chair: Margaret Des Brisay

3.15 - 3.30 pm John Read:

The development of a new measure of L2 vocabulary knowledge. (38)

3.35 - 3.50 pm Swathi Vanniarajan: CANCELLED

Issues in the Design of Vocabulary Testing: Cognition and Assessment.

3.55 - 4.15 pm Discussion

4.30 - 5.30 pm LTRC Business meeting

5.45 - 6.15 pm Presentation of TOEFL AWARD 1993

6.30 - 7.00 pm Session reserved for discussing issues of

Language Testing Journal

7.45 - ?? am Dinner followed by Party Night

Saturday August 7th

8.30 - 10.05 am Reading Assessment:

Chair: Mari Wesche

8.35 - 8.50 am Craig Deville and Micheline Chalhoub:

Modified Scoring, Traditional Item Analysis, and Sato's Caution Index Used to Investigate the Reading Recall Protocol. (17)

8.55 - 9.10 am Frans Kleintjes, Gerrit Staphorsius and Norman Verhelst:

Predicting the Comprehension of Prose Varying by Difficulty: An application of the One Parameter Logistic Model. (23)

9.15 - 9.30 am Rob van Krieken:

Equating National Exams of Reading Comprehension in the Foreign Language. (24)

9.35 - 10.05 am Discussion

10.10 - 10.40 am Coffee break

10.45 - 11.45 am Cognition and Assessment:

Chair: Ludo Verhoeven

10.45 - 11.00 am Alison Green:

Cognitive Aspects of Performance and Assessment of the Meno Communicative Skills Components. (20)

11.05 - 11.20 am Christa Hansen and Christine Jensen:

The effect of prior knowledge on EAP listening test performance. (22)

11.25 - 11.40 am Eduardo Cascallar:

A new Cognitive Approach for the Assessment of Language Aptitude. (8)

11.45 - 12.15 pm Discussion

12.20 - 1.50 pm Lunch break

2.00 - 3.00 pm Multilingual Issues:

Chair: Nan Yeld

2.00 - 2.15 pm Ludo Verhoeven:

Early bilingualism, cognition and assessment. (47)

2.20 - 2.35 pm Margaret Des Brisay and Lise Duquette:

Acquisition versus Learning as Variables in the Trait Structure of L2 Test Data Sets. (16)

2.40 - 3.00 pm Discussion

3.00 - 3.15 pm Transfer to CITO main building

3.15 - 6.30 pm Workshops & Poster Sessions

(tea available)

Poster Presentations

1 Jean-Guy Blais and Michel Laurier

The Dimensionality of a Placement Test Components.

2 Alan Davies

Are Tests of Extensive Reading Different from Tests of Intensive Reading?

3 Margaret Des Brisay

A Methodology for Comparing ESL Admissions Tests.

4 Claire Gordon and David Hanauer

Test Answers as Indicators of Mental Model Construction.

5 Alastair Pollitt

Reporting reading test results in grades.

6 Susan Zammit

Motivation, test results, gender differences and foreign languages: How do they connect?

Computer workshops

1 José Noijons

Towards a checklist for Computer Aided Language Testing (33)

2 Jared Bernstein

Performance Testing with Speech Recognition: Methods and Validation (2)

3 Free Access for All

- Test Analysis with the One Parameter Logistic Model

- LangCred Database on Language Tests and Training Materials

7.15 - ?? pm Surprise Tour and Dinner Offered by CITO

Closing of LTRC'93

Sunday August 8th

Before noon Delegates vacate rooms and check out of hotel

9.00 - 12.00 pm ILTA Business Meeting

12.30 - 2.00 pm Lunch

2.37 pm Train for Amsterdam

Registration and Welcome Cocktails AILA 1993

Monday August 9th - Friday August 13th

10th AILA World Congress

List of all papers, posters, and workshops

presented at the

15th Language Testing Research Colloquium

Cambridge and Arnhem

August 2-7, 1993

1 Lyle F. Bachman, Brian Lynch, and Maureen Mason

Investigating Variability in Tasks and Rater Judgments in a Performance Test of Foreign Language Ability

2 Jared Bernstein

Performance Testing with Speech Recognition: Methods and Validation

3 Richard Berwick and Steven Ross

Cross-cultural Pragmatics in Oral Proficiency Interview Strategies

4 Jean-Guy Blais and Michel Laurier

The Dimensionality of a Placement Test Components

5 Sarah L. Briggs

Using 'Real Life' Academic Challenges for Evaluating Communicative Skills in English

6 Annie Brown

The Effect of Rater Variables in the Development of an Occupation-Specific Language Performance Test

7 James Dean Brown and Jacqueline A. Ross

Decision Dependability of Item Types, Sections, Tests, and the Overall TOEFL Test Battery

8 Eduardo C. Cascallar

A New Cognitive Approach for the Assessment of Language Aptitude

9 Eduardo C. Cascallar and Marijke I. Cascallar

New Technologies for the Assessment of Oral Proficiency in Second Language Learners

10 Eduardo C. Cascallar, Marijke I. Cascallar, Pardee Lowe, and James Child

Development of New Proficiency and Performance Based Skill Level Descriptors for Translation: Theory and Practice

11 Micheline Chalhoub-Deville

Performance Assessment and the Components of the Oral Construct Across Different Tasks and Rater Groups

12 Robert Courchêne and Doreen Ready

Summary Cloze: What is it? What does it measure?

13 Alan Davies and Aileen Irvine

Comparing Test Difficulty and Text Readability in the Evaluation of an Extensive Reading Programme

14 John De Jong, Jan Mets, and Fellyanka Stoyanova

Large Scale Testing of Oral Proficiency

15 Margaret Des Brisay

A Methodology for Comparing ESL Admissions Tests

16 Margaret Des Brisay, Lise Duquette, and Mohamed Dirir

Acquisition versus Learning as Variables in the Trait Structure of L2 Test Data Sets

17 Craig W. Deville and Micheline Chalhoub-Deville

Modified Scoring, Traditional Item Analysis, and Sato's Caution Index Used to Investigate the Reading Recall Protocol

18 Catherine Elder

Are Raters' Judgements of Language Teacher Effectiveness Wholly Language-Based?

19 Claire Gordon and David Hanauer

Test Answers as Indicators of Mental Model Construction

20 Alison J.K. Green

Cognitive Aspects of Performance and Assessment of the Meno Communicative Skills Components

21 Liz Hamp-Lyons

Applying Ethical Standards to Portfolio Assessment in ESL

22 Christa Hansen and Christine Jensen

The Effect of Prior Knowledge on EAP Listening Test Performance

23 Frans G.M. Kleintjes, Gerrit Staphorsius, and Norman D. Verhelst

Predicting the Comprehension of Prose Varying by Difficulty: An Application of the One Parameter Logistic Model

24 Rob van Krieken

Equating National Exams of Reading Comprehension in the Foreign Language

25 Anne Lazaraton

A Qualitative Approach to Monitoring Examiner Conduct in the Cambridge Assessment of Spoken English (CASE)

26 Tony Lee

Taking a Multi-Faceted View of the Unidimensional Measurement from Rasch Analysis in Language Tests

27 Yasmeen Lukmani

Linguistic Accuracy versus Coherence in Assessing Examination Answers in Content Subjects

28 Tom Lumley and Tim McNamara

Rater Characteristics and Rater Bias: Implications for Training

29 Sari Luoma

Validating the Certificates of Foreign Language Proficiency: The usefulness of qualitative validation techniques

30 Paul McCann and Alex Teasdale

A Pilot Study into Task Difficulty Factors in a Test of English for Specific Purposes

31 Tim McNamara and Tom Lumley

The Effect of Interlocutor and Assessment Mode Variables in Offshore Assessments of Speaking Skills in Occupational Settings

32 Michael Milanovic, Nick Saville, and Shen Shu Hong

A Study of the Decision-Making Behaviour of Composition Markers

33 José Noijons

Towards a Checklist for Computer Aided Language Testing (CALT)

34 Kieran O'Loughlin

The Assessment of Writing by English and ESL Teachers

35 Alastair Pollitt

Reporting Reading Test Results in Grades

36 Alastair Pollitt and Neil L. Murray

What Raters Really Pay Attention to

37 Don Porter

Performance Conditions: Stability of Effects Across Cultural Groups and Task Types in the Assessment of Spoken Language

38 John Read

The Development of a New Measure of L2 Vocabulary Knowledge

39 Daniel J. Reed

Procedures for Improving the Predictive Value of Indirect Tests: Study Based on Scores from the OPI and TWE, and an Item-by-Item Analysis of a Multiple-Choice Examination

40 Marcia J. Reeves

The Impact of Rater Quality Control and Training/Retraining Procedures on Rater Reliability in Scoring Test of Spoken English Response Tapes

41 Elana Shohamy, Smadar Donitsa-Schmidt, and Ronit Waizer

The Effect of the Elicitation Modes on the Oral Language Obtained on Language Tests

42 Bernard Spolsky

Testing the English of Foreign Students in 1930

43 Marie-Christine Sprengers

The Feasibility of the Guided Summary as a Test of Reading Comprehension

44 Charles W. Stansfield and Dorry Mann Kenyon

Evaluating the Efficacy of Rater Self-Training

45 Carol Taylor

A Study of Writing Tasks Assigned in Academic Degree Programs

46 Alex Teasdale

Content Validity in Tests for Well-defined LSP Domains: An Approach to Defining What is to be Tested

47 Ludo Verhoeven

Early Bilingualism, Cognition, and Assessment

48 Marjorie Wesche, T. Sima Paribakht, and Doreen Ready

A Comparative Study of Four Placement Instruments

49 Gillian Wigglesworth and Kieran O'Loughlin

An Investigation into the Comparability of Direct and Semi Direct Versions of an Oral Interaction Test

50 Elaine Wylie

An Aspect of the Reliability of the Speaking Test of the International English Language Testing System (IELTS)

51 Susan A. Zammit

Motivation, Test Results, Gender Differences and Foreign Languages: How do they Connect?

Abstracts

Investigating Variability in Tasks and Rater Judgments in a Performance Test of Foreign Language Ability

Lyle F. Bachman, Brian Lynch, and Maureen Mason

Department of TESL/Applied Linguistics, UCLA

Much of the recent debate that has surrounded the development and use of "performance", or "communicative" language tests has focused on a supposed trade-off between two sets of desirable qualities. It has been argued that test tasks and test performance need to correspond in demonstrable ways to non-test language use. This suggests that we should design tests with characteristics such as thematic unity, or coherence, and task dependency, which seem to be features of non-test discourse, and that we place the greatest emphasis on the use of information about content relevance for supporting validity of test use. At the same time, we realize that we are accountable for demonstrating that scores derived from test performance are reliable, and that the inferences about language ability that we make from them are valid.

One area that has been of particular concern with performance tests is the potential variability in tasks and rater judgments, and this has been investigated in the language testing literature with two complementary approaches: generalizability theory (e.g., Bolus et al. 1982, Stansfield & Kenyon 1991) and multifaceted Rasch modelling (e.g., McNamara 1991). GENOVA (Crick & Brennan 1983), which performs generalizability-theory analyses, estimates the relative contribution of variation in test tasks and rater judgments to variation in test scores. FACETS (Linacre 1993), which performs multifaceted Rasch modelling, estimates differences in task difficulty and rater severity, and adjusts ability estimates of test takers, taking these differences into account.
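For orientation, the many-facet Rasch model of the kind FACETS estimates is commonly written in the log-odds form below. The symbols are a conventional textbook notation and are an assumption here, not notation taken from the paper: $P_{nijk}$ is the probability that test taker $n$ receives rating category $k$ on task $i$ from rater $j$, $B_n$ is the test taker's ability, $D_i$ the task's difficulty, $C_j$ the rater's severity, and $F_k$ the difficulty of the step from category $k-1$ to category $k$.

```latex
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Read this way, a more severe rater (larger $C_j$) or a harder task (larger $D_i$) lowers the odds of the higher category for a test taker of fixed ability, which is the sense in which ability estimates are adjusted for these differences.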

In this paper we first discuss the design and development of a foreign language test battery that was designed for two purposes: 1) to place University of California Education Abroad students into programs at universities abroad that are appropriate for their level of language ability, and 2) to provide diagnostic information that will be useful for designing appropriate teaching and learning programs for prospective education abroad students. The test battery consists of four subtests: 1) reading, 2) listening and note-taking, 3) speaking, and 4) writing. All subtests share a common theme or topic, and are interdependent. The listening and reading tests are based on genuine materials; the speaking tasks are semi-direct and are based on the content of the listening test. The materials for the reading test are topically related to the material in the listening test. The writing test is a composition prompt that requires test takers to integrate information from both the listening and reading tests.

We then discuss the results of analyses using GENOVA and FACETS, based on a full field-trial with a group of University of California undergraduate students who had been selected for participation in the Education Abroad Program. Finally, we discuss the implications of these results for the use of G-theory and multifaceted Rasch modelling for the development of performance tests of foreign language ability.

Performance Testing with Speech Recognition: Methods and Validation

Jared Bernstein

Speech Research Program, SRI International

Background

Speech recognition systems improved significantly in the 1980s. One outcome of the new level of recognition performance, particularly speaker independence and vocabulary independence, is that some new applications have become much easier to develop.

Purpose

The research objective was to determine the feasibility of automatically evaluating the pronunciation of English sentences read aloud by native Japanese speakers. We sought a method for automatically deriving pronunciation scores for spoken sentences that correlated well with those assigned by human expert listeners. In principle, such a method would be useful not only for testing students' spoken language skills, but also as an element in a more comprehensive system providing diagnosis and instruction.

Method

A series of projects developed automatic systems to score the speech production of non-native speakers of English. The systems undertook increasingly difficult tasks: first, scoring Japanese adults recorded with a high quality microphone; finally, scoring Japanese middle-school children speaking over normal telephone channels. The automatic grading procedure first aligned the speech with a model and then compared the segments of the speech signal with models of those segments that had been developed from a database of speech from 10,000 native speakers of English. In the final stages of the project, the system was validated on a database of 5,000 utterances from Japanese speakers of English. SRI measured the performance of the automatic systems by calculating a correlation between automatically generated scores and the average of the scores assigned by three expert human listeners.
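The evaluation step described above can be sketched as follows. This is a minimal illustration, not the SRI procedure: it assumes the correlation is a Pearson product-moment coefficient (the abstract says only "a correlation"), the function names `pearson_r` and `machine_vs_human` are hypothetical, and the scores are invented for the example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def machine_vs_human(machine_scores, human_ratings):
    """Correlate machine scores with the per-utterance mean of the human ratings.

    human_ratings holds one tuple of listener scores per utterance.
    """
    human_means = [mean(ratings) for ratings in human_ratings]
    return pearson_r(machine_scores, human_means)


# Hypothetical illustration data (invented, not from the study):
# one machine score and three listener ratings per utterance.
machine = [2.1, 3.4, 4.0, 1.5, 3.0]
humans = [(2, 3, 2), (3, 4, 3), (4, 4, 5), (1, 2, 2), (3, 3, 4)]
r = machine_vs_human(machine, humans)
```

Averaging the three listeners before correlating means the machine is compared with a pooled human judgement rather than with any single rater.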

Results

With high quality speech signals, correlations of overall speech production quality have been as high as 0.9, and about 0.8 with telephone speech. The paper will discuss the relation of signal duration to grading reliability, and its implication for the diagnosis of specific speaking difficulties.

Implications

Primarily, these studies offer hope that meaningful measures of spoken language performance can be generated automatically at low cost and with wide geographic coverage. Among several challenges remaining, the most central is to generalize the methods to work with a wider range of first and second languages.

Cross-cultural Pragmatics in Oral Proficiency Interview Strategies

Richard Berwick and Steven Ross

The University of British Columbia, The University of Hawaii at Manoa

Theoretical background and rationale

The influence of discourse and pragmatic transfer in cross-cultural encounters has received very little consideration in studies on the construct validity of performance tests. With the current emphasis on direct assessment of speaking proficiency following the methodology of the Oral Proficiency Interview (OPI), the potential importance of cross-cultural pragmatics is evident.

Purposes of the research

The present study examines a) possible effects of interviewers' culturally-based communicative and pragmatic systems on the interview discourse and b) potential threats to the validity of the procedure which appear related to interviewers' differential use of accommodation and control strategies.

Research design and methods

Six English as a second language interviews were matched according to proficiency ratings with six Japanese as a second language interviews. We employ Mann-Whitney U Tests to establish categorical comparisons between interviewers for the accommodation and control exponents and then discuss the realizations of these exponents in the interview discourse.
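As a rough illustration of the comparison named above, the Mann-Whitney U statistic for two groups of six interviews can be computed by counting pairs. This is a sketch under stated assumptions: the function name `mann_whitney_u` is hypothetical, the counts are invented, and the abstract does not say how the exponents were tallied per interview.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a exceeds b; ties count as one half.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical counts of one accommodation exponent per interview (invented data).
esl_interviews = [3, 5, 2, 4, 6, 3]
jsl_interviews = [8, 7, 9, 5, 10, 6]
u_esl, u_jsl = mann_whitney_u(esl_interviews, jsl_interviews)
u_stat = min(u_esl, u_jsl)  # the smaller U is the one usually compared with a critical value
```

A U that is small relative to the number of pairs (here 6 x 6 = 36) suggests that one group tends to score higher than the other. Judging significance still requires a critical-value table or a library routine such as `scipy.stats.mannwhitneyu`.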

Results

Tallies of exponents of accommodation and control from the twelve interviews indicate a clear tendency on the part of the Japanese as a second language interviewer to avoid interactional trouble and communication breakdown by providing highly accommodative questioning and topic-maintaining turns. Frequent use of display questions, over-articulation of the content of questions and statements, and lexical simplification were in particular found to characterize the Japanese as a second language interview discourse.

Implications of the results

The study suggests that comparable outcomes according to rating criteria focused exclusively on the speech of the interviewee may not be equivalent in terms of the discourse and accommodation utilized by the interviewers, and that interviewer strategies for avoiding trouble may be linked to underlying cultural and pragmatic phenomena. The major implication is that rater training procedures should include a focus on the speech of the interviewer in determining the extent of accommodation manifest in the interview discourse.

The Dimensionality of a Placement Test Components

Jean-Guy Blais and Michel Laurier

Université de Montréal

Theoretical background and rationale

Most theoretical views on language describe communicative competence as a multi-component construct (Bachman 1990). On the other hand, IRT procedures are commonly used to design language tests for placement, certification or selection purposes. These tests aim at measuring a general proficiency trait and seem to meet the unidimensionality assumption.

Purpose of the research

In order to find the most appropriate group for students enrolling in FSL classes at the post-secondary level, a placement test has been designed. Using IRT procedures, two paper-and-pencil versions were created and a computerized adaptive version has been set up. The purpose of this study is to determine to what extent the unidimensionality assumption is met.

Research design and method

The present test includes three different sub-tests: (1) a reading part that focuses on a general comprehension ability which may be related to the knowledge of text content, (2) a "Fill-the-gap" part that mostly measures grammatical features and in some cases the vocabulary range, and (3) a part where the student must determine the most appropriate statement in a given situation, which refers to various sociolinguistic aspects.

Several methods are used to assess unidimensionality: (a) exploratory analysis with a LISREL model: based on the internal consistency of the sub-tests, we analyze a single-factor and a 3-factor model; (b) factor analysis of the item correlation matrix: a TESTFACT factor analysis is conducted on the whole test and on the three sub-tests; (c) essential unidimensionality study: Stout's (1987) non-parametric approach is used to assess departure from unidimensionality; and (d) multiple calibration: Bejar's (1980) method is used to compare the 3-parameter BILOG calibrations on various parts of the sub-tests.

Results

It seems that even though we may recognize different abilities from a theoretical point of view, the assumption of unidimensionality is generally met within the sub-tests. However, the degree to which a test departs from unidimensionality is not easy to assess and must be established with different methods.

Implications of the results

The results of this study will help us in adding new items to the bank and in creating new components. They will also help us to interpret correctly the general score that is obtained, so that proper placement decisions can be made.

References

Bachman, L.F. 1990: Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bejar, I.I. 1980: A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. Journal of Educational Measurement 17, 283-296.

Stout, W. 1987: A nonparametric approach for assessing latent trait unidimensionality. Psychometrika 52, 589-617.

Using 'Real Life' Academic Challenges for Evaluating Communicative Skills in English

Sarah L. Briggs

The University of Michigan

Theoretical background and rationale

A March, 1992, survey of 113 graduate students (77 in engineering and hard sciences; 36 in humanities and social sciences) at a large research university in the United States revealed that the most difficult tasks the students faced in graduate school involved using spoken English. Knowing when and how to present one's views, asking questions and presenting short talks in small group seminars and discussing academic matters individually with peers and faculty were identified as salient tasks for test development.

Purpose of the research

The intent of this research is to develop an oral assessment procedure which reflects the perceived challenges facing incoming graduate students and to determine whether such a measure is more informative than an individual oral interview.

Research design and methods

A small-group interaction task whose content is linked with an academic writing task was piloted on three groups of incoming graduate students in January and April, 1993. Ratings of speaking performance in the small-group interaction are compared to results on other EAP test measures including an individual oral interview given to incoming graduate students. Subject reaction to the small-group interaction task was collected and analyzed.

Results

Preliminary results indicate that decisions about adequacy of speaking skills are similar whether the decision is based on an individual oral interview or on the small-group interaction task performance. It also appears that performance on the small-group interaction task is related to performance on the academic writing task to which it is linked. Subjects generally react positively to a group method of assessing speaking ability but have concerns about its validity when a student is too challenged by the task.

Implications of the results

While it is feasible, in some circumstances, to assess speaking ability using a small-group interaction task that simulates a 'real life' academic challenge, it may be that the same aspects of communicative ability are revealed in some writing tasks and in oral interviews.

The Effect of Rater Variables in the Development of an Occupation-Specific Language Performance Test

Annie Brown

University of Melbourne

Theoretical background and rationale

In occupation-specific performance tests, where tasks are designed to simulate the actual work situation, it is often appropriate to include not only assessments of general linguistic competence but also assessments involving a perception of professional competence. This has implications for the selection of appropriate judges: should they be language educators or representatives of the profession?

Purposes of the research

This paper explores the effect that rater background has on assessments made in an occupation-specific oral language test, the Japanese Language Test for Tour Guides. This advanced-level test of Japanese as a foreign language involves five role-play simulations of guide-client interaction designed to allow for evaluation not only in terms of linguistic criteria, but also in terms of 'real world' criteria, that is, how successfully the candidate has carried out the simulated tasks required of the professional role.

The context in which the test is to be introduced (i.e., as part of an industry-driven accreditation scheme for tour guides) required the researcher to consider as assessors people with a range of backgrounds: both native and near-native speakers of Japanese, with backgrounds mainly either in teaching Japanese as a foreign language or in tour guiding in Japanese.

Research design and methods

Multi-faceted Rasch analysis (Linacre 1989) provides information on rater behaviour where subjective assessments are to be made. Such an analysis was carried out on assessments of videotaped performances of 51 test candidates made by 33 such assessors in order to determine what effect background has on assessments made on both linguistic and 'real world' criteria.

Results

Differences were found in the pattern of assessment according to rater background. However, these were minor differences and did not point to the unsuitability of any particular group. On the most crucial issues pertaining to acceptability as raters (harshness, reliability, bias, ranking of candidates), given adequate training and explicit assessment criteria there is little evidence that native speakers are more suitable than non-native speakers, or that raters with a teaching background are more suitable than those with an industry background (or vice versa).

Implications

The findings of this analysis have obvious implications for the selection and training of raters for occupation-specific language performance tests as well as contributing to ongoing research on rater characteristics in performance assessment.

Reference

Linacre, J.M. (1989) Many-facet Rasch Measurement. Chicago: MESA Press.

Decision Dependability of Item Types, Sections, Tests, and the Overall TOEFL Test Battery

James Dean Brown and Jacqueline A. Ross

University of Hawaii at Manoa, Educational Testing Service

Theoretical background and rationale

The reliability of the TOEFL battery and its component parts has repeatedly been shown to be high. In addition, the standard error of measurement has long been used as a means for estimating the average unreliable variance across all scores.

Purposes of the research

The purpose of this study was to examine the reliability of TOEFL in several new ways. We investigated the relative contributions to score dependability (analogous to classical theory reliability) of various numbers of items and subtests as well as the decision dependability at different cut points. To do so, four research questions were formulated. These research questions apply to the overall TOEFL battery as well as to its various tests and subtests:

1. What are the classical theory reliability estimates?

2. What are the relative contributions to error variance of persons, items, subtests, and their interactions?

3. What is the dependability for varying numbers of items and subtests?

4. What is the effect on score dependability of various cut-points?

Research design and methods

The study was based on the item responses of 20,000 test takers from 15 different language backgrounds. The data were collected from the May 1991 administration of the TOEFL at foreign and domestic test centers. The first test in the TOEFL battery is a listening test including three item types: statement items, dialogue-based items, and minitalk items. The second test covers two item types: structure and written expression. The third test consists of vocabulary and reading comprehension items.

Results

The results include descriptive statistics, classical theory reliability estimates, generalizability theory, and decision dependability [phi(lambda)] estimates for various cut points.
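
As an illustration of what a phi(lambda) estimate involves, the sketch below applies Brennan's short-cut estimator for a simple persons-by-items design with dichotomously scored items. This is an assumption made for illustration: the abstract does not say which estimator or design was used, and the response matrix and cut points here are invented toy values.

```python
from statistics import mean, pvariance


def phi_lambda(responses, cut):
    """Brennan's short-cut estimate of phi(lambda).

    responses: one list of 0/1 item scores per person (hypothetical data)
    cut: cut score expressed as a proportion correct (lambda)
    """
    n_items = len(responses[0])
    # Each person's proportion-correct score
    props = [sum(person) / n_items for person in responses]
    m = mean(props)
    s2 = pvariance(props)  # population variance of proportion scores
    return 1 - (1 / (n_items - 1)) * (m * (1 - m) - s2) / ((m - cut) ** 2 + s2)


# Toy data: 5 persons by 4 items (invented, not TOEFL data)
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

at_mean = phi_lambda(toy, 0.5)   # cut point at the mean score
off_mean = phi_lambda(toy, 0.8)  # cut point away from the mean
```

With this estimator, dependability is lowest when the cut point sits at the mean of the score distribution and rises as the cut point moves away from it, which is why results are reported separately for each cut point.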

Implications of the results

The implications are discussed in terms of the dependability of using various combinations of TOEFL total, test, and subtest scores, as well as the dependability of decisions made at various cut points. Such issues are important because high decision dependability is a precondition for attaining high "systemic" validity.

A New Cognitive Approach for the Assessment of Language Aptitude

Eduardo C. Cascallar

Center for the Advancement of Language Learning

This paper will explore the rationale and suggested uses of various cognitive processing measures to predict language learning aptitude. The direct application of this type of testing is intended to be in the context of overall assessment for personnel selected to participate in language training at various intensive adult language programs. In addition, it is expected that these same variables, and their theoretical underpinnings, will also be applicable to language learning and assessment in general.

The new instrument to be presented includes several cognitive variables which have been chosen based on evidence of their important participation in the processing of language information, and the central role they play in the overall functioning of the cognitive system. These cognitive variables represent cognitive abilities inherent to the system, and underlying both low and high level cognitive functions. It has been shown that individuals differ fundamentally in the components of a four-factor model of the cognitive ability space. This model has been used to predict success in a variety of learning tasks. These sources make independent contributions to a variety of learning outcomes and they account for much of the variability in learning success. Recently, cognitive psychologists have also begun to consider the processes by which students acquire new cognitive skills. In the context of some testing programs, it is important to identify some of the cognitive factors that relate specifically to the performance of the examinees, and that are relevant in the prediction of future performance in language training.

Working memory capacity is one factor that can help explain test performance. Recent developments in the conceptualization of working memory capacity have noted its significance as a determinant of efficiency in cognitive processing. Processing speed is also examined in relation to the capacity measures. The roles of individual differences in working memory capacity and speed of processing are examined as they relate to performance in the different tests in the battery. Similarly, declarative and procedural knowledge are evaluated to assess the current state of network structure and of procedural skills. Declarative and procedural learning are assessed to estimate the effectiveness of the system to integrate new inputs into the network and into added procedural lists. The development of the instrument, theoretical underpinnings, and initial results will be discussed.

New Technologies for the Assessment of Oral Proficiency in Second Language Learners

Eduardo C. Cascallar and Marijke I. Cascallar

Center for the Advancement of Language Learning, Federal Bureau of Investigation

Theoretical background

This study examines the assessment of oral proficiency and pronunciation in language programs of the US Federal Government. The research included the validation of the TSE (Test of Spoken English) against ratings in the current version of the Interagency Language Roundtable (ILR) Oral Proficiency Interview, and against supervisors' evaluations of the participating government employees. The analyses also take into account several other background variables. It was important to obtain an accurate and immediate indication of language proficiency, as well as to suggest what type of further training might be required. To achieve the goal of the automated evaluation of pronunciation, the primary task was to implement a system that takes a short sample of speech, spoken and recorded in uncontrolled telephone conditions.

Method

A total of 200 non-native English speaking adult subjects were evaluated with the OPI and the TSE. Reliability estimates were obtained for the ratings between the trained government raters, and between the government raters and ETS ratings of the same tapes. The TSE tapes were double-scored at each site. Validity was examined using the OPI scores obtained from each examinee. In addition, pronunciation was scored automatically with a computer implementation, using digitized speech from the subjects' tapes. Correlations between human and automated scores of pronunciation will be discussed.

Results

Evidence of a strong positive correlation between TSE (Overall Comprehensibility) scores and OPI ratings has been found. This was achieved without the personnel and logistical costs of the traditional OPI testing. In addition, diagnostic scores for pronunciation, grammar, and fluency were also included in the assessment. The TSE appears to be a good indicator of the adequacy of the examinees' spoken English proficiency for those situations that have been examined. The variance explained by each of the rated dimensions (e.g., pronunciation) in assessing overall intelligibility of the candidate's spoken English was determined.

Implications

Discussion of the results will focus on the constructs of pronunciation, fluency and comprehensibility, and will highlight the relationships found in terms of language level and group. Implications for further test development, "special purpose oral proficiency testing", and training will also be discussed. The technology involved in the study was determined to have practical applications for the program, and it provides a desirable alternative to the current evaluation of pronunciation by human raters.

Development of New Proficiency and Performance Based Skill Level Descriptors for Translation: Theory and Practice

Eduardo C. Cascallar, Marijke I. Cascallar, Pardee Lowe, and James Child

Center for the Advancement of Language Roundtable, USA

The US government has had a long tradition in translation testing of applicants for translator positions. In the past ten years several government agencies have attempted to develop reliable and valid measures which could yield accurate information regarding applicants' translation ability. In addition, these measures were needed to provide relevant cut-off scores and information pertinent to the selection of personnel capable of addressing the translation requirements of the government language programs. Three sets of translation skill level descriptions were produced in the past ten years. The reason for this continued effort was an attempt to refine criteria statements in order to make them more accurate and useful for the assessment of translation proficiency and performance. The most recent set of translation skill level descriptions has been developed in collaboration with the Testing Committee of the federal Interagency Language Roundtable and the Center for the Advancement of Language Learning, in a project partially sponsored by Educational Testing Service.

This presentation will describe the history of the previous efforts to develop the translation criteria, and the new level descriptors recently developed as part of the project reported here. These developments are based on a new theoretical model that conceptualizes the process of translation. This model also addresses the need to account for the proficiency levels in both donor and receptor languages. Therefore, the new translation criteria are based on the assessment of proficiency in reading the donor language, proficiency in writing in the receptor language, and the evaluation of a new "congruity scale" for the product of the translation. It is assumed that translation is an integrative skill, and that the ability to make congruity judgments and to manifest them in the rendering is unique to translation. Recent studies that provide experimental evidence for the cognitive processes involved in translation, and which provide confirmatory evidence for the model proposed, will also be presented.

All skill level descriptors and the new Congruity Scale will be explained. The impact of this new instrument in government procedures and assessment of translator applicants will also be addressed. The audience will be encouraged to participate and discuss the new scales.

Performance Assessment and the Components of the Oral Construct Across Different Tasks and Rater Groups

Micheline Chalhoub-Deville

The Ohio State University

Rationale

New perspectives on L2 oral testing increasingly call for more performance-based tests. Performance oral testing, however, presents a new set of problems, central to all being the issue of validating scores obtained from these tests. What do total scores obtained from L2 oral tests mean?

Purpose

The purpose of the study was to derive the dimensions underlying L2 holistic oral scores of learners of Arabic from three elicitation tasks (interview, narration, and read aloud) across three different rater groups.

Methodology

The three rater groups were: native speakers (NSs) who teach Arabic as a foreign language to adults in the U.S.; non-teaching NSs who have been residing in the U.S. for a period of at least one year; and non-teaching NSs living in Lebanon. The present paper also investigated whether the three rating groups used dimensions differentially to assess the L2 learner's speech.

Holistic scores were analyzed using the INDSCAL model within the ALSCAL multidimensional scaling (MDS) program on SPSS (1990). Unidimensional scale ratings, obtained from each rater, along with speech sample analysis were used to aid explication of the MDS solutions.
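
For readers unfamiliar with INDSCAL, the model is usually stated as a weighted Euclidean distance in a shared stimulus space. The mapping of "stimuli" to speech samples and of "sources" to raters or rater groups is an assumption about this study's setup, not something the abstract states.

```latex
% x_{ia}: coordinate of stimulus (speech sample) i on shared dimension a
% w_{ka}: weight that source k (e.g., a rater or rater group) gives dimension a
% d_{ij}^{(k)}: modelled dissimilarity between stimuli i and j for source k
d_{ij}^{(k)} = \sqrt{\sum_{a=1}^{A} w_{ka}\,\bigl(x_{ia} - x_{ja}\bigr)^{2}}
```

Differences in the weights w_{ka} across sources are what allow the analysis to show whether rater groups rely on the same dimensions to different degrees.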

Results

Three dimensions emerged in the ALSCAL solution. Results corroborated the documented effect of elicitation task on speech products. Analyses also showed that the three rating groups used dimensions differentially to assess the L2 learner's speech.

Implications

This study argues that performance assessment should not employ generic, a priori component scales but that scales need to be empirically derived according to the given task and audience. In addition, results indicate that our knowledge of how NSs assess non-native speech is lacking, and thus in need of improvement for the purpose of training teachers and developing scales.

Summary Cloze: What is it? What does it measure?

Robert Courchêne and Doreen Ready

Second Language Institute, University of Ottawa

In this paper the authors report on a study using a new testing format, summary cloze, to determine (1) how it functions as a measure of reading ability when compared with the multiple-choice format and (2) how it correlates with other measures of language proficiency: listening, reading, skimming and scanning, and canonical cloze.

Summary cloze passages are constructed by writing a summary, one-third the length of the reading passage, that contains the essential meaning but uses different lexical and syntactic forms to express it. Words (usually lexical/cohesive entries) are then deleted from the summary based on the rational-cloze model. Students are provided with a list of possible choices that exceeds the number of answers in a ratio of 5:4. To complete the task, students must first read the original passage and then use this knowledge to complete the summary cloze.

Five reading texts had been prepared with multiple choice (MC) questions and a summary cloze (SC) version. The subjects were 66 Chinese students at intermediate to advanced levels of ESL matched for language ability and then randomly assigned to one of two groups. Each group did two texts in one format and two in the other and both groups did both format versions of the fifth text. After each reading the subjects completed a Reader Assessment Questionnaire focusing on text difficulty, background knowledge and interest.

Statistical analyses included comparisons of the mean scores for both question formats for each text and each Reader Assessment Questionnaire. Correlations were done using the scores obtained for each format by text with other measures of language proficiency. Internal consistency measures of reliability were compared for each question format within each text.

Results indicate that the summary cloze yields scores, and correlations with other measures of language proficiency, similar to those of the multiple-choice format, while providing more consistent estimates of internal reliability.

Comparing Test Difficulty and Text Readability in the Evaluation of an Extensive Reading Programme

Alan Davies and Aileen Irvine

University of Edinburgh, Department of Applied Linguistics

It is commonly asserted that the content dependence of achievement tests puts upper limits on the validity of which they are capable. The research reported in this paper considers the possibility that this assertion may not always be justified.

The paper describes the construction and validation of a set of achievement tests of extensive reading. The tests were devised for the use of the Hong Kong Education Dept. to assess progress on their extensive reading scheme. The scheme consists of 2000+ published titles (known as 'readers'), graded into 8 levels based on publishers' allocations, ranging from non-simplified to simplified. An experiment involving an interlocking design of alternate reading comprehension tests at each of the 8 levels is described.

Logit item difficulty values derived from a Rasch one-parameter analysis (using the Quest program) indicate that the tests of the two least simplified and the two most simplified texts discriminated as intended but the 4 middle tests in the sequence discriminated less well. Such a finding, it is suggested, may be the result either of faulty test construction or of misplacement in the original allocation of readers to the 8 levels. To test the latter possibility the texts used in the 8 reading comprehension tests were measured for readability on the Gunning-Fog Index. The product moment correlation between the reading scheme allocation of levels and the readability measures was 0.73; that between the scheme allocation and the mean difficulties of the tests 0.88.
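
The Gunning-Fog Index mentioned above can be sketched as follows. The formula is the standard one; the word, sentence, and complex-word counts are invented for illustration, and how "complex" words were counted in the study is not stated in the abstract.

```python
def gunning_fog(n_words, n_sentences, n_complex_words):
    """Gunning-Fog readability index.

    n_complex_words: words of three or more syllables (counting rule assumed).
    """
    average_sentence_length = n_words / n_sentences
    percent_complex = 100 * n_complex_words / n_words
    return 0.4 * (average_sentence_length + percent_complex)


# Hypothetical counts for two text samples (not taken from the study)
simplified_text = gunning_fog(n_words=100, n_sentences=10, n_complex_words=2)
unsimplified_text = gunning_fog(n_words=100, n_sentences=5, n_complex_words=10)
```

A higher index indicates a harder text, so under the scheme's intended ordering the index should rise from the most simplified level to the non-simplified one.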

It is concluded that while there are problems with appropriate ordering of test difficulty, the test difficulty levels do compensate to some extent for the inadequacy of the allocation of readers to the 8 levels. Reasons for the misallocation of readers to the 8 levels are considered and, given the claim for the built-in redundancy of the tests, the practical use in Hong Kong of the most valid sub-set of the tests is considered.

Large Scale Testing of Oral Proficiency

John de Jong, Jan Mets, and Fellyanka Stoyanova

CITO, Dutch National Institute for Educational Measurement

Background and rationale

In December 1991 the Dutch Education Secretary tendered a project for the development, construction, and administration of National Examinations of Dutch as a second language. The examinations were to yield certificates for reading, writing, listening, and speaking and to be attuned to two distinct domains of language use; the first associated with vocational training and occupational settings, the second with higher education and the professions.

Purposes of the research

The project entailed, quite apart from the development of the tests proper, the definition of the construct, procedures to ensure social acceptance of the certificates to be awarded to successful candidates, the distribution of example tests, and the transparency of the examination system. In this paper the specific goals were (1) to evaluate the operation of the speaking test and its relation to the other skills, (2) to evaluate the dimensionality of the total set of tests, and (3) to investigate the relation of the passing scores for the four skills.

Research design and methods

This paper focuses on the development and research of the tests for speaking. The following procedures were used: definition of subskills, choice of appropriate item format, and pilot testing. During the pilot testing several rating procedures were tried out and a rater training program was developed. Moreover, the pilot testing made use of a section pre-equating design in order to link the new exams to existing measures of Dutch as a second language available in the Netherlands and in Belgium. The selection and calibration of items was based on Item Response Theory, using a member of the family of one-parameter Rasch models. This specific member, operationalized in the computer program OPLM, allows for Conditional and Marginal Maximum Likelihood estimation of item difficulty parameters for both dichotomously and polytomously scored items and, additionally, the imputation of empirically derived discrimination parameters.
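
A minimal sketch of the item response function for a dichotomous item under a model of this kind is given below. It assumes the usual logistic form in which the discrimination index is a fixed constant supplied by the analyst rather than an estimated parameter; the function name and the parameter values are illustrative only, and the polytomous case is not shown.

```python
import math


def response_probability(theta, difficulty, discrimination=1):
    """Probability of a correct response to a dichotomous item.

    theta: person ability (logits)
    difficulty: item difficulty parameter (estimated)
    discrimination: fixed index imputed by the analyst, not estimated;
        a value of 1 for every item gives the ordinary Rasch model
    """
    return 1 / (1 + math.exp(-discrimination * (theta - difficulty)))


# Same ability and difficulty, two different imputed discrimination indices
p_rasch = response_probability(theta=1.0, difficulty=0.0, discrimination=1)
p_steeper = response_probability(theta=1.0, difficulty=0.0, discrimination=2)
```

Because the discrimination indices are treated as known constants, the difficulty parameters can still be estimated by conditional maximum likelihood, as in the Rasch model.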

Results

Using this design it was possible, even before the first test was administered, to set passing standards that could be related to known levels of ability and that were thus transparent and acceptable to the distinct groups who are to make use of the tests, viz., the candidates, their teachers, receiving institutes for further education, and employers. In a post-hoc analysis it was found that score differences on the tests for all four skills could be expressed along a single dimension and that the passing scores differed in the amount of ability required along this dimension.

Implications

The paper concludes with an evaluation of the development of the oral proficiency tests and a discussion of a number of indicators used to estimate interrater reliability and the scaling procedures. The procedure to establish the relationships between the measures for listening, reading, speaking, and writing will lead to further research into a model of language proficiency. More directly, the observed differences in ability required for each of the four skills call for an adaptation of the passing scores in future versions of the tests.

A Methodology for Comparing ESL Admissions Tests

Margaret Des Brisay

University of Ottawa

An increasing number of programs and institutions have developed tests of English for Academic Purposes to be used in making admissions decisions at North American universities. It is not unreasonable for admissions officers to request information that will enable them to compare scores from a new and unfamiliar test with scores from the tests they have traditionally used. It is important, however, that the right questions be asked and this is not always the case.

What admissions officers frequently want is a conversion table calibrating scores from different tests, whereas the real question is not how well two tests measure each other but how well each test measures the construct of interest. To answer this question, researchers have focused their efforts on investigating the construct validity of different EAP tests (Bachman et al., 1988).

Nevertheless, test scores are used as a basis for action and it is important to provide decision makers with information that has applied utility. Standard equating methods cannot be used as the assumptions basic to their derivation cannot be fulfilled (Mislevy, 1992). This paper specifies a methodology for data collection and compares appropriate statistical methods for data analysis, including estimates of decision consistency, decision agreement, and shared construct-relevant variance. The studies on which this paper is based involved four groups of examinees (totalling 250) who wrote both the Test of English as a Foreign Language (TOEFL) and one other EAP-type test.
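
One common way to quantify decision agreement between two tests is the proportion of examinees given the same admit or reject classification, with Cohen's kappa as a chance-corrected version. The sketch below assumes that approach; the abstract does not say which indices were used, and the scores and cut scores are invented.

```python
def decision_agreement(scores_a, scores_b, cut_a, cut_b):
    """Return (proportion agreement, Cohen's kappa) for pass/fail decisions.

    scores_a, scores_b: scores of the same examinees on two tests
    cut_a, cut_b: hypothetical passing scores on each test
    """
    pass_a = [score >= cut_a for score in scores_a]
    pass_b = [score >= cut_b for score in scores_b]
    n = len(pass_a)
    observed = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    rate_a = sum(pass_a) / n
    rate_b = sum(pass_b) / n
    # Agreement expected by chance given each test's pass rate
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    # Kappa is undefined when chance agreement is already perfect
    kappa = None if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Invented scores for six examinees on a TOEFL-like scale and another test
toefl_like = [480, 520, 550, 570, 600, 610]
other_test = [55, 58, 72, 65, 80, 85]
agreement, kappa = decision_agreement(toefl_like, other_test, cut_a=550, cut_b=70)
```

Unlike a score conversion table, these indices speak directly to the question an admissions officer faces: how often the two tests would lead to the same decision.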

References

Bachman, L.F., Vanniarajan, A.K.S. and Lynch, B. (1988). Task and ability analysis as a basis for examining content and construct comparability in two EFL proficiency test batteries. Language Testing, 5, 128-159.

Mislevy, R.J. (1992). Linking Educational Assessments: Concepts, Issues, Methods and Prospects. Princeton, NJ: Educational Testing Service.

Acquisition versus Learning as Variables in the Trait Structure of L2 Test Data Sets

Margaret Des Brisay, Lise Duquette, and Mohamed Dirir

University of Ottawa, University of Ottawa, University of Massachusetts

The unidimensionality of language test data has been a contentious issue for over 15 years with most studies reaffirming a unitary factor statistical model for language ability. Even when more than one factor was found, additional factors appeared to be neither construct nor skill-related (Davidson, 1988).

This paper reports the results of factor analyses undertaken to investigate the appropriateness of IRT models to language test data from two versions of a test of Français Langue Seconde and two versions of a test of English as a Second Language (N = 671 to 831). The tests are written to the same set of specifications and used for the same purpose.

Data were analyzed using both linear (SAS) and non-linear factor analysis. The non-linear model used was NOHARM (Fraser, 1988). Non-linear analysis is preferred for dimensionality assessment as it does not assume linear relationships among the variables but gives loadings which are probabilistically related to the construct. While unidimensional models do fit subpopulations assumed to be more homogeneous with respect to learning experience, both linear and non-linear models suggest a two-factor (ESL) or three-factor (FLS) solution for the entire data sets.

In most studies, examinees have had fairly uniform language learning opportunities and differences in ability reflect the extent to which individuals were able to profit from these opportunities. In the present study, examinees were native speakers of one or the other of the two languages being tested but were much more heterogeneous with respect to the process by which they had acquired L2: immersion programs, regular high school programs, bilingual families, frequent contact with native speakers. Difficulty in satisfying model assumptions and obtaining invariance of item parameters motivated efforts to relate the factorial structure of these test data to the language learning experiences of the examinees. Although no systematic relationship was found between item performance and strategies hypothesized to characterize each of the two classes of learners, certain response patterns in the cloze section merit further investigation.

References

Davidson, F. G. (1988). An Exploratory Modelling Survey of the Trait Structure of Some Existing Language Test Datasets. Unpublished Ph.D. dissertation, University of California at Los Angeles.

Fraser, C. (1988). NOHARM: A computer program for fitting both unidimensional and multidimensional normal ogive models of latent trait theory. Armidale, Australia: The University of New England, Centre for Behavioral Studies.

Modified Scoring, Traditional Item Analysis, and Sato's Caution Index Used to Investigate the Reading Recall Protocol

Craig W. Deville and Micheline Chalhoub-Deville

The Ohio State University

Rationale

The written recall protocol is increasingly being used in second language reading research as a measure of comprehension. To date, however, the only reliability analyses reported for this procedure have been intra/inter-rater consistency estimates. Although the recall protocol is an essay-like instrument, the total score derived is based entirely on summing the discrete propositions correctly recalled. Consequently, item and reliability analyses similar to those run on multiple choice tests can and should be performed on recall protocols and on other integrative measures.

Purpose

The purpose of this study is to demonstrate how modified scoring, item and internal consistency analyses, along with Sato caution indices, can be used to evaluate the quality of the recall protocol as a reading comprehension measure. Issues concerning the assumptions underlying classical local independence are discussed with regard to the reading recall protocol and other integrative measures.

Methodology

Classical test theory measures, Sato caution indices, and tests of unidimensionality were applied to the recall protocols of 56 subjects who had read a short prose passage in German.
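
To make the caution index concrete, the sketch below computes it in its usual covariance form: one minus the ratio of the covariance between a person's response pattern and the item totals to the covariance between the matching Guttman pattern and the item totals. Treating each scored proposition as an "item" follows the abstract; the data matrix and function names are invented for illustration.

```python
def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def caution_index(responses, person):
    """Sato's caution index for one person in a 0/1 response matrix."""
    n_items = len(responses[0])
    # Number of persons who recalled each proposition (item totals)
    item_totals = [sum(row[j] for row in responses) for j in range(n_items)]
    observed = responses[person]
    score = sum(observed)
    # Guttman pattern: same total score, all credit on the easiest items
    easiest_first = sorted(range(n_items), key=lambda j: -item_totals[j])
    guttman = [0] * n_items
    for j in easiest_first[:score]:
        guttman[j] = 1
    denominator = covariance(guttman, item_totals)
    if denominator == 0:
        return None  # undefined for zero or perfect scores
    return 1 - covariance(observed, item_totals) / denominator


# Invented matrix: 4 readers by 4 propositions (1 = proposition recalled)
recall = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],  # recalls only the propositions most readers missed
]
indices = [caution_index(recall, p) for p in range(len(recall))]
```

An index of zero means the pattern matches the Guttman ideal for that score; larger values flag readers whose recall pattern departs from the group's ordering of proposition difficulty.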

Results

Results indicate that the procedures can be applied and can yield interpretable results. These results need to be replicated using other texts and other weighting systems.

Implications

Only when recall protocols are routinely subjected to item and reliability analyses comparable to those performed on other measures can the instrument be considered a viable alternative. In addition, the procedures described in the present paper need to be applied to other integrative measures that utilize summated item scores.

Are Raters' Judgements of Language Teacher Effectiveness Wholly Language-Based?

Catherine Elder

University of Melbourne

Theoretical background and rationale

The role of factors other than language in the assessment of face-to-face interaction is widely acknowledged but underresearched. In occupation-specific performance tests where raters are drawn from the relevant occupational area it is likely that perceptions of professional competence will have a bearing on language proficiency assessments.

Purposes of the research

The purpose of the research is to examine the extent to which occupation-specific factors are separable from linguistic ones in the assessment of foreign language teacher proficiency and to consider the consequences of including these factors in the assessment process.

Research design and methods

The context for the study is a recently-developed performance test for the accreditation of Italian language teachers, in which speaking tasks are designed to reflect the linguistic demands of the teaching situation. Assessment criteria on this test invite judgements not only about general linguistic skills but also about classroom competence (i.e., the extent to which candidates, in their execution of test tasks, succeed in creating the conditions for language learning).

Seventy-five videotaped samples of test performance were assessed at least twice by 15 trained raters (native and near-native Italian speakers with language teaching expertise), with each rater marking a common core of tapes. Multifaceted Rasch analyses and correlations were carried out on the data to determine a) whether assessments on linguistic and classroom proficiency criteria fit together to define a single underlying measurement trait, b) whether ability estimates based on the classroom competence criteria match those derived from linguistic rating categories, and c) whether the two sets of criteria together produce consistent estimates of candidate ability.

Results

Results show that linguistic and other occupation-specific factors are only partly related and that each makes an independent contribution to estimates of candidate ability. The relatively large percentage of 'misfitting' ability estimates identified in the data output, however, suggests that at least for some candidates the combination of the two sets of criteria may be problematic. Explanations for this finding are offered and some practical solutions proposed.

Implications

The paper concludes with a discussion of the validity of applying 'real world' (occupation specific) criteria to the measurement of teacher competence in particular and to performance tests generally.

Test Answers as Indicators of Mental Model Construction

Claire Gordon and David Hanauer

The Open University of Israel

Purpose of Research

The purpose of this research is to investigate the interrelationship between mental model construction and test responses. The basic hypothesis of this study is that the reader's mental model continues to develop throughout the test-taking process. Thus, testing tasks have the potential of affecting the on-going construction of the test-taker's mental model.

Theoretical Background

Within the framework of this study the ability to comprehend a text is considered to be the ability to construct a mental model of the text within the cognitive domain of the individual (Gernsbacher 1990; Kintsch 1988; Flower 1987; Meutsch 1986; Johnson-Laird 1983). Current approaches to testing emphasize a more valid evaluation process which takes into account the process of comprehension and not solely the product of comprehension.

Research Design

In order to study the relationship between responses on reading comprehension tests in EFL in different formats and the test-taker's mental model construction, an exploratory qualitative study was designed. "Think aloud" data were obtained as subjects responded to both multiple choice and open ended comprehension test questions written in both L1 and L2, on an EFL reading text. The subjects were 28 tenth-grade high school students studying English as a foreign language.

Results

The analysis of the verbal protocols revealed that testing tasks function as an additional information source which interacts with the continuing development of the mental model.

References

Flower, L. 1987. Interpretive acts: Cognition and the construction of discourse. Poetics, 16, 109-129.

Gernsbacher, M.A. 1990. Language Comprehension as Structure Building. New Jersey: Erlbaum.

Johnson-Laird, P.N. 1983. Mental Models. Cambridge: Harvard University Press.

Kintsch, W. 1988. The role of knowledge in discourse comprehension: A Construction-Integration model. Psychological Review, 95, 163-182.

Meutsch, D. 1986. Mental models in literary discourse: Towards the integration of linguistic and psychological levels of description. Poetics, 15, 307-331.

Cognitive Aspects of Performance and Assessment of the Meno Communicative Skills Components

Alison J.K. Green

University of Cambridge Local Examinations Syndicate

Theoretical background and rationale

Writing both requires thinking and is a vehicle for thinking. If an individual can write well on a particular topic, one would assume that that individual can also think effectively about that topic. Marking is also a cognitive skill, and it requires an examiner to think effectively.

Two writing tests currently being developed at UCLES are the focus for the studies reported here. The first (Task Directed Writing) requires students to write, selecting relevant information from that provided, about a given topic for a certain audience. The second (Critical Writing) requires students to read and critically evaluate an argument presented to them, and to generate further arguments of their own for or against the argument presented.

The theoretical framework

Writing is not a unitary skill - there are different types of writing, and different sets of cognitive processes configure in order to meet the demands of various writing tasks. To the extent that writing is a cognitive skill then, research on skilled performance in other domains should inform us on what skilled writers might do and on how writing skills develop. Similarly, examining is not a unitary skill - it requires processes of comprehension, interpretation and evaluation.

Purposes of the research

The aim of the first study was to evaluate the materials and ensure that the two components were distinct. It was also important to examine the influence of materials on performance for each component. The aim of the second study was to identify strategies examiners use and to develop a model of good marking practice. The principal aim of the third study was to examine simultaneously all sources of error in assigning grades.

Research design and methods

Ten TDW and six CW scripts were selected for the protocol exercise. These were distributed amongst the four examiners so that each examiner marked three CW scripts and five TDW scripts. Examiners were asked to "think aloud" as they marked the essays. One principal examiner marked all 10 TDW and all 6 CW scripts.

Sixteen Access students then completed three of each type of essay, giving 96 essays to be marked. The same group of markers then marked these essays. Each essay was marked three times. Again, one principal examiner marked all 96 essays.

Results

Preliminary results suggest that the materials demand different skills, and that the CW task tends to yield lower grades than the TDW task. Performance across the TDW tasks was more variable, suggesting a materials effect. Examiners exhibited difficulties in differentiating grades and in using the mark schemes. Verbal protocols suggest that the cognitive demands involved in using these complex mark schemes were one factor that served to reduce the reliability of grades.

Implications of the results

The two writing tasks imposed a different set of cognitive demands on both students and examiners. Accurate assessment of candidates' abilities appears to be complicated by effects of materials and by the complexity of the mark schemes. Some ways to overcome these problems will be suggested.

Applying Ethical Standards to Portfolio Assessment in ESL

Liz Hamp-Lyons

English Department, University of Colorado, Denver

Moss (1992) argues that divisions among assessment researchers and practitioners are based not on technical considerations but on social ones, on the relative weight we give to the consequences of our assessment practices for equitable teaching and learning. As the programs of the 1992 meetings of AERA and NCME show, performance assessment is the direction of the present and future. Given the importance of concerns about equal access to education, and about test bias, for the second language learning/testing context, it is surprising that language testing has been lagging behind. When ILTA, the International Language Testing Association, was formed in March 1992 in Vancouver, its declared purpose was "to promote the improvement of language testing throughout the world." Happily, the theme for LTRC in Cambridge, "Performance Testing," reflects that purpose, acknowledging that as language testers become increasingly concerned with accountability in addition to the more traditional concerns of placement, achievement and aptitude, we, like our education colleagues in most other fields, increasingly turn to performance assessment measures.

At the National Symposium on Assessment of Limited English Proficient Students held by the U.S. Dept. of Education, Baker (1991) cited developments in writing assessment, particularly portfolio assessment, as the best source of data currently available on the application of performance assessment with bilingual learners. At the same meeting I discussed considerations and methods for ethical assessment of ESL writing. In this paper I will turn to a consideration of ways in which we can ensure the ethical assessment of ESL writing using portfolio assessment methods.

I will consider Linn, Baker and Dunbar's (1991) criteria, expanding on them with criteria from Cronbach (1990) and Frederickson and Collins (1989), linking these with the notions of local knowledge (McKendy, 1992) and expert/non-expert judgment (Linacre, Engelhard, Tatum & Myford, in press; Lunz, Wright & Linacre, 1990; McNamara & Adams, 1991; Stahl & Lunz, 1992). I will apply these ethical criteria to a data set of ESL portfolios from an operational portfolio assessment. I will illustrate the feasibility of developing models for ethical performance assessment for the ESL context. I will also evaluate results of the study, which generally support claims that portfolio assessment reduces test bias against nonnative speakers of the language in writing assessments, while revealing continuing cautions about claims of freedom from bias in portfolio or other performance assessments.

The Effect of Prior Knowledge on EAP Listening Test Performance

Christa Hansen and Christine Jensen

Applied English Center, University of Kansas

We have developed an English for academic purposes test of listening comprehension to be used for proficiency/placement decisions for university-level nonnative speakers of English.

The format of the test--situational context set before listening, a series of dialogues on a continuing theme, mini-lectures delivered by university professors, short-answer questions to be answered in real-time--is based on theoretical considerations of discourse comprehension (van Dijk and Kintsch, 1983) and second language listening comprehension (Voss, 1984; Buck, 1990). We use detail questions and global questions to assess listening comprehension.

The research question that we are exploring is whether prior knowledge of a topic is a significant factor in listening comprehension performance on lectures. We hypothesize that listening comprehension proficiency moderates or mediates the effect of prior knowledge on lecture performance: prior knowledge of a topic would not have a significant effect until listeners achieve a high level of proficiency in English.

We have piloted 4 versions of the test with 7 different lecture topics on a population of 125-250 listeners for each administration. We will be examining the effect of prior knowledge of topic, as reported by the listeners, on their performance on the lecture. To do so, we will conduct multiple regressions on the different lectures using prior knowledge of topic, listening proficiency, and the interaction between prior knowledge and listening proficiency as predictors of lecture outcome to see 1) whether prior knowledge affects lecture performance and 2) whether listening proficiency moderates the effect of prior knowledge on lecture performance. We will also run path analyses to determine if listening proficiency mediates the effect of prior knowledge on lecture performance.
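The regression design described above can be sketched in code. The following is a minimal illustration only, not the authors' analysis: it fits lecture score on prior knowledge, listening proficiency and their product by ordinary least squares. The function names, the mean-centring of the predictors and the use of plain normal equations are assumptions made for the example. A non-zero interaction coefficient is what a moderating effect of proficiency would look like in such a model.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_moderated_regression(prior, proficiency, score):
    """Least-squares fit of score on prior, proficiency and their product.

    Predictors are mean-centred (an assumption of this sketch) so that the
    main effects are read at the average level of the other predictor.
    """
    n = len(score)
    mean_p = sum(prior) / n
    mean_q = sum(proficiency) / n
    rows = [[1.0, p - mean_p, q - mean_q, (p - mean_p) * (q - mean_q)]
            for p, q in zip(prior, proficiency)]
    k = 4
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, score)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    names = ["intercept", "prior", "proficiency", "interaction"]
    return dict(zip(names, coef))
```

In practice a statistics package would also report standard errors and significance tests for each coefficient, which this sketch omits.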

Our expectation is that listeners will not be able to successfully access their prior knowledge of a topic to help them comprehend the lecture material until they reach high levels of listening proficiency.

References

Buck, G. 1990: The testing of second language listening comprehension. PhD dissertation, University of Lancaster, England.

Van Dijk, T.A. and Kintsch, W. 1983: Strategies of discourse comprehension. New York: Academic Press.

Voss, B. 1984: Slips of the ear. Investigations into the speech perception behavior of German speakers of English. Tübingen: Narr.

Predicting the Comprehension of Prose Varying by Difficulty: An Application of the One Parameter Logistic Model

Frans G.M. Kleintjes, Gerrit Staphorsius, and Norman D. Verhelst

CITO, Dutch National Institute for Educational Measurement

Theoretical background and rationale

Research into the readability or comprehensibility of texts has a history of about 70 years. The object of most research workers in this field has been the development of an instrument to predict the readability of texts. For educational purposes it is important that the readability of texts can be related to differences in reading ability, because the reading ability of pupils differs widely within grades.

Purpose of the research

The purpose is to develop a set of domain referenced instruments to measure the reading ability of pupils in primary school and to predict the probability that they can comprehend a text of a certain difficulty.

Research design and method

The main part of the instruments consists of a system of 'text tests'. A test consists of three or four texts of about 1,000 words each. Every text covers five to eight items: paragraphs in which a part of a sentence has been deleted. Pupils are asked to complete the paragraphs, by choosing the right alternative out of five. The items have been constructed in such a way that assimilation of the context is both necessary and sufficient to give the right answer. The paper will focus on the validation of the instrument used and on the relation between readability and reading ability, applying the one parameter logistic model (OPLM), one of the models in Item Response Theory.

All 256 text test items in 42 texts were calibrated in an incomplete design, covering 20,000 records. OPLM combines the properties of the Rasch model and the advantages of two parameter IRT models: different discrimination values for each item are tested and item parameters are estimated by the conditional maximum likelihood method. Sample independence is preserved in this way.
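As a rough illustration of the model named above, the sketch below computes the probability of a correct response under a one parameter logistic model with a fixed integer discrimination index per item, and inverts it to find the ability at which a chosen success probability is reached. The function names and the idea of a target probability are assumptions made for the example; the sketch does not reproduce the OPLM program or its conditional maximum likelihood estimation.

```python
import math


def oplm_probability(theta, difficulty, discrimination_index):
    """Probability of a correct response for ability `theta`.

    The logit is a_i * (theta - b_i), where a_i is a fixed integer
    discrimination index and b_i the item difficulty.
    """
    z = discrimination_index * (theta - difficulty)
    return 1.0 / (1.0 + math.exp(-z))


def ability_for_probability(difficulty, discrimination_index, target=0.5):
    """Ability at which the success probability equals `target`.

    `target` is a hypothetical criterion chosen for illustration.
    """
    return difficulty + math.log(target / (1.0 - target)) / discrimination_index
```

With a discrimination index of 1 for every item, the first function reduces to the Rasch model.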

Results

The application has led to an essential result for the development of a set of domain referenced tests: an index for each text test item that is based on the minimum reading ability required to accomplish the item. This index is not only an index of reading ability but, by a property of IRT, also an index of the difficulty or readability of the items. It is shown that a valid and reliable prediction of the "cloze-readability" of the items can be made using their IRT indices.

Implications of the results

We can scale text test items according to readability. The domain referenced interpretation of the scores on the text tests has been built on this characteristic. We are able to measure the reading ability of pupils, enabling us to refer them to texts they are likely to comprehend.

References

Staphorsius, G. and Krom, R.S.H. 1985: Predictie van leesbaarheid [Prediction of readability]. Tijdschrift voor Taalbeheersing 7, 3, 192-211.

Verhelst, N.D., Glas, C.A.W. and Verstralen, H.H.F.M. 1993: OPLM, One Parameter Logistic Model. Computer Program and Manual. Arnhem: CITO.

Equating National Exams of Reading Comprehension in the Foreign Language

Rob van Krieken

CITO, Dutch National Institute for Educational Measurement

Background and purpose

The procedures for the construction of Dutch national exams of reading comprehension and for setting cut-off scores have remained roughly unchanged for over twenty years. Construction procedures are characterized by thorough screening rather than pretesting; cut-off scores are influenced by procedures and percentages of fails rather than by equating. CITO carried out two studies to demonstrate the necessity and feasibility of equating procedures using IRT methodology.

Research design and methods

The first study equated exams from 1984 till 1990 with the same old exam using section post-equating and producing differences in mean difficulty. Traditional estimates and IRT estimates lead to the same cut-off scores.

The second study was part of a large scale project the Inspectorate had commissioned, investigating the equivalence of cut-off scores in 17 content subjects as well as differences in populations, using teachers' estimates and empirical data. Data were scaled using IRT methodology, producing estimates of the mean score candidates in 1991 would have got on previous exams and comparing these with the actual mean scores of previous populations.

Results

The first study demonstrated differences between exams and showed that the estimates were robust. The second study showed that teachers' estimates were consistent, but correlated only moderately with pupils' results. Here again cut-off scores differed.

Implications

About one out of every six previous cut-off scores in the second study turned out to be not equivalent to the most recent one. The population means varied too, so the distribution should not be taken as a starting point for setting the norm. Section post-equating seems to be an efficient way of equating. Acting upon the outcomes of the second study, the State Secretary for Education and Science has provided funds for introducing and maintaining equating as a standard procedure in central exams.

References

Glas, C.A.W. (1989). Contributions to Estimating and Testing Rasch models. PhD Thesis, Twente University.

Krieken, R. van. (1990) Equating in the Dutch centralized examinations. Unpublished paper IAEA Maastricht.

Krieken, R. van. (1993) Ontwikkeling van examennormen. Verslag van een onderzoek t.b.v. de inspectie [Development of examination norms: Report on research commissioned by the Inspectorate]. CITO, Arnhem.

A Qualitative Approach to Monitoring Examiner Conduct in the Cambridge Assessment of Spoken English (CASE)

Anne Lazaraton

The Pennsylvania State University

Theoretical background and rationale

Recently, language testers have begun to look beyond traditional statistical approaches to estimating the reliability and validity of the measurement instruments employed. One promising avenue of inquiry is the use of qualitative methods to understand the nature of oral proficiency testing. For example, discourse analytic techniques have revealed the structure of and interaction in oral interviews, the types of language candidates produce in them, and the types of modifications that interviewers make to accommodate their interlocutors.

Purpose of the research

Along these lines, this paper describes how discourse analysis was used to understand intra-rater and inter-rater reliability in such a context. While the traditional concern has been pinpointing (and rectifying) inconsistency of final ratings across assessors and/or over time for one assessor, an implicit assumption on which these 'product' ratings are based is that assessor conduct is also consistent: we cannot ensure that all candidates are given the same number and kinds of opportunities to display their abilities unless examiners conduct themselves in similar, prescribed ways.

Method

To test this assumption, a study of the three-part Cambridge Assessment of Spoken English (CASE) procedure was undertaken. Audiotapes of 58 encounters conducted by 10 trained assessors were transcribed and analyzed for the interviewers' adherence to the prespecified CASE Interlocutor Frame, an agenda which prescribes the manner and order in which question and instructional prompts occur during the assessment.

Results

Results indicated significant variation in the wording of the prompts, in the degree to which particular prompts were used or avoided, and in the ways in which candidates responded to these modifications. Based on these findings, a template has been developed to provide feedback to assessors on how they might improve their interviewing techniques and to monitor their future performance.

Implications

Ultimately, this effort will assist CASE test developers and trainers in determining quickly and efficiently how consistently the assessors are performing during training, certification and 'live' encounters; it is hoped that this type of monitoring procedure can also be adapted for other face-to-face oral examinations. Such information, in conjunction with traditional 'product' ratings, can provide a more accurate estimate of reliability in oral assessment contexts.

Taking a Multi-Faceted View of the Unidimensional Measurement from Rasch Analysis in Language Tests

Tony Lee

Centre for Applied Linguistics and Languages, Griffith University

Theoretical background and rationale

The growing popularity of Item Response Theory (IRT) and Rasch analysis in language testing has given the field a much needed impetus since the days of communicative testing. The rigour of IRT lies in what Wright and Masters call the "universal characteristic of all measurements" (Wright & Masters 1982, 2), i.e. the requirement in any measurement that the things being measured should be assigned linear magnitudes along a single dimension which is stable. This is a condition many language testers would consider over-restrictive (e.g. Bachman 1990, 265) and many practitioners of IRT feel the need to justify (e.g. Henning et al. 1985). This paper aims to look into the question of dimensionality in IRT by employing multi-faceted Rasch analysis (Linacre & Wright 1990) in language testing.

Purpose of the research

The research was carried out to help decision making regarding ESL proficiency level for incoming undergraduate students at the end of an intensive pre-sessional ESL course. An ESL proficiency test was designed and administered to the students taking the course and to a reference group of ESL students who were known to be above the minimum ESL proficiency level required.

Research design and methods

Multi-facet Rasch analysis was applied. Two facets in the data were examined. The first had to do with the separation and the ordering of the student and the reference group. The second facet was the separation of the sub-tests in the test battery: grammatical accuracy and discourse sensitivity.

Results

The results of the analysis indicated that the reference group was significantly higher in terms of their logit standing than the student group and that the discourse sensitivity sub-test was significantly more difficult than the grammatical accuracy sub-test.

Implications

The research is an interesting example of a possible extension of item response theory from a purely measurement model to a general research tool in applied linguistics.

References

Bachman, L.F. 1990. Fundamental Considerations in Language Testing. London: Oxford University Press.

Henning, G., Hudson, T. & Turner, J. 1985. Item response theory and the assumption of unidimensionality for language tests. Language Testing, 2, 141-154.

Linacre, J.M. & Wright, B.D. 1990. FACETS. Chicago: MESA Press.

Wright, B.D. & Masters, G.N. 1982. Rating Scale Analysis. Chicago: MESA Press.

Linguistic Accuracy versus Coherence in Assessing Examination Answers in Content Subjects

Yasmeen Lukmani

University of Bombay / Research Centre for English & Applied Linguistics, University of Cambridge

Theoretical background and rationale

The role of English in academic success is the wider context of this study, and one aspect of this is explored here. Academic success is considered in terms of the assessment of student performance in the written examination in content subjects, and the relative roles of linguistic accuracy and coherence in such assessment are analyzed.

Purposes of the research

The specific questions addressed here are:

- how much do well-formed sentences contribute to academic achievement in content subjects in the written examination?

- how much, in comparison, do coherent texts contribute to it?

Research design and methods

In order to determine this, student answer scripts at freshman level in three subjects, namely, Economics, Logic and Zoology, at the University of Bombay, were selected and edited to provide four different versions which were then assessed by subject teachers, English teachers and native speakers, as independent scripts. Statistical and descriptive methods of analysis were employed.

Results

The experiment provides insights into the variable requirements for coherence and grammaticality in the different subject areas as well as the variations in demand for these features by the different assessor groups. All assessors value the 'ideal' version, edited for both coherence and grammar, highest. However, when these two features occur in isolation, subject teachers in Economics find grammatical correctness of greatest importance, whereas Zoology teachers value coherence more, while Logic teachers value the two only in conjunction. In spite of field-specific differences, subject teachers, in general, rate coherent scripts higher than grammatically correct scripts, while the opposite is true for English teachers, though they also indicate the importance of coherence.

Implications of the results

It is suggested that English teachers incorporate more coherence-developing activities into their classes, considering the importance they themselves assign to it. This would enable them to contribute substantially to the study of content subjects and thus realise more adequately the basic role of General English teaching.

Rater Characteristics and Rater Bias: Implications for Training

Tom Lumley and Tim McNamara

University of Melbourne

Theoretical background and rationale

Recent developments in multi-faceted Rasch measurement (Linacre, 1989) have made possible new kinds of investigation of aspects (or 'facets') of performance assessments. Relevant characteristics of such facets (for example, the relative harshness of individual raters) are modelled and reflected in the resulting person ability measures.
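To make the idea of modelling rater harshness concrete, the sketch below computes rating category probabilities under a many-facet rating scale formulation in which the log-odds of category k over category k-1 equal candidate ability minus item difficulty minus rater severity minus a step threshold. It is an illustrative sketch under that assumed formulation; the function and parameter names are hypothetical and it does not reproduce the FACETS program or any estimation procedure.

```python
import math


def rating_probabilities(ability, item_difficulty, rater_severity, thresholds):
    """Probabilities of rating categories 0..K for one candidate-item-rater.

    Assumed model: log(P_k / P_{k-1}) =
        ability - item_difficulty - rater_severity - thresholds[k - 1]
    """
    base = ability - item_difficulty - rater_severity
    # Cumulative sums of the adjacent-category logits; category 0 is fixed at 0
    logits = [0.0]
    for step in thresholds:
        logits.append(logits[-1] + base - step)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(ability, item_difficulty, rater_severity, thresholds):
    """Model-expected rating: sum over categories of k * P(k)."""
    probs = rating_probabilities(ability, item_difficulty,
                                 rater_severity, thresholds)
    return sum(k * p for k, p in enumerate(probs))
```

For the same candidate and item, a rater with a larger severity value yields a lower expected rating, which is how relative harshness shows up in such a model.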

Bias analyses, that is, analyses of interactions between elements of any facets, can also be carried out (e.g., for the facet 'person', an element is an individual candidate). This permits investigation of the way a particular aspect of the test situation may elicit a consistently biased pattern of responses from a rater. Stahl and Lunz (1992) used these techniques to produce judge performance reports, which provide individual raters with information on their relative characteristics as raters, their consistency and any individual biased ratings, in a judge-mediated examination of histotechnology.

Purposes of the research

The purpose of the research is to investigate the use of these analytical techniques in rater training for the speaking sub-test of the Occupational English Test (OET), a specific purpose ESL performance test for health professionals.

Research design and methods

The test involves a role-play based, profession-specific interaction. Data are presented from two rater training sessions (13 raters, 10 candidates) separated by an 18-month interval and intervening operational test administrations (6 of the above raters, 70 candidates).

Results

The analysis is used to establish consistency of rater characteristics over the 18-month period, using rater measures and bias analyses.

Implications of the results

The paper addresses the question of the stability of rater characteristics over rating occasions, which has practical implications in terms of the accreditation of raters and the requirements of data analysis following test administration sessions. It also reports on the use of feedback to raters of the results of this analysis as part of the rater training process. The paper also has research implications concerning the role of multi-faceted Rasch measurement in understanding rater behaviour in performance assessment contexts.

References

Linacre, J.M. 1989: Many-facet Rasch Measurement. Chicago: MESA Press.

Stahl, J.A. and Lunz, M.E. 1992: Judge Performance Reports: Media and Message. Presentation to American Educational Research Association, San Francisco, 1992.

Validating the Certificates of Foreign Language Proficiency: The usefulness of qualitative validation techniques

Sari Luoma

Language Centre for Finnish Universities

Background and rationale

The Certificates of Foreign Language Proficiency are general purpose tests of language use designed for the adult learner. The paper deals with the techniques of validation used when designing the test specifications and implementing these into the first versions of the tests. The data comes from the development of tests in one language, English, on two levels, Basic and Intermediate. Two versions of the Basic and three of the Intermediate test have been piloted with small samples. There is some quantitative data, but due to the kinds of information needed, qualitative validation had a greater role.

Purposes of the research

The practical purpose of the research was to validate the test specifications and tests before their official use. While doing this, several kinds of qualitative validation procedures were used. The purpose of this paper is to examine the relative usefulness of these techniques.

Research design and methods

The initial test specifications were presented to 15 teachers of adults for comments. Test specialists were interviewed on test versions. 117 testees participated in the tests. Two questionnaires, at the beginning and at the end, focused on the testees' past experience in instruction, testing and language use, and on self-evaluation. A short questionnaire followed each task. There were 27 post-test interviews, and on the Intermediate test, three think-aloud protocols. Some basic statistical operations could be used on the test scores; sample size made the use of advanced statistics impossible.

Results

Statistical operations were most useful for examining levels of difficulty, response analysis for finding flaws in task settings, questionnaires for getting general opinions on item types, and introspection and interviews for getting at processing and for interpreting all the testees' answers on the tasks and the questionnaires.

Implications of the results

All the validation data suggests that the operationalisation of the test specifications in reading and listening comprehension will have to be more precisely defined. Think-aloud protocols were extremely useful at this stage of test development, and some more of them are needed. Interviews, possibly in a modified version, could be suggested as a regular feedback system.

A Pilot Study into Task Difficulty Factors in a Test of English for Specific Purposes

Paul McCann and Alex Teasdale

British Council Madrid

Theoretical background and rationale

The research has as its focus specific factors which, it is posited, affect task difficulty in Listening in Air Traffic Control. Standard phraseology determines how such information is to be realised on radiotelephony frequencies. However, non-standard elements and deviations in phraseology, acoustic noise on radio frequencies and speaker speed and accent may increase communication difficulties in ATCO/pilot interactions.

Purposes of the research

The research arises out of project interest in factors which contribute to item difficulty. Two main areas were investigated:

- features which affect item difficulty;

- the accuracy of project team members' predictions of item difficulty.

Research designs and methods

The research design made use of item groups in which individual items focus on specific information in audio recordings of live Air Traffic Control transmissions. Project team members identified difficult and easy items and attempted to identify features in the audio recording which were responsible for the level of difficulty. The item groups were administered to practising ATCOs and trainees. They were asked to indicate how confident they were that they had responded correctly and to give reasons for the perceived ease or difficulty of items. Factors investigated include the presence of interference on the tape, violations of predictable discourse order, the number of items of information, speed of delivery and accent, as well as the effect on actual and perceived difficulty of items when the recording is played twice.

Results

The results present information on:

- fit between testers' predictions of item difficulty/ease and candidate performance after one and two administrations;

- fit between testers' predictions of the reasons for item difficulty/ease and candidate reports after first and second administration;

- the effects of allowing candidates to listen to an audio recording twice.

Implications of the results

The research addresses an issue which is common to all test development situations: why are some items more difficult than others? In terms of the present project it is hoped that information from the research will contribute to greater explicitness in describing what is being tested and will allow for the manipulation of elements which are identified as contributing to task difficulty.

The Effect of Interlocutor and Assessment Mode Variables in Offshore Assessments of Speaking Skills in Occupational Settings

Tim McNamara and Tom Lumley

University of Melbourne

Theoretical background and rationale

The increasing demand for performance assessment of speaking skills in second languages has led to logistical complications, for example, the delivery of tests in offshore locations. One solution to the problem has been to train native speaker interlocutors to carry out a series of oral interactions with the candidate, with assessment from audio recordings of the test session postponed and conducted centrally by a small team of trained raters. This technique is currently used in two large scale occupationally related ESP tests administered internationally on behalf of the Australian Government. But these procedures raise questions about the effect of such facets of the assessment situation as interlocutor variables and the quality of the audiotape recording. Recent developments in multi-faceted Rasch measurement (Linacre, 1989) have significantly broadened the possibilities for investigation of these issues.

Purposes of the research

The research presented in this paper investigates potential problems associated with the above approach to the offshore testing of speaking skills.

Research design and methods

Data from audiotape-based assessments of approximately 70 offshore candidates from two administrations of the Occupational English Test, an advanced level ESP test for health professionals, are considered. In addition to multiple ratings of candidate performance, each recording is rated for perceptions of the competence of the interlocutor, the rapport established between the candidate and the interlocutor, and the audibility of the interaction. These aspects of the assessment situation are treated as facets in a multi-faceted Rasch analysis of the data.

Results

The results of the analysis reveal the effects of interlocutor variability and audiotape quality on ratings.

Implications of the results

The paper concludes with an evaluation of the overall feasibility of the procedure, and implications for test administration arrangements are considered. The study is also a further demonstration of the application of multi-faceted Rasch measurement in performance assessment settings.

Reference

Linacre, J.M. 1989: Many-facet Rasch Measurement. Chicago: MESA Press.

A Study of the Decision-Making Behaviour of Composition Markers

Michael Milanovic, Nick Saville, and Shen Shu Hong

University of Cambridge Local Examinations Syndicate

Theoretical Background and Rationale

An investigation of the markers' thought processes in the marking of examination compositions is an important issue in the assessment of L2 writing. Inter-rater reliability addresses the question of consistency, yet very little is known about the decision-making processes (or strategies) which are employed by the markers in making the assessment. Lack of knowledge in this area makes it more difficult to train markers to make valid and reliable assessments.

Purposes of the Research

An initial study was designed and carried out to explore the thought processes of examiners for Cambridge EFL compositions. In designing the study, a tentative model of examiner decision-making behaviour was developed based on the findings of recent research (Cumming, 1990; Vaughan, 1991) and the authors' own work and experience in this area.

The study investigated: (1) markers' decision-making processes while evaluating EFL compositions, (2) the effect of marking speed on the decision-making behaviour, (3) the relationship between markers' background and their thought processes, and (4) the effect of the proficiency level of the scripts on the decision-making behaviour.

Research Design

Sixteen markers with four different experiential backgrounds participated in the study: (a) experienced markers of EFL examination compositions, (b) inexperienced markers of EFL examination compositions, (c) inexperienced EFL teachers, and (d) experienced teachers of English as a mother-tongue.

The marking processes of the examiners were reported and recorded by means of: (a) retrospective written protocols, (b) speak-aloud introspection while marking, (c) a questionnaire, and (d) an interview.

Results

The relationships between markers' thought processes and marking speed, background characteristics, and the proficiency level of the scripts were examined. Four general approaches to marking were identified as a result of the analysis. In addition, it was found that the examiners focused on eleven composition elements.

Implications of the Results

Follow-up studies are currently being carried out to confirm the findings and an attempt will be made to establish a full model of the behaviour. The paper will report on the work carried out to date and also on the follow-up work which will be completed by mid-1993.

References

Cumming, A. 1990: Expertise in Evaluating Second Language Compositions. Language Testing 7, 31-51.

Vaughan, C. 1991: Holistic Assessment: What Goes on in the Rater's Mind? In Hamp-Lyons, L. ed. Second Language Writing in Academic Contexts, 35, 400-409.

Towards a Checklist for Computer Aided Language Testing (CALT)

José Noijons

CITO, Dutch National Institute for Educational Measurement

With the arrival of CALL-material that includes tests and exercises it appears that conventional checklists used to ascertain the merits of particular tests and exercises are insufficient and that a special adaptation for CALL would have to be made. This seems all the more relevant as much CALL-material looks attractive enough but it is clearly lacking in validatory terms: the possibilities of the computer and the inventiveness of the programmers mainly determine the format of tests and exercises, causing possible harm to a fair assessment of pupils' language abilities.

In this workshop a new type of checklist specially designed for CALL/CALT will be presented. It has the following format: a division is made between test content and test administration, before, during and after the actual test taking. It is hoped that such checklists may give teachers some support toward developing good classroom tests and help them recognize valid and reliable exercise/testing modules in existing CALL/CALT.

The relevance and practicability of the new checklist can be tested in this workshop. Participants will be able to work with all sorts of CALL/CALT material and decide for themselves whether this material meets with the standards that tend to be set for 'traditional' tests.

The Assessment of Writing by English and ESL Teachers

Kieran O'Loughlin

NLLIA Language Testing Centre, University of Melbourne

Theoretical background and rationale

This paper aims to investigate whether secondary teachers of English as a mother tongue and of English as a second language (abbreviated as "English" and "ESL" respectively) assess writing differently. Previous studies, based solely on the use of holistic (i.e. global) scoring methods, have consistently found no significant difference between the essay ratings of these teachers.

Purposes of the research

The current research project used both holistic and analytical scoring procedures, which made it possible to compare, firstly, the global essay ratings and, secondly, the essay totals of English and ESL teachers. The totals were calculated by combining the scores on five analytical categories (Arguments and evidence; Organisation; Appropriateness of language; Grammar and cohesion; Spelling and punctuation) and the global category.

Research design and methodology

Ratings given by four English and four ESL teachers to the same set of twenty native speaker ("English") and twenty non-native speaker ("ESL") essays were collected as data. This data was analyzed using both the Intra-class and Pearson correlations as well as t-tests (dependent samples).
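As an illustration of two of the statistics named above, the sketch below computes a Pearson correlation and a dependent-samples t statistic for two sets of essay totals. The numbers are invented for the example and are not data from the study; the function names are hypothetical, and the intra-class correlation is not shown.

```python
import math
import statistics

# Hypothetical totals awarded to the same eight essays by two rater groups
# (illustrative numbers only, not data from the study).
english_totals = [18, 22, 15, 20, 17, 24, 19, 16]
esl_totals = [20, 23, 17, 21, 19, 24, 22, 18]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def paired_t(x, y):
    """Dependent-samples t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


r = pearson_r(english_totals, esl_totals)
t, df = paired_t(english_totals, esl_totals)
print(f"r = {r:.3f}, t({df}) = {t:.3f}")
```

A high correlation combined with a clearly non-zero t statistic would indicate that two groups rank the essays similarly while differing in overall severity, which is why the two statistics are reported together.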

Results

While no significant difference was found between the global essay ratings of the two rater groups, a comparison of the average total scores indicated that, overall, English teachers rated all of the essays significantly more harshly than ESL teachers. This was also true for ESL essays and almost so for English essays taken separately. The main categories contributing towards this effect were found to be Appropriateness, Grammar and cohesion, and Spelling and punctuation. Organisation was also a contributing factor, but not for English essays considered separately.

Implications of the results

These findings suggested that the analytical scoring method may be more faithful to real dissimilarities which exist between raters of different backgrounds and professional experience than the holistic scoring method in the assessment of writing.

Reporting Reading Test Results in Grades

Alastair Pollitt

Research Centre for English and Applied Linguistics, University of Cambridge

Purposes of the research

Language comprehension is an invisible process, and therefore difficult to describe, but grade related assessment is popular with test users - including teachers and students. The study was an exercise in trying to develop a rigorous basis for reporting comprehension ability through a set of graded descriptors.

Theoretical background and rationale

Accurate description of the comprehension skills used in testing requires that the test activity be a purpose-directed use of language. Summary completion was developed as a testing technique within a teleological structure.

Research design and methods

Using data from a national monitoring survey of English, the demands of test questions were analyzed, and the order of difficulty of the questions was used to determine grades of comprehension. Task specific descriptors were generated from the analysis of questions.

Results

Useful grades for reporting survey results to teachers and the public were established. It was not possible to generalise these across tasks with confidence.

Implications of the results

The importance of text and reading purpose in reading assessment cannot be overestimated. We are still a long way from being able to produce generalisable grade descriptors of the ability to comprehend language.

What Raters Really Pay Attention to

Alastair Pollitt and Neil L. Murray

Research Centre for English and Applied Linguistics, University of Cambridge

Purposes of the research

'Assessor oriented' rating scales are really 'diagnosis-oriented'. A rating scale truly orientated to helping the assessor would include only the aspects of oral performance that people actually use in making their judgements about speakers' proficiency, rather than aiming for a logically consistent grid of theoretical descriptors. This study evaluates a methodology for exploring raters' perceptions of oral performances when they are asked to judge them for evidence of proficiency.

Theoretical background and rationale

Two theoretical positions are combined: Thurstone's paired comparison technique for accurate measurement of psychological dimensions provides the quantitative basis for a scale of proficiency, while Kelly's technique for the elicitation of personal constructs provides the qualitative information about how the judges perceived the performances. The two combine to give meaning to the scale.
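To make the quantitative half of this design concrete, the sketch below derives scale values from paired-comparison judgements using Thurstone's Case V simplification. It is a minimal illustration under stated assumptions: the win counts are invented, every pair is assumed to have been judged by the same number of judges, and the abstract does not say which variant of Thurstone's model the authors applied.

```python
from statistics import NormalDist

# Hypothetical win counts: wins[i][j] = number of judges who rated
# performance i as more proficient than performance j (invented data).
performances = ["A", "B", "C"]
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
JUDGES = 10  # assumption: each pair is judged by the same 10 judges


def case_v_scale(wins, n_judges):
    """Thurstone Case V scale values from a matrix of paired-comparison wins."""
    norm = NormalDist()
    k = len(wins)
    scale = []
    for i in range(k):
        z_values = []
        for j in range(k):
            # A performance compared with itself is treated as a 50/50 split.
            p = 0.5 if i == j else wins[i][j] / n_judges
            # Clamp unanimous results so the inverse normal CDF stays finite.
            p = min(max(p, 0.01), 0.99)
            z_values.append(norm.inv_cdf(p))
        # The scale value is the mean z-score across the row.
        scale.append(sum(z_values) / k)
    return scale


scale_values = case_v_scale(wins, JUDGES)
for name, value in zip(performances, scale_values):
    print(f"{name}: {value:+.2f}")
```

The resulting values place the performances on a single interval scale centred on zero; the qualitative constructs elicited from the judges are then what give the positions on that scale their meaning.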

Research design and methods

Video recordings of CPE students were presented, in pairs, to judges, who (a) nominated the 'more proficient', (b) compared the two performances as if in a Kelly therapy session. Recordings of these sessions were analyzed to extract the personal construct systems being used.

Results

The two techniques worked well together. Considerable, but not complete, consistency was observed, and explained. It is clear that the dimension of oral proficiency, as perceived by these judges at least, is not composed of the same traits at different levels. At higher levels, the most salient features of the performances concerned native-like behaviour and emphasis on content, while at lower levels grammatical form and hesitancy were more important criteria for making decisions.

Implications of the results

Raters will always find it difficult to judge standards of proficiency using scales that contain irrelevant detail and consist of unnecessary components. Personal Construct Theory, supported by Paired Comparison Scaling, provides a powerful tool for investigating the behaviour of raters.

Performance Conditions: Stability of Effects Across Cultural Groups and Task Types in the Assessment of Spoken Language

Don Porter

CALS, University of Reading

Theoretical background and rationale

A number of frameworks have been proposed over the years for describing the various features of testing methods which might have an effect on the linguistic performance of the testee (e.g. Carroll, 1968; Bachman, 1990; Weir and Bygate, 1990). Some features relate purely to the fact that language is being tested (e.g. test instructions, explicitness of criteria for assessment), while others appear to be similar to those occurring in natural untested use ("performance conditions"). Most frameworks for describing performance conditions suffer from being under-researched, in that, amongst other things, the strength and stability of the effects of various proposed performance conditions are not known.

Purpose of the research

The research aims to shed light on whether specified performance conditions actually do have a significant effect, in terms of enhancing or weakening language learners' performance in spoken language, and on whether any significant effect observed is stable across different cultural groups and different tasks. This study will focus on age differences between speakers, as it has been suggested that this condition in particular will have a marked effect on the spoken language performance of Japanese learners.

Research design and methods

Learners of English from two contrasting cultural backgrounds (Japanese [n = 32] and Arabs [n = 32]) will each undertake two oral interaction tasks (problem-solving in pairs; interview). Each of the two tasks will be performed under a distinct performance condition (± clear difference in age between speakers). The task-condition relationship will be systematically varied, and the study will control for gender and proficiency level of participants.

All interactions will be recorded on video and assessed by multiple raters on analytic and holistic rating scales. In addition, some detailed analysis of language produced under the two conditions will be undertaken, drawing on Bygate (1987).

Results

This experiment has just been started, so results are not yet available.

Implications

Insofar as the effects of such performance conditions can be shown to be stable across tasks and cultural groupings, systematic account will need to be taken of them in the design of test tasks.

The Development of a New Measure of L2 Vocabulary Knowledge

John Read

Victoria University of Wellington

The design of any vocabulary test represents a compromise among several competing requirements. It is normally desirable to include a reasonable number of words in order to sample adequately the type of vocabulary to be assessed. However, this usually means that knowledge of individual words is assessed in a minimal fashion, which does not reflect how well each word is known. Psycholinguistic research on word association suggests the basis for a new type of test format that can assess depth of knowledge of words in an economical way. This "word associates" format requires test-takers to select from a list words that are semantically associated with a particular stimulus word.

The purpose of the research, then, was to investigate the validity of a test employing the word associates format as a measure of the academic vocabulary knowledge for learners of English for academic purposes (EAP). After preliminary tryouts and revisions, a formal trial of the test was conducted towards the end of an EAP proficiency course. In addition to the test scores, think-aloud protocols were obtained from individual test-takers and a concurrent measure of vocabulary knowledge was administered. The Rasch Partial Credit Model was used to analyze the test score data, while the other data provided further elucidation of the results.
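For readers unfamiliar with it, the Rasch Partial Credit Model is usually stated as follows. This is the standard textbook form, given here as background only; the symbols are generic and the abstract does not specify how items were scored into categories.

```latex
P(X_{ni} = x) =
  \frac{\exp\left(\sum_{k=0}^{x} (\theta_n - \delta_{ik})\right)}
       {\sum_{h=0}^{m_i} \exp\left(\sum_{k=0}^{h} (\theta_n - \delta_{ik})\right)},
  \qquad x = 0, 1, \dots, m_i
% \theta_n: ability of test-taker n
% \delta_{ik}: difficulty of step k of item i
% m_i: maximum score on item i
% Convention: \sum_{k=0}^{0} (\theta_n - \delta_{ik}) \equiv 0
```

Under this model a test-taker can earn partial credit on an item, which suits a format in which several associates may be selected for each stimulus word.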

The results showed that the test is reliable and has a reasonable level of concurrent validity. However, the Rasch analysis also identified a higher than acceptable number of misfitting persons. This and the protocol analysis suggest that test-taker behaviour - in particular, a willingness to guess - plays a significant role in test performance that requires further investigation.

Whatever the ultimate practical value of this particular test, the word associates format has potential as a tool in research on L2 vocabulary acquisition. However, tests designed for this purpose will require items that are less heterogeneous in structure than those in the present test, so that it is possible to define more specifically what kind of lexical knowledge is required for successful performance.

Procedures for Improving the Predictive Value of Indirect Tests: Study Based on Scores from the OPI and TWE, and an Item-by-Item Analysis of a Multiple-Choice Examination

Daniel J. Reed

Indiana University

Theoretical background and rationale

Discrete-point, multiple-choice type tests such as the Test of English as a Foreign Language (TOEFL) have been criticized both for having low face validity and for lacking predictive validity. Moreover, the chances of developing valid, indirect tests that reliably predict performance on direct tests would seem to be poor, given the typically low correlations reported in studies that have compared the two approaches. However, the research primarily has been based on traditional methods of marking items and reporting scores. The idea of implementing alternative strategies has not been thoroughly investigated, but could potentially yield different results.

Purposes of the research

The aim of this study was to test the possibility that the predictive value of indirect tests could be increased by orienting the selection and scoring of items toward the "external" criterion "level of proficiency".

Research design and methods

An investigation was conducted in which 69 adult learners were administered three tests: the OPI, the TWE, and a multiple-choice test that correlates very strongly with the TOEFL, and is very similar in format. This made it possible to analyze responses on the multiple-choice test, item by item, in relation to performance on the two direct tests, which in turn contributed to an assessment of the "external consistency" of the test. The procedures examined included weighting items with item discrimination indexes based on upper and lower proficiency groups, and then with coefficients generated by the statistical technique, discriminant analysis.
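The first weighting procedure mentioned above can be sketched as follows. This is a minimal illustration, assuming the common upper-lower definition of the discrimination index (proportion correct in the upper group minus proportion correct in the lower group); the response data and function names are hypothetical, and the discriminant-analysis weighting is not shown.

```python
# Hypothetical item responses (1 = correct, 0 = incorrect) for learners
# already sorted into upper and lower proficiency groups by a direct test.
upper_group = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]
lower_group = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]


def proportion_correct(group, item):
    """Share of a group answering the given item correctly."""
    return sum(person[item] for person in group) / len(group)


def discrimination_indexes(upper, lower):
    """Upper-lower discrimination index for each item."""
    n_items = len(upper[0])
    return [
        proportion_correct(upper, i) - proportion_correct(lower, i)
        for i in range(n_items)
    ]


def weighted_score(responses, weights):
    """Score a response vector, weighting each item by its index."""
    return sum(w * x for w, x in zip(weights, responses))


weights = discrimination_indexes(upper_group, lower_group)
new_candidate = [1, 1, 0, 1]
raw = sum(new_candidate)
weighted = weighted_score(new_candidate, weights)
print(weights, raw, weighted)
```

Items that separate the proficiency groups well contribute more to the weighted score, while an item answered equally often by both groups contributes nothing. Because the weights are estimated from the same cases they are meant to separate, they need to be checked on independent cases before any gain in prediction is trusted.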

Results

Predictions based on the alternative scoring procedures were better than predictions based on traditional scoring procedures when the scoring schemes were tested on the same cases from which they were derived. However, the results of preliminary tests on small groups of independent cases were inconclusive. Modifications of this study that might help to resolve this issue are outlined.

Implications

Although current thought on the nature of language and communicative