Document Type : Research Paper


1 English Language and Literature Department, University of Isfahan, Isfahan, Iran

2 Applied Linguistics Department, University of Isfahan, Isfahan, Iran


The degree of difficulty is perhaps one of the most significant characteristics of a test; however, no empirical research on the difficulty of the MSRT test has been carried out. The current study attempts to fill this gap by utilizing a two-parameter item response model to investigate the psychometric properties (item difficulty and item discrimination) of the MSRT test. The Test Information Function (TIF) was also computed to estimate at what range of ability the test best distinguishes among respondents. To this end, 328 graduate students (39.9% men and 60.1% women) were randomly selected from three universities in Isfahan, and a version of the MSRT English proficiency test was administered to them. The results supported the unidimensionality of the components of the MSRT test. Analysis of the difficulty indices of the total test revealed that 14% of the items were easy or very easy, 38% were of medium difficulty, and 48% were difficult or very difficult; thus, only 38% of the items displayed satisfactory difficulty. As for discrimination, 14% of the total items were classified as nonfunctioning: they discriminated negatively or did not discriminate at all. A further 7% discriminated poorly, 17% moderately, and 62% highly or perfectly, although the latter differentiated mainly between high-ability and higher-ability test takers. Items that were too easy (14%) or too difficult (48%) are one potential reason why some items have low discriminating power. A further inspection of the items by the MSRT test developers is therefore indispensable.

