Airasian, P. W. (1988). Measurement driven instruction: A closer look. Educational Measurement: Issues and Practice, 7(4), 6-11.
Andrich, D. (1988). Rasch models for measurement. Beverly Hills, CA: Sage Publications.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1 Suppl.), I7–I16.
Andrich, D. (2010). Understanding the response structure and process in the polytomous Rasch model. In M. Nering & R. Ostini (Eds.), Handbook of polytomous item response theory models (pp. 123–152). New York, NY: Routledge.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bachman, L. F., & Eignor, D. R. (1997). Recent advances in quantitative test analysis. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education, Vol. 7: Language testing and assessment (pp. 227–242). Dordrecht: Kluwer Academic.
Bachman, L. F. & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use. Oxford: Oxford University Press.
Baghaei Moghadam, P. (2009). Understanding the Rasch model. Mashhad: Sokhangostar Publications.
Baker, F. B. (1977). Advances in item analysis. Review of Educational Research, 47, 151–158.
Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.
Baker, F. B. (1989). Computer technology in test construction and processing. In R. L. Linn (Ed.), Educational measurement (pp. 409–428). Macmillan Publishing.
Baker, F. B., & Kim, S. H. (2017). The basics of item response theory using R. Berlin: Springer.
Boopathiraj, C., & Chellamani, K. (2013). Analysis of test items on difficulty level and discrimination index in the test for research in education. International Journal of Social Science & Interdisciplinary Research, 2, 189–193. Available at indianresearchjournals.com.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York, NY: McGraw-Hill.
Brown, J. D. (2012). Classical test theory. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 323–335). Routledge.
Brown, J. D. (2013). Classical theory reliability. In A. Kunnan (Ed.), Companion to language assessment, Vol. 3. Hoboken, NJ: Wiley Blackwell.
Bryson, M. (1974). Heavy-tailed distributions: Properties and tests. Technometrics, 16, 61–68. http://dx.doi.org/10.1080/00401706.1974.10489150
Bulut, O. (2015). Applying item response theory models to entrance examination for graduate studies: Practical issues and insights. Journal of Measurement and Evaluation in Education and Psychology, 6(2), 313–330.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Sage Publications.
Carlson, J. E., & von Davier, M. (2013). Item response theory. Princeton, NJ: Educational Testing Service.
Cohen, A. D. (1980). Testing language ability in the classroom. Rowley, MA: Newbury House Publishers.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Harcourt.
Deville, C., & Chalhoub-Deville, M. (1993). Modified scoring, traditional item analysis, and Sato’s caution index used to investigate the reading recall protocol. Language Testing, 10, 117–132.
Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Lawrence Erlbaum Associates Publishers.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16(Suppl. 1), 5–18.
Fallahian, E. & Tabatabaei, O. (2015). Construct validity of MSRT reading comprehension module in Iranian context. English Language Teaching, 8(9), 173-186.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357–381.
Farhady, H., & Hedayati, H. (2009). Language assessment policy in Iran. Annual Review of Applied Linguistics, 29, 132–141.
Farhady, H., Jafarpur, A., & Birjandi, P. (1994). Language skills testing: From theory to practice. Tehran: SAMT Publications.
Frey, B. B. (Ed.). (2018). The SAGE encyclopedia of educational research, measurement, and evaluation. Sage Publications.
Geranpayeh, A. (1994). Are score comparisons across language proficiency test batteries justified? An IELTS–TOEFL comparability study. Edinburgh Working Papers in Applied Linguistics, 5, 50–65.
Gilbert, S., & Newton, W. J. (1997). Principles of educational and psychological measurement and evaluation. Belmont, CA: Wadsworth.
Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
Green, R. (2019). Item analysis in language assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment: Volume I. Fundamental techniques. Routledge.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Haladyna, T. M. (2016). Item analysis for selected response items. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of Test Development (2nd ed), (pp. 392–407). New York, NY: Routledge.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York, NY: Routledge.
Hambleton, R. K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer Academic Publishers.
Henning, G. (1984). Advantages of latent trait measurement in language testing. Language Testing, 1, 123–133.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge: Newbury House Publishers.
Henning, G., Hudson, T., & Turner, J. (1985). Item response theory and the assumption of unidimensionality for language tests. Language Testing, 2, 141–154.
Janssen, G., Meier, V., & Trace, J. (2014). Classical test theory and item response theory: Two understandings of one high-stakes performance exam. Colombian Applied Linguistics Journal, 16(2), 167–184.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kiani, G. R., & Haghighi, M. (2006). The investigation of the TMU English proficiency test: Reliability related issues. Journal of Humanities, 16, 55–73.
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
Kohli, N., Koran, J., & Henn, L. (2015). Relationships among classical test theory and item response theory frameworks via factor analytic models. Educational and Psychological Measurement, 75(3), 389–405.
Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
Linacre, J. M. (2015). A user’s guide to WINSTEPS MINISTEP Rasch-model computer programs. Chicago, IL: Winsteps.com.
Loe, A. (2021). Intro to IRT. Available at https://aidenloe.github.io.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Malec, W., & Krzemińska-Adamek, A. (2020). A practical comparison of selected methods of evaluating multiple-choice options through classical item analysis. Practical Assessment, Research, and Evaluation, 25, Article 7. Retrieved from https://scholarworks.umass.edu/pare/vol25/iss1/7
Mehrens, W. A., & Lehmann, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Belmont, CA: Wadsworth/Thomson Learning.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.
Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics: Application of item response theory models in personality research. In R.W. Robins, R.C. Fraley, & R.F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 407-423). New York: Guilford.
Moses, T. (2017). A review of developments and applications in item analysis. In R. Bennett & M. von Davier (Eds.), Advancing human assessment: The methodological, psychological and policy contributions of ETS (pp. 19–46). Springer Open. http://dx.doi.org/10.1007/978-3-319-58689-2_2
Mousavi, A. (2009). An encyclopedic dictionary of language testing. Tehran: Rahnama Press.
Nguyen, T. H., Han, H. R., Kim, M. T., & Chan, K. S. (2014). An introduction to item response theory for patient-reported outcome measurement. The Patient, 7, 23–35. https://doi.org/10.1007/s40271-013-0041-0
Noori, M., & Hosseini Zadeh, S. (2017). The English proficiency test of the Iranian Ministry of Science, Research, and Technology: A review. International Journal of English Language & Translation Studies, 5(3), 21–26.
Osterlind, S.J. (1983). Test item bias. Beverly Hills: Sage Publications.
Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed response, performance, and other formats (2nd ed.). Boston, MA: Kluwer Academic.
Ockey, G. J. (2012). Item response theory. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 336–345). Routledge.
Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educational leaders. Boston, MA: Allyn & Bacon.
Rizopoulos, D. (2018). ltm: An R package for latent trait models under IRT. Retrieved from https://github.com/drizopoulos/ltm.
Robitzsch, A. (2019). sirt: Supplementary item response theory models. R package version 3.4-64. Retrieved from https://CRAN.R-project.org/package=sirt
Sahrai, R., & Mamagani, H. (2013). The assessment of the reliability and validity of the MSRT proficiency test. The Educational Assessment Journal, 10(3), 1–19 [In Persian].
Salehi, M. (2011). On the construct validity of the reading section of the University of Tehran English Proficiency Test. Journal of English Language Teaching and Learning, (222), 129–159.
Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), The companion to language assessment. Vol. 3. Hoboken, NJ: Wiley Blackwell.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education, 19 (pp. 405-450). Washington, DC: American Educational Research Association.
Traub, R. E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16, 8–14.
Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.
Wiersma, W., & Jurs, S. (1990). Educational measurement and testing. Needham Heights, MA: Allyn and Bacon.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Yu, C. H. (2010). A simple guide to the item response theory (IRT) and Rasch modeling. Retrieved from http://www.creative-wisdom.com.
Zimowski, M., Muraki, E., Mislevy, R. J., & Bock, R. D. (2002). BILOG-MG [Computer software]. Lincolnwood, IL: Scientific Software International.