Airasian, P. W. (1988). Measurement driven instruction: A closer    look.  Educational Measurement: Issues and Practice, 7(4), 6-11.
                                                                                                                Andrich, D. (1988) Rasch Models for Measurement. Sage  Publications,  Inc., Beverly Hills.
                                                                                                                Andrich, D. (2004). Controversy and the Rash model: A characteristic   of incompatible  paradigm? Medical Care, 42(I), 1–16.
                                                                                                                Andrich, D. (2010). Understanding the response structure and process in the polytomous  Rasch model. In M.  Nering & R. Ostini (Eds.),    Handbook of polytomous item response theory models (pp. 123– 152).   New York, NY: Routledge.
                                                                                                                American Educational Research Association, American Psychological
                                                                                                                Association, & National   Council    on     Measurement in Education      (1999). Standards    for    educational   and    psychological   testing.   Washington DC.
                                                                                                                Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
                                                                                                                Bachman,     L.   F., &   Eignor, D. R.  (1997). Recent    advances in     Quantitative test analysis.  In   C.  Clapham          &       D. Corson   (Eds.),   Encyclopedia   of   language   and   education,   language testing and assessment, Vol. 7, (pp. 227–242). Dordrecht: Kluwer Academic.
                                                                                                                Bachman, L.  F. &    Palmer, A. S.  (2010).     Language assessment      in practice:  Developing    language assessments and   justifying their use. Oxford: Oxford University Press.
                                                                                                                Baghaei Moghadam,P. (2009). Understanding the Rasch model.   Mashhad, Sokhangostar Publications.
                                                                                                                Baker, F. B. (1977). Advances in item analysis. Review of Educational  Research, 47, 151- 158.
                                                                                                                Baker, F. B. (1985). The basics of item response theory. Portsmouth,   NH: Heinemann.
                                                                                                                Baker, F. B. (1989). Computer technology in test construction and   processing. In R. L. Linn (Ed.), Educational measurement (pp. 409–428). Macmillan Publishing.
                                                                                                                Baker, F. B., & Kim, S. H. (2017). The basics of item response theory   using R. Berlin: Springer.
                                                                                                                Boopathiraj, C., Chellamani, K. (2013). Analysis of test items on difficulty level and discrimination index in the test for research in education. International Journal of Social Science & Interdisciplinary Research, (2), 189-193.Available at indianresearchjournals.com.
                                                                                                                Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English   language assessment. New York, NY: McGraw-        Hill.
                                                                                                                Brown, J. D. (2012). Classical test theory. In G. Fulcher and F.  Davidson (Eds.), The Routledge handbook of language testing, (pp.323-335). Routledge.
                                                                                                                Brown, J. D. (2013). Classical theory reliability. In A. Kunnan (Ed.), Companion to language assessment, Vol. 3. Hoboken, NJ: Wiley Blackwell.
                                                                                                                Bryson, M. (1974). Heavy-tailed distributions: Properties and tests.  Technometrics,6,61-68. http://dx.doi.org/10.1080/00401706.1974.10489150
                                                                                                                Bulut, O. (2015). Applying item response theory models to entrance examination for graduate studies: Practical issues and insights. Journal of measurement and evaluation in education and      psychology, 6(2): 313-330.
                                                                                                                Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows (Computer software).Lincolnwood, IL: Scientific Software  International.
                                                                                                                Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased  test items. Sage Publications.
                                                                                                                Carlson,J.E & D avier,M.V. (2013). Item response theory. Educational   Testing Service, Princeton, New Jersey.
                                                                                                                Cohen, A. D. (1980). Testing Language Ability in the Classroom.  Rowley, Mass: Newbury House Publishers.
                                                                                                                Crocker, L. and Algina, J. (1986). Introduction to Classical and Modern   Test Theory.  Harcourt, New York.
                                                                                                                Deville, C., & Chalhoub-Deville, M.  (1993). Modified scoring,  traditional item analysis, and  Sato’s caution index used to investigate the reading recall protocol.  Language Testing,  (10), 117-132.
                                                                                                                Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Lawrence Erlbaum Associates Publishers.
                                                                                                                Edelen, M.O. & Reeve, B.B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research 16(Suppl 1), 5-18.
                                                                                                                Fallahian, E. & Tabatabaei, O. (2015). Construct validity of MSRT reading comprehension   module in Iranian context. English Language Teaching, 8(9), 173-186.
                                                                                                                Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/response statistics, Educational and Psychological Measurement, 58(3), 357- 381.
                                                                                                                Farhady, H. & Hedayati, H. (2009). Language assessment policy in   Iran.  Annual Review of   Applied Linguistics (29), 132-141.
                                                                                                                Farhady, H. Jafarpur, A. and Birjandi, P.(1994). Language skills   testing from theory   to practice. Tehran: SAMT Publications.
                                                                                                                Frey, B. B. (Ed.). (2018).The sage encyclopedia of educational     research, measurement, and   evaluation. Sage Publications.
                                                                                                                Geranpayeh, A. (1994) Are Score Comparisons across Language Proficiency Test Batteries Justified? An IELTS-TOEFL Comparability Study, Edinburgh Working Papers in Applied  Linguistics, 5: 50-65.
                                                                                                                Gilbert, S. & Newtton, W. J.(1997). Principles of educational and psychological measurement and evaluation. Wadsworth:  The   University of California.
                                                                                                                Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
                                                                                                                Green, R. (2019).  Item analysis in language assessment. In V. Aryadoost, & M. Raquel (Eds.). Quantitative data analysis for     language assessment volume I: Fundamental techniques.  Routledge.
                                                                                                                Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.).  Mahwah, NJ: Lawrence Erlbaum.
                                                                                                                Haladyna, T. M. (2016). Item analysis for selected response items. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of Test Development (2nd ed), (pp. 392–407).   New York, NY: Routledge.
                                                                                                                Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.
                                                                                                                Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York, NY: Routledge.
                                                                                                                Hambleton, R. K. & Swaminathan, H. (1985). Item response      theory: Principles and applications. Boston: Kluwer Academic      Publishers.
                                                                                                                Henning, G. (1984). Advantages of latent trait measurement in       language testing. Language Testing (1), 123–133.
                                                                                                                Henning, G. (1987). A guide to language testing: Development,     evaluation, research.   Cambridge: Newbury House Publishers.
                                                                                                                Henning, G., Hudson, T. and Turner, J. (1985). Item response theory and the assumption of   unidimensionality for language tests. Language Testing (2), 141–154.
                                                                                                                Janssen, G., Meier, V., Trace, J. (2014). Classical test theory and item response theory: Two understandings of one high-stakes      performance exam. Colombian Applied. Linguistics Journal. 16 (2),    167–184.
                                                                                                                Kane, M. (1992). An argument-based approach to validity.     Psychological Bulletin, (112), 527-535.
                                                                                                                Kiani,G.R. & Haghighi, M.(2006). The investigation of the TMU English proficiency test: Reliability related issues. Journal of Humanities (16), 55-73. 
                                                                                                                Kline, T.J.B. (2005). Psychological Testing: A Practical Approach to Design and Evaluation. Sage Publications.
                                                                                                                Kohli, N., Koran, J. & Henn. L. (2015). Relationships among classical test theory and item response theory frameworks via factor analytic models. Educational and Psychological    Measurement, 75(3), 389-405.
                                                                                                                Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
                                                                                                                Linacre, J. M. (2015). A user’s guide to WINSTEPS MINISTEP Rasch-model computer   programs. Chicago, IL: Winsteps.com.
                                                                                                                Loe, A. (2021). Intro to IRT. Available at https:// aidenloe.github.io.
                                                                                                                Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
                                                                                                                Malec,W. & KrzemiĆska-Adamek, A. (2020). A practical comparison of selected methods of evaluating multiple-choice options through classical item analysis.  Practical Assessment, Research, and Evaluation: Vol.25, Article 7. Retrieved from https://scholarworks.umass.edu/pare/vol25/iss1/7
                                                                                                                Mehrens, W. A., & Lehmann, I.J. (1991). Measurement and evaluation in education and psychology (4th ed). Belmont, CA: Wadsworth.Thomson Learning.
                                                                                                                Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement, (3rd ed. pp.  13-103). New York: American Council of Education and Macmillan Publishing Company. 
                                                                                                                Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics: Application of item response theory models in personality research. In R.W. Robins, R.C. Fraley, & R.F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 407-423). New York: Guilford.
                                                                                                                Moses, T. (2017). A review of developments and applications in item analysis. In R. Bennett & M. von Davier (Eds.), Advancing human assessment. The methodological, psychological and policy contributions of ETS (pp. 19–46). Springer Open. http://dx.doi.org/10.1007/978-3-319- 58689-2_2.
                                                                                                                Mousavi, A. (2009).An encyclopedic dictionary of language testing.   Rahnama Press, Tehran.
                                                                                                                Nguyen, T. H., Han, H. R., Kim, M.T. & Chan, K.S.(2014).An introduction to item response theory for patient-reported outcome measurement. Patient, (7), 23-35. Springer. https://doi.org/10.1007/s40271-013-0041-0
                                                                                                                Noori, M. & Hosseini Zadeh, S. (2017). The English Proficiency Test of the Iranian Ministry of Science, Research, and Technology: A  Review. International Journal of English Language & Translation Studies. 5(3). 21-26.
                                                                                                                Osterlind, S.J. (1983). Test item bias. Beverly Hills: Sage Publications.
                                                                                                                Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed response, performance, and other formats (2nd ed.). Boston, MA: Kluwer Academic.
                                                                                                                Ockey, G.J. (2012). Item response theory. In G. Fulcher and F.  Davidson (Eds.), The Routledge handbook of language testing,      (pp.336-345). Routledge.
                                                                                                                Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educational leaders. Boston, MA: Allyn & Bacon.
                                                                                                                Rizopoulos, D. (2018). ltm:  An R package  for latent trait models under IRT. Retrieved from   https://github.com/drizopoulos/ltm.
                                                                                                                Robitzsch, A. (2019). 
sirt: Supplementary item response theory models. R package version 3.4-64. Retrieved from 
https://CRAN.R-project.org/package=sirt
                                                                                                                 Sahrai, R. & Mamagani , H. (2013). The assessment of the reliability and validity of the MSRT proficiency test. The Educational Assessment Journal, 10(3), 1-19 [In Persian].
                                                                                                                Salehi, M. (2011). On the construct validity of the reading section of the University of Tehran     English Proficiency Test. Journal of English Language Teaching and Learning, (222), 129-159.
                                                                                                                Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), The companion to language assessment. Vol. 3. Hoboken, NJ: Wiley Blackwell.
                                                                                                                Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of   research in education, 19 (pp. 405-450). Washington, DC: American Educational Research Association.
                                                                                                                Traub, R. E. (1997). Classical test theory in historical perspective.   Educational Measurement Issues and Practice (16), 8–14.
                                                                                                                Tsutakawa, R. k, & Johnson, J.C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, (55), 371-390.
                                                                                                                Wiersma, W., & Jurs, S. (1990). Educational measurement and testing.  Needham Heights,   MA: Allyn and Bacon.
                                                                                                                                                                                                                                Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL:  MESA Press.
                                                                                                                Yu, C. H. (2010). A simple guide to the item response theory (IRT) and Rasch modeling.  Retrieved from http://www.creative-wisdom.com.
                                                                                                                Zimowski, M., Muraki, E., Mislevy, R. J., Bock, R. D. (2002). BILOG-MG [Computer software]. Lincolnwood, IL: Scientific Software International.