Document Type: Research Paper


English Language & Literature Department, University of Tabriz, Tabriz, Iran.


This paper reports an investigation of native-language-based differential item functioning (DIF) across the subtests of the Iranian Undergraduate University Entrance Special English Exam (IUUESEE). A total of 14,172 foreign-language test takers from four native-language groups (Azeri, Persian, Kurdish, and Luri) were selected for the study. Uniform DIF (UDIF) and non-uniform DIF (NUDIF) analyses were conducted on data from the four versions of the IUUESEE. After the unidimensionality and local independence of the data had been established, the DIF analyses showed that Luri test takers were advantaged relative to the other native-language groups across the subtests. The NUDIF analysis revealed that almost all subtests functioned in favor of low-ability test takers, who were not expected to outperform high-ability test takers. A probable explanation for this native-language and ability DIF is that Luri and low-ability test takers were more likely to venture lucky guesses. Thoughtless errors and guessing, test-wiseness, overconfidence, stem length, unappealing distractors, and time were proposed as possible causes of DIF in the IUUESEE. The reading subtest was also found to contain the largest number of items with significant DIF.
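The distinction between uniform and non-uniform DIF can be illustrated with a small sketch. The abstract does not state how the DIF indices were computed, so the Mantel-Haenszel common odds ratio is used here purely as a familiar stand-in, not as the study's method. The function names, the "ref"/"focal" labels, and all counts are hypothetical toy values. In the toy item, the focal group is favored in the low-ability stratum and the reference group in the high-ability stratum, so the pooled (uniform) index shows no DIF even though each stratum does.

```python
import math
from collections import defaultdict


def mh_odds_ratio(records):
    """Mantel-Haenszel common odds ratio for a single item.

    records: iterable of (group, stratum, correct), where group is
    "ref" or "focal", stratum is an ability/score level used for
    matching, and correct is 0 or 1. Values above 1 favor the
    reference group; values below 1 favor the focal group.
    """
    # counts[stratum] = [ref_correct, ref_wrong, focal_correct, focal_wrong]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, correct in records:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        counts[stratum][idx] += 1

    num = den = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined for these counts")
    return num / den


def mh_d_dif(alpha):
    """Convert the odds ratio to the ETS delta scale (MH D-DIF)."""
    return -2.35 * math.log(alpha)


def expand(stratum, ref_right, ref_wrong, focal_right, focal_wrong):
    """Turn one stratum's 2x2 counts into individual response records."""
    return (
        [("ref", stratum, 1)] * ref_right
        + [("ref", stratum, 0)] * ref_wrong
        + [("focal", stratum, 1)] * focal_right
        + [("focal", stratum, 0)] * focal_wrong
    )


# Hypothetical toy counts for one item (not data from the study).
low = expand("low", 20, 30, 30, 20)    # focal group does better
high = expand("high", 40, 10, 30, 20)  # reference group does better

pooled = mh_odds_ratio(low + high)
print("pooled odds ratio:", pooled)
print("pooled MH D-DIF:", mh_d_dif(pooled))
print("low-ability odds ratio:", mh_odds_ratio(low))
print("high-ability odds ratio:", mh_odds_ratio(high))
```

Because the two strata pull in opposite directions, a pooled index alone would miss this item, which is why a separate non-uniform analysis by ability level is informative.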


