Seminar in Psychometrics

The Seminar of the COMputational PSychometrics group features local and visiting scholars presenting current research on computational aspects of psychometrics. The talks are approximately 60 minutes long, followed by a discussion. In the spring semester, the seminar is held jointly as course NMST571 at Charles University, which usually takes place on Tuesdays from 3:40 PM CET. The seminar is co-hosted by Patrícia Martinková and Jiří Lukavský. If you want to participate and/or be added to the mailing list, please send an e-mail to

Future sessions

December 20, 2022 (2:00 PM CET). David Kaplan (University of Wisconsin – Madison): Probabilistic Forecasting with International Large-Scale Assessments: Applications to the UN Sustainable Development Goals

Note. Plenary room (room 318, second floor), Institute of Computer Science, Pod Vodárenskou věží 2, Prague 8, also on Zoom.

Abstract. In 2015, the Member States of the United Nations (UN) adopted the Sustainable Development Goals. With regard to education, the UN identified equitable, high-quality education, including the achievement of literacy and numeracy by all youth and a substantial proportion of adults, both men and women, as one of its global SDGs to be attained by 2030. To analyze education policies such as these, it is critically important to monitor trends in educational outcomes over time. Indeed, as educational systems around the world face new challenges due to the COVID-19 pandemic, monitoring trends in educational outcomes could help identify the long-run impact of this unprecedented health crisis on global education. To this end, international large-scale assessment programs such as PISA are uniquely situated to provide population-level trend data on literacy and numeracy outcomes. The purpose of this talk is to describe a new project in collaboration with the University of Heidelberg and funded by the US Institute of Education Sciences. This project proposes a methodology applicable to international large-scale assessments, and PISA in particular, to monitor and forecast changes in gender equity and to relate changes over time in gender equity to policy-relevant predictors and exogenous events such as the coronavirus pandemic. We utilize a Bayesian workflow to account for uncertainty in all steps in the modeling process, including uncertainty in the parameters of the model as well as model uncertainty in the choice of policy-relevant predictors. A proof-of-concept using data from the United States NAEP program provides a demonstration of the ideas.

Kaplan, D., & Huang, M. (2021). Bayesian probabilistic forecasting with large-scale educational trend data: A case study using NAEP. Large-scale Assessments in Education, 9(1), 1-31.
Kaplan, D., & Jude, N. (2021). Trend analysis with international large-scale assessments: Past practice, current issues, and future directions. In International Handbook of Comparative Large-Scale Studies in Education: Perspectives, Methods and Findings (pp. 1-14). Cham: Springer International Publishing.

Past sessions

September 27, 2022 (3:40 PM CET). Gabriel Wallin (London School of Economics and Political Science): A Flexible IRT Framework For Latent DIF Detection

Note. Plenary room (room 318, second floor), Institute of Computer Science, Pod Vodárenskou věží 2, Prague 8, also on Zoom.

Abstract. The measurement validity of instruments like a questionnaire or a test is established by ascertaining that it is measurement invariant across the items. For this purpose, it is standard procedure to assess the presence of differential item functioning (DIF), which evaluates measurement invariance at the item level. When DIF detection is based not on manifest groups but on latent groups, the problem is typically referred to as latent DIF detection, which will be the focus of this talk. To that end, I will present a flexible modeling framework that combines a general latent factor model with a latent class model to capture both normal response behavior under no DIF and deviant behavior due to DIF. In the model, a sparse DIF effect parameter is introduced that is allowed to vary between the latent classes identified by the model. Each item response distribution is consequently modeled as a function of a latent variable measuring the underlying construct of the questionnaire or test, and of group membership. No prior knowledge of DIF-free items is required; instead, they are identified through an L_1 penalty on the DIF effect parameter in the marginal likelihood function. An EM algorithm for model estimation is proposed, where the maximization step is carried out using a quasi-Newton proximal algorithm. Results based on both simulated and empirical data, together with theoretical results, will be presented.
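
Illustration (not part of the talk). The way an L_1 penalty can single out DIF-free items may be easier to see in a small sketch. Proximal methods commonly handle an L_1 term through the soft-thresholding operator, which sets small values exactly to zero. The function name, the penalty weight, and the DIF effect values below are hypothetical and are not taken from the speaker's implementation.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical unpenalized DIF effect estimates for five items
dif_effects = [0.05, -0.8, 0.02, 1.1, -0.1]

# Apply the shrinkage step with an assumed penalty weight of 0.25
penalized = [soft_threshold(d, lam=0.25) for d in dif_effects]

# Items whose penalized effect is exactly zero are treated as DIF-free
dif_free_items = [i for i, d in enumerate(penalized) if d == 0.0]
```

In this toy setting, a larger penalty weight would classify more items as DIF-free; in practice the weight would be chosen by a data-driven criterion.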

May 10, 2022 (3:40 PM CET). Irini Moustaki (London School of Economics and Political Science): Detection of two-way outliers in multivariate data and application to cheating detection in educational tests

Note. K4 at MFF UK, Sokolovská 83, Prague 8, also on Zoom.

Abstract. In the talk we will discuss a latent variable model for the simultaneous (two-way) detection of outlying individuals and items for item-response-type data. The proposed model is a synergy between a factor model for binary responses and continuous response times that captures normal item response behaviour and a latent class model that captures the outlying individuals and items. Covariates are also added to enhance the classification power of the model. A statistical decision framework is developed under the proposed model that provides compound decision rules for controlling local false discovery/nondiscovery rates of outlier detection. Statistical inference is carried out under a Bayesian framework for which a Markov chain Monte Carlo algorithm is developed. The proposed method is applied to the detection of cheating in educational tests, due to item leakage, using a case study of a computer-based nonadaptive licensure assessment. The performance of the proposed method is evaluated by simulation studies.

Yunxiao Chen, Yan Lu, & Irini Moustaki. Detection of two-way outliers in multivariate data and application to cheating detection in educational tests. Annals of Applied Statistics (In press). arXiv preprint 1911.09408

May 3, 2022 (3:40 PM CET). Yves Rosseel (Ghent University): The structural-after-measurement (SAM) approach to structural equation modeling

Note. On Zoom and in K4 at MFF UK.

Abstract. In structural equation modeling (SEM), the measurement and structural parts of the model are usually estimated simultaneously. In this presentation, I will revisit the long-standing idea that we should first estimate the measurement part, and then estimate the structural part. We call this the 'Structural-After-Measurement' (SAM) approach to SEM. I will describe a formal framework for the SAM approach under settings where the latent variables and their indicators are continuous. I will also discuss earlier SAM methods and establish how they are specific instances of the SAM framework. Simulation results will be presented showing several advantages of the SAM approach: 1) estimates exhibit smaller finite sample biases under correctly specified models, 2) estimation routines are less vulnerable to convergence issues in small samples, and 3) estimates are more robust against local model misspecifications. The SAM framework includes two-step corrected standard errors, and permits computing both local and global fit measures. Finally, for a large class of models, non-iterative estimators can be used in both stages.

Rosseel, Y. & Loh, W. W. (2021). A structural-after-measurement (SAM) approach to SEM. OSF preprint

April 19, 2022 (3:40 PM CET). David Magis (IQVIA Belux): Computerized adaptive and multistage testing: overview, challenges and applications

Note. Remotely on Zoom, projected to K4 at MFF UK.

Abstract. Computerized adaptive testing (CAT) and multistage testing (MST) are two closely connected fields of theory and applications of psychometrics. They are both a source of intense scientific research and wide areas of applications for educational measurement and assessment. Though conceptually simple (the core of CAT and MST can be explained in a few sentences), they require a strong underlying measurement theory, an accurate algorithmic process and a suitable platform for test administration and evaluation. During this talk, I will (a) introduce the concepts and aspects of CAT and MST; (b) highlight assets, drawbacks and challenges; (c) give an overview of the current resources for CAT and MST deployment. Some live demonstrations of CAT and MST using the R software will also be presented.

Magis, D., Yan, D., & von Davier, A. A. (2017). Computerized adaptive and multistage testing with R (using packages catR and mstR). UseR! series. New York: Springer. doi:10.1007/978-3-319-69218-0

March 31, 2022 (2:30 PM CET). Michela Battauz (University of Udine): Equating and DIF detection in the IRT framework

Note. Remotely on Zoom.

Abstract. Differential Item Functioning (DIF) occurs when the probability of a positive response for people at the same ability level varies in different groups of individuals. The detection of DIF is very important since it constitutes a violation of the invariance assumption of Item Response Theory (IRT) models. One approach for the detection of DIF is based on the comparison of the item parameter estimates obtained in different groups of the population. However, when the item parameters are estimated separately, they are expressed on different measurement scales, due to identifiability issues. So, it is first necessary to convert the item parameter estimates to a common scale. This transformation involves two unknown constants, called equating coefficients, which are estimated from the data. The item parameter estimates converted to a common metric are then compared through a statistical test. The test implemented in the R package equateIRT takes into account the variability introduced by the estimation of the equating coefficients, thus improving the properties of the test. In this talk, the methods for the estimation of the equating coefficients will be reviewed and a test for the detection of DIF will be presented. The methods will be illustrated using the equateIRT package.
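
Illustration (not part of the talk). The conversion to a common scale described in the abstract can be sketched with the classical mean-sigma method for a two-parameter logistic model, where discriminations are divided by the first equating coefficient and difficulties are transformed linearly. This is a minimal sketch in Python with hypothetical item parameter values. It does not reproduce the equateIRT package, and it ignores the estimation variability that the test discussed in the talk accounts for.

```python
from statistics import mean, pstdev


def mean_sigma(b_from, b_to):
    """Mean-sigma equating coefficients (A, B) from the difficulty
    estimates of the same items obtained in two separate calibrations."""
    coef_a = pstdev(b_to) / pstdev(b_from)
    coef_b = mean(b_to) - coef_a * mean(b_from)
    return coef_a, coef_b


def to_common_scale(a, b, coef_a, coef_b):
    """Convert a 2PL discrimination a and difficulty b to the target scale."""
    return a / coef_a, coef_a * b + coef_b


# Hypothetical difficulty estimates of three items in two groups
b_group1 = [-1.0, 0.0, 1.0]
b_group2 = [-0.7, 0.5, 1.7]

coef_a, coef_b = mean_sigma(b_group1, b_group2)

# Express one item's group-1 estimates (a = 1.5, b = 1.0) on the group-2 scale
a_conv, b_conv = to_common_scale(1.5, 1.0, coef_a, coef_b)
```

Once the estimates are on a common metric, the converted values from one group can be compared with the estimates from the other group, which is the step where a formal DIF test is applied.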

Battauz, M. (2019). On Wald tests for differential item functioning detection. Statistical Methods & Applications, 28(1), 103-118. doi:10.1007/s10260-018-00442-w
Battauz, M. (2015). equateIRT: An R package for IRT test equating. Journal of Statistical Software, 68(7), 1-22. doi:10.18637/jss.v068.i07

March 8, 2022 (3:40 PM CET). Dakota Cintron (UC San Francisco): A Latent Dirichlet Allocation Model of Action Patterns

Note. Remotely on Zoom, projected to K4, MFF UK.

Abstract. Action pattern data are process data often recorded in a computer-based large-scale testing setting and extracted from log files. The action pattern data portray different actions that test takers use to solve a given item. This research uses unsupervised and supervised latent Dirichlet allocation (LDA) topic modeling on action pattern data from a large-scale assessment. Topic modeling, which includes the LDA model, is a machine learning framework to rapidly discover latent topics from large quantities of open-ended qualitative textual data quantitatively. In this research, action pattern data from a large-scale assessment are treated as qualitative textual data to be analyzed with LDA. These latent topics amount to thematic annotations of a collection of documents referred to as a corpus. For the qualitative action pattern data, the LDA model treats documents (here a student’s set of action patterns on an item) as being represented by a random mixture over latent topics where a distribution over words represents each latent topic. As the results of this study demonstrate, the latent topics derived from the action pattern data can provide helpful insight into different cognitive processes and key actions that lead to item success or failure. For instance, this research provides evidence of classes of problem-solving strategies derived from topic distributions of action pattern data and how these strategies are predictive of item success or failure.

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. doi:10.1145/2133806.2133826
Tang, X., et al. (2020). Latent feature extraction for process data via multidimensional scaling. Psychometrika, 85(2), 378-397. doi:10.1007/s11336-020-09708-3
Cintron, D. W., & Montrosse-Moorhead, B. (2021). Integrating Big Data Into Evaluation: R Code for Topic Identification and Modeling. American Journal of Evaluation. doi:10.1177/10982140211031640

January 27, 2022 (1:30 PM CET). Ai Ye (UNC Chapel Hill): Path and Directionality Discovery in Individual Dynamic Models: A Regularized Structural Equation Modeling Approach

Note. Jointly held as a seminar of ISCB - ČR, remotely on Zoom.

Abstract. Recent decades have witnessed a surge of psychological and neurological research at an individual level. One goal in such endeavors is to construct person-specific dynamic assessments using time series data. Within the psychometric field, researchers have developed psychometric modeling frameworks to estimate time series models. However, these methods are often limited in their dynamic representations as well as in their model selection regimes. My dissertation research aims to evaluate (Chapter I), reconcile (Chapter II), and address (Chapter III) the limitations in current practices. In this talk, I will focus on Chapter II, where I proposed a novel modeling approach that uses regularization under the unified Structural Equation Modeling (uSEM) framework to estimate a more flexible model, called regularized hybrid uSEM. My simulation study has shown that the proposed approach is more reliable and accurate than alternative methods in recovering hybrid types of dynamic relations and in eliminating spurious ones. The present work, to my knowledge, is the first application of the recent regularized SEM to the estimation of a type of time series SEM, which points to a promising future for statistical learning in psychometric models.

Gates, K. M., & Molenaar, P. C. M. (2012). Group search algorithm recovers effective connectivity maps for individuals in homogeneous and heterogeneous samples. NeuroImage, 63, 310–319. doi:10.1016/j.neuroimage.2012.06.026
Epskamp, S., Waldorp, L. J., Mõttus, R., & Borsboom, D. (2018). The Gaussian Graphical Model in Cross-Sectional and Time-Series Data. Multivariate Behavioral Research, 53, 1–28. doi:10.1080/00273171.2018.1454823
Ye, A., Gates, K. M., Henry, T. R., & Luo, L. (2021). Path and Directionality Discovery in Individual Dynamic Models: A Regularized Unified Structural Equation Modeling Approach for Hybrid Vector Autoregression. Psychometrika, 86, 404–441. doi:10.1007/s11336-021-09753-6

May 11, 2021. Ed Merkle (University of Missouri): Recent progress on Bayesian structural equation models

Abstract. The talk will be about research and developments surrounding the R package blavaan. Specific topics include strategies for speeding up model estimation, methods for computing model information criteria, and extensions to complex models. I will try to discuss the research in the context of open science and reproducibility, which has been a theme of the software development. I will also provide some demonstrations along the way to illustrate the functionality of blavaan.

May 4, 2021. Jiří Lukavský (Institute of Psychology CAS & Charles University): Bayesian psychometrics
April 27, 2021. Gabriel Wallin (Université Côte d'Azur & Inria): Equating nonequivalent test groups using propensity scores

Abstract. For standardized assessment tests, scores from different test administrations are comparable only after the statistical process of equating. In this talk I will discuss equating of test scores when the test groups differ in their ability distributions. The equating procedures, constructed to adjust scores only for differences in the difficulty level of the test forms, thus risk also adjusting for the ability differences. The gold standard for this situation is to utilize a set of common items in the equating procedure. However, not all testing programs have common items available. This presentation considers this setting. In the absence of common items, background information about the test-takers will be summarized in a scalar function known as the propensity score, and the test forms will be equated with respect to this score. This method will be demonstrated using both empirical and simulated data.

April 20, 2021. Marie Wiberg (Umeå University): How to evaluate different equating methods

Abstract. Test score equating is used to make scores from one scale comparable with the scores from another scale. There are a large number of equating methods available, depending on how data are collected and what assumptions are made. The talk starts with a brief overview of available equating methods. As many equating methods have been developed for different situations and different tests, we need tools to evaluate and compare the different equating transformations. Many methods and measures have been proposed to evaluate an equating transformation. In general they can be divided into two groups: equating-specific measures and statistical measures. In this talk I will discuss several methods and illustrate them with some examples in R.

April 13, 2021. Michela Battauz (University of Udine): Item Response Theory Equating Methods for Multiple Forms

Abstract. Many testing programs use multiple forms of a test to deal with the security issues connected to test disclosure. However, since each form is composed of different items, the test scores are not comparable. To overcome this issue, it is possible to apply the statistical procedure of equating. This talk focuses on Item Response Theory (IRT) equating methods for the common-item nonequivalent group design. Under this design, the forms have a set of items in common and they are administered to different groups of examinees. The equating process consists of the conversion of the item parameter estimates to a common scale using a linear transformation, and the determination of comparable test scores. The coefficients of this linear function are called equating coefficients. Although many testing programs use several forms of a test, the equating methods proposed in the literature mainly consider only two test forms. In this talk, the equating methods for two test forms will be reviewed and some newer methods for equating multiple test forms will be presented. The methods will be illustrated using the R packages equateIRT and equateMultiple.

March 30, 2021. František Bartoš (University of Amsterdam & ICS CAS): Robust Bayesian meta-analysis: A framework for addressing publication bias with model-averaging

Abstract. Publication bias poses a significant threat to meta-analyses, the gold standard of evidence. To alleviate the problem, a variety of publication bias adjustment methods have been suggested. However, it is nearly impossible to select the correct method when the data generating process is unknown, which is usually the case, since no existing method performs well in a wide range of conditions. To address this issue, we developed a Robust Bayesian meta-analysis (RoBMA) framework. RoBMA allows us to combine different publication bias adjustment models in a coherent Bayesian way. Apart from obtaining the model-averaged estimates, RoBMA provides Bayes factor tests for presence or absence of the meta-analytic effect, heterogeneity, and publication bias. In this talk, I provide a conceptual introduction to Bayesian model-averaging in the context of meta-analyses, illustrate the RoBMA framework on an applied example, and demonstrate the performance of the method on real and simulated datasets.
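
Illustration (not part of the talk). The general idea of Bayesian model-averaging mentioned in the abstract can be sketched as follows: each candidate model receives a posterior probability proportional to its prior probability times its marginal likelihood, and the model-averaged estimate weights each model's estimate by that probability. The function names and all numbers below are hypothetical. This is a generic sketch, not the RoBMA implementation.

```python
def posterior_model_probs(marginal_likelihoods, prior_probs):
    """Posterior model probabilities via Bayes' rule over a set of models."""
    weights = [m * p for m, p in zip(marginal_likelihoods, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def model_average(estimates, post_probs):
    """Weight each model's estimate by its posterior model probability."""
    return sum(e * p for e, p in zip(estimates, post_probs))


# Hypothetical marginal likelihoods of three candidate models
marginal_likelihoods = [0.02, 0.06, 0.02]
# Equal prior model probabilities (an assumption of this sketch)
prior_probs = [1 / 3, 1 / 3, 1 / 3]
# Hypothetical effect size estimates; the third model assumes no effect
estimates = [0.40, 0.25, 0.0]

post_probs = posterior_model_probs(marginal_likelihoods, prior_probs)
averaged_effect = model_average(estimates, post_probs)
```

Because the weights come from how well each model predicts the data, no single adjustment method has to be selected in advance.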

March 23, 2021. Hynek Cígler (Masaryk University): Reliability models in classical test and latent trait theories
March 16, 2021. Patrícia Martinková (ICS CAS & Charles University): Computational aspects of reliability estimation