by Lester Long Jr., MS, CPC, CSS, CADC, CPS
The SAT is arguably the single most important test for American high school students. Every year more than two million young people take this standardized multiple-choice test, and most four-year colleges and universities use the results to evaluate applicants from more than 20,000 disparate U.S. high schools. Considering the growing importance of this test, it is no wonder there is continual concern about its reliability. Though the SAT is the barometer by which most colleges and universities determine their admissions, the question remains: is it truly an accurate assessment of an individual's ability or potential?
Validity and Reliability and Their Effect on Cultural and Racial Minorities
The SAT was first administered in 1926, and the original plan was formulated by Robert Yerkes, Henry T. Moore, and Carl C. Brigham. A conscious effort was made to develop an instrument that would measure neither school achievement nor general mental alertness. With the passage of time, however, resemblance to its prototype lessened: sub-tests became fewer, speededness was reduced, and emphasis was placed upon two relatively homogeneous item types, verbal and mathematical. Currently the two sub-scores are referred to as the SAT-V and SAT-M (Educational Testing Service, n.d.). These test scores must reflect reliability and validity (Buchmann, Condron, and Roscigno, 2010).
Looking at the tremendous reliance on this national standardized testing process, it is no wonder that so many people question the reliability and validity of this test. With over two million young adults taking the test, and most four-year colleges and universities using the results to evaluate applicants from more than 20,000 disparate U.S. high schools, its reliability and validity must also meet a national standard. As part of a large-scale project to remodel the SAT, a study was conducted in 1987 by the "New Possibilities Project," directed towards making the SAT more educationally relevant to all segments of the test-taking population and towards providing more meaningful information to colleges, high schools, counselors, and students. To this end, both the verbal and mathematical sections of the SAT were to be expanded (Hale, et al., 1992). In addition, a test of writing ability, including an essay component, was to be incorporated into the companion Achievement Tests, now known as the SAT II.
Over the years there has been much controversy over the heavy reliance on SAT scores and their effect on the admissions process for colleges and universities. The issue seems to stem from how admissions personnel tend to turn this test, which was intended only as an indicator of ability, into an absolute decision maker for university admissions, at times leaving out other important aspects of students' qualities and abilities. This has been particularly true as it relates to cultural and ethnic minorities. Rothstein (n.d.) points out that vociferous debate has emerged regarding the fairness of the SAT and the extent to which it should be used in the college admissions process. Over-reliance on SAT scores in college admissions has broad and clear-cut implications for issues of merit and diversity in the educational sorting and credentialing process (Rothstein, n.d.). No less profound, especially for the question of merit, is the likelihood that access to and use of test preparation vary by students' family background.
Validity research becomes extremely important here, because such research must take into account that family background correlates with college performance just as the SAT does. The extent to which the SAT's predictive validity derives from its correlation with family background is crucial to the interpretation of predictive models. It is crucial to keep in mind, however, that predictive validity has little to do with causation. Evidence for the predictive validity of SAT scores, for example, cannot be interpreted as evidence that SAT scores have a causal effect on student performance. Rather, predictive validity studies are often interpreted as evidence for statements such as "The SAT has proven to be an important predictor of success in college… SAT scores add significantly to prediction" (Camara and Echternacht, 2000, p. 1). This has a profound impact on score interpretation and the placement process.
Although the SAT was not designed to be a placement instrument, it has become a ready tool in the hands of placement officials. This can lead to unfortunate consequences if those determining placement have misinterpreted the scores. Validity models that use the unadjusted SAT score with demographic controls overstate the direct contribution of individual SAT scores to prediction, attributing to them substantial variation that is better attributed to readily observed school characteristics (Rothstein, n.d.). However, the strongest argument for using SAT scores in college admissions is that they help admissions offices predict students' eventual performance in college.
In theory, the educational movement that initiated standardized testing for purposes of college admissions originally held the promise of identifying students of merit from diverse social-class and ethnic backgrounds who otherwise would not have been considered for admission into the nation's select colleges. In practice, however, this early promise has not been fulfilled, especially for minority groups whose mean test performance has departed significantly from that of White mainstream test takers. Over the past several decades, the search for a more equitable ethnic representation in our nation's select colleges led to the adoption of affirmative action; however, this policy has come under attack. The higher education system must therefore continue to develop ways to make the SAT more reliable and valid for both the minority and the majority in our society.
As the SAT's long history suggests, its standardization sample and norms have changed over the years (Stickler, 2007). The test that preceded it was standardized on a sample of 978 predominantly upper- and upper-middle-class White Anglo-Saxon males from New England; while this sounds hopelessly biased, Stickler points out that it did accurately reflect the population of college students at the time. The original SAT was not standardized until 1941; prior to that, the members of each year's cohort were compared only to each other. By 1941, multiple yearly administrations of the SAT became necessary to accommodate more college applicants, creating the problem of non-equivalence between applicants who took the test with more or less able peers during a single year (Dorans, 2002). Thus, the College Board decided that all students who completed the April 1941 SAT would serve as the standardization sample for the verbal section, and those who took the April 1942 SAT would constitute the mathematics standardization sample; these reference groups were not updated until April 1995.
Although the 1941 and 1942 standardization group of 10,000 test takers consisted primarily of self-selected, privileged White males applying to prestigious New England colleges (the College Board's membership base at the time), changes in the population of test takers over the next few decades had to be weighed against the entrenched utility of the test. Stickler (2007) points out that scores, especially verbal scores, declined from the original means of 500, and that population variance increased from the 1940s standard deviation of 100, as the population of college students exploded in the 1950s and 1960s; but SAT users were hesitant to revamp a test that had already been carefully established (Dorans, 2002). By 1990, the mean SAT verbal score was 425 and the mean mathematics score was 475, a substantial drift that indicated meaningful population changes and a lack of alignment between the mathematics and verbal scales (Dorans, 2002).
Finally, the College Board recentered SAT mathematics and verbal scoring, using the 1990 cohort of 1,052,000 college-bound seniors as its new reference group. Dorans (2002) points out that recentering the SAT meant both collecting data from a new standardization sample and rescaling the distribution of scores to reset the mean to 500 and the standard deviation to 100 for both sections of the test. These students' average verbal score of 424 was statistically equated to 500, as was the average of 476 on the mathematics section. By measuring the entire population of test takers in 1990, the College Board accurately captured increases in the numbers and in the racial, class, and gender diversity of college applicants in the new SAT norms.
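The core idea of recentering, resetting a reference group's mean to 500 and standard deviation to 100, can be sketched as a simple linear rescaling. This is only an illustration of the principle: the College Board's actual procedure used more elaborate equating methods, and the reference-group standard deviation of 100 used below is an assumed value, not a figure from the sources cited above.

```python
def recenter(raw_score, ref_mean, ref_sd, new_mean=500.0, new_sd=100.0):
    """Map a score from the reference group's observed distribution
    onto the recentered scale (mean 500, SD 100 by default)."""
    return new_mean + new_sd * (raw_score - ref_mean) / ref_sd

# Illustrative check: a verbal score at the 1990 cohort mean of 424
# maps to exactly 500 on the recentered scale.
print(recenter(424, ref_mean=424, ref_sd=100))  # 500.0
# A score one (assumed) standard deviation above the old mean maps to 600.
print(recenter(524, ref_mean=424, ref_sd=100))  # 600.0
```

The same transformation, applied with the mathematics reference mean of 476, would carry that section's average to 500 as well.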
Measurement Abilities and Test Cluster Scoring
In measurement ability, technically, the SAT may be regarded as nearly perfect, possibly reaching the pinnacle of the current state of the art of psychometrics (Educational Testing Service, n.d.). Actually, it would be surprising if this were not the case. Ever since the SAT was first administered in 1926, highly competent professional staffs have been available at all times to prepare new forms, guided by objective findings on past administrations and on item analysis of experimental material. Like many developments in measurement, the SAT is a direct descendant of the Army Alpha.
Test cluster scores reported for the SAT have been based on content specifications or item type, and such scores do not typically provide great insight into whether an examinee will correctly answer a particular test item (Ewing, Huff, Andrews, and King, 2005). This is because, to meet test guidelines, the content specifications for a particular domain are written to cover a range of difficulty. As a result, there are usually some easy, medium, and difficult items within each domain. An examinee of average ability would be expected to correctly answer the easy items and most of the medium-difficulty items across all content domains. In this situation, feedback based solely on content domains or item type would suggest to the student that he or she needs improvement in all areas, which is not very informative.
In connection with the new SAT that was introduced in March 2005, research has been under way to investigate the feasibility of providing examinees with score reports that contain feedback on skills measured by the critical reading, mathematics, and writing sections of the test. Furthermore, recipients of educational score reports generally welcome the idea of receiving more descriptive feedback about examinee performance than is provided by a total score or a percentile rank indicating overall performance (Ewing, Huff, Andrews, and King, 2005). This is not surprising, as descriptive score reports have the potential to aid score users in the development of student-based instructional plans and/or suggest areas for classroom-based instructional intervention.
Is the SAT Overemphasized?
The idea that students should be judged on their ability is closely associated with the SAT (Geiser, 2009). This notion has captivated American college admissions since the test was first introduced, but how well has it done at predicting student performance in college? Geiser (2009) points out that while the older College Board tests measured knowledge of college-preparatory subjects, the SAT, introduced in 1926, purported to measure a student's capacity to learn. He goes on to point out that this idea dovetailed perfectly with the meritocratic ethos of American college admissions. However, over the years, has it really been able to predict students' college performance? In a recent article published by the College Board, this body pointed out that the SAT has proven to be an important predictor of success in college (Geiser and Studley, 2002). It has proven over the years to be a useful tool in assisting admissions counselors in determining placement and admissions. Though not originally developed to assess achievement or mental alertness, it has become the catalyst of American university testing. However, calling the overemphasis on SAT scores "the educational equivalent of a nuclear arms race," University of California President Richard C. Atkinson recently proposed to abandon the use of the SAT I in the university's admissions process. He proposes using only standardized tests that assess mastery of specific subject areas rather than undefined notions of aptitude or intelligence (Geiser and Studley, 2002, p. 2).
Evaluating the above arguments brings us back to the age-old question: what would be the best psychometric instrument for determining college performance, one based on aptitude and intelligence or one based on achievement? When it comes to aptitude versus achievement in determining college admission, Geiser (2009) points out that the SAT II has been a better predictor of student performance. He maintains that admissions criteria that tap mastery of curriculum content, such as high school grades and achievement tests, are more valid indicators of how students are likely to perform in college. He also points out that admissions criteria that emphasize demonstrated achievement over potential ability are better aligned with the needs of disadvantaged students and schools. Data gathered by the University of California (UC) showed that the SAT I has more of an adverse impact on poor and minority applicants than traditional measures of academic achievement.
In a study examining the relative contribution of high school grade point average (HSGPA) and SAT I and SAT II scores in predicting college success, Geiser and Studley (2002) sampled 77,893 first-time freshmen who entered UC over a four-year period from Fall 1996 through Fall 1999. The only students excluded were those missing SAT scores or high school GPAs; students who did not complete their freshman year and/or did not have a GPA recorded in the UC corporate database; freshmen at UC Santa Cruz, which did not assign conventional grades at the time; and freshmen entering UC Riverside in 1997 and 1998, during which years the campus data upload into the Corporate Student System had extensive missing data.
The HSGPA used in this analysis was an honors-weighted GPA, with extra grade points for honors-level courses; it was uncapped and could exceed 4.0. The SAT I score represented a composite of students' scores on the verbal and mathematics tests, whereas the SAT II score was a composite of three achievement tests (writing, mathematics, and a third test of the student's choosing), which are used to determine eligibility for admission at UC. Results showed that freshman grades were highly correlated with cumulative GPA and that the SAT coupled with GPA is a much better indicator of college success. This would indeed indicate that the SAT alone is not the most reliable indicator of college success. It may in fact be overemphasized.
Cultural, Statistical and Economic Bias
Freedle (2003) posits that the SAT has been shown to be culturally and statistically biased against African-Americans, Hispanic Americans, and Asian Americans. He argues that the R-SAT, which scores only the "hard" items on the test, is shown to reduce the mean-score difference between African-American and White SAT test takers by one-third. Further, the R-SAT shows an increase in SAT verbal scores of as much as 200 to 300 points for individual minority test-takers.
Freedle (2003) points out that a standardized test is culturally biased when one group (typically a minority population) performs consistently lower than some reference population, typically the White population. He adds that a test is statistically biased if two individuals (e.g., one African-American and one White) get the same test score but nevertheless perform differently on some criterion external to the test, such as school grades.
Economic status has played a major role in how many applicants have been able to perform on the SAT. Buchmann, Condron, and Roscigno (2010) point out that there is a theoretical construct known as "shadow education" (p. 1). Associated with higher-income families, and most often used in comparative education research, it refers to educational activities, such as tutoring and extra classes, occurring outside the formal channels of an educational system and designed to improve a student's chances of successfully moving through the allocation process. Buchmann, Condron, and Roscigno (2010) point out that there is a large and significant disparity in SAT scores by family income and parental education. Asians score about 35 points higher than Whites, while African-Americans score about 40 points lower. They posit that this is due to wealthier families having the income to hire tutors and others to provide applicants with assistance outside of the formal educational system.
Evidence of Reliability
Despite changes in the population of test takers and in education over the last 75 years, the SAT has proven highly reliable. In fact, reliability studies have yielded such consistent results that researchers focus instead on examining the criterion-related, incremental, and construct validity of the SAT (Burton, 2001). As researchers have found good evidence of the SAT's validity, it can be inferred that the SAT has met the statistical prerequisite of reliability.
But how reliable is the SAT? The College Board website reports that a student's true score is within 30 points of his or her measured score (SEM = 30; College Board, 2005). A test with a standard deviation of 100 and an SEM of 30 has an internal consistency coefficient of .91 (α = .91). In a study specifically investigating SAT I score changes upon repeated testing, the 1,120,563 students in the 1997 college-bound cohort who took the SAT one to five times in their junior and senior years gained an average of 7 to 13 points on the verbal section and 8 to 16 points on the mathematics section (Nathan and Camara, 1998). Thus, a student who retested at the higher end of both ranges still would not breach the standard error of measurement, indicating high test-retest reliability.
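The .91 figure quoted above follows directly from the classical test theory relation SEM = SD·√(1 − r), which can be inverted to recover the reliability coefficient from the reported standard deviation and standard error of measurement. A small Python check (the function names here are illustrative, not from any cited source):

```python
import math

def reliability_from_sem(sem, sd):
    """Classical test theory: SEM = sd * sqrt(1 - r), so r = 1 - (sem/sd)**2."""
    return 1 - (sem / sd) ** 2

def sem_from_reliability(r, sd):
    """Inverse relation, for going the other direction."""
    return sd * math.sqrt(1 - r)

# The College Board's reported values: SD = 100, SEM = 30.
r = reliability_from_sem(sem=30, sd=100)
print(round(r, 2))  # 0.91, matching the coefficient cited above
```

The same relation explains why the 7-to-16-point retest gains cited above stay comfortably inside a 30-point error band.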
Reliability data reported in February 1963 showed that the median correlation between SAT-V and SAT-M for 14 forms of the SAT administered in the period 1959-62 was .64; about ten years earlier, the average correlation for a comparable three-year period had been .54. Test-retest reliability coefficients were unavailable at the time, but a study by Richard S. Levine, in which the verbal and mathematics aptitude scores of the Scholarship Qualifying Test were correlated with the two SAT scores, found high correlations: .85 between SAT-V and SQT-V, and .81 between SAT-M and SQT-M.
In an effort to gather more data on the reliability and internal consistency of the SAT, the College Board contracted with a number of experts to conduct a study of the skills measured in the critical reading, mathematics, and writing sections of the test. The parties brought in for this study were content specialists, measurement experts, and cognitive psychologists. These experts administered two forms of the SAT to 500 high school juniors from 17 participating schools, and two kinds of scores were computed: (1) internal consistency and (2) alternate-form reliability. Alternate-form reliability for each skill was estimated with Pearson product-moment correlations within skills across forms; that is, the raw number-correct score on one form was correlated with the corresponding raw number-correct score on the other form. Alternate-form reliability estimates range from zero to one, with higher values indicating greater reliability. This aspect of the study concluded that students performed about the same on both forms and that most skills exhibited acceptable alternate-form reliability.
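The alternate-form estimate described above is nothing more than a Pearson product-moment correlation between the two forms' raw number-correct scores. A minimal, self-contained sketch follows; the score lists are invented for illustration and are not data from the study:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical raw number-correct scores for the same seven students
# on form 1 and form 2 of a skill cluster.
form1 = [12, 15, 9, 20, 17, 11, 14]
form2 = [13, 14, 10, 19, 18, 10, 15]
print(round(pearson_r(form1, form2), 2))
```

A value near 1 indicates that students kept roughly the same rank order across forms, which is what "performed about the same" means operationally in the study's conclusion.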
Internal consistency estimates at the skill level and the total test level varied by SAT section (i.e., critical reading, mathematics, and writing), but not by form (i.e., form 1 versus form 2). Internal consistency was estimated separately for each skill on each form using the formula for Cronbach's alpha. The internal consistency estimates at the total test level were .93 for critical reading, .92 for mathematics, and .83 for writing. At the skill level, the internal consistency estimates from both forms ranged from .69 to .81 for mathematics, .68 to .81 for critical reading, and .40 to .67 for writing. Findings showed that, with the exception of writing skills, most skills exhibited acceptable internal consistency.
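Cronbach's alpha, the statistic behind these estimates, compares the sum of the individual item variances with the variance of examinees' total scores: α = k/(k−1) · (1 − Σσ²ᵢ/σ²ₜ). A self-contained sketch on invented 0/1 item-response data (not data from the study):

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding that item's
    score for every examinee (columns = examinees)."""
    k = len(item_scores)
    sum_item_var = sum(variance(item) for item in item_scores)
    totals = [sum(col) for col in zip(*item_scores)]  # per-examinee totals
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Hypothetical right/wrong responses: 4 items x 6 examinees.
items = [
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
]
print(round(cronbach_alpha(items), 2))
```

When items covary strongly, as they do in this toy set, total-score variance dwarfs the summed item variances and alpha approaches 1; the low .40 to .67 range reported for writing skills reflects weaker covariation among those items.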
Differential Validity of the SAT-V and SAT-M
From the beginning, the SAT has been found to have reasonably good validity for predicting college achievement. It has also been found consistently that the SAT increases the validity of high school average or rank. Burton (2001) points out that studies made in 1927 showed a median validity of school records of .52, a median validity of the SAT of .34, and a median validity of the combination of both of .55. For schools of engineering, science, business, education, and liberal arts, the high school record is more valid than the SAT, but the SAT supplements the high school record with additional valid variance. Except in engineering, the SAT-V is generally more predictive than the SAT-M. How well do the SAT-V and the SAT-M correlate?
Research on test content over the years has shown that it is possible to construct sub-tests that measure these two academically important aptitudes. Tests conducted by Indiana University yielded validity coefficients of .54 and .41 for the SAT-V and SAT-M, respectively. Publications by the College Board show that the SAT has validity for predicting freshman grade point averages (validity coefficients usually range between .30 and .55) and that the achievement tests have acceptable validity for predicting grades in appropriate subjects (coefficients also usually range between .30 and .55) (Educational Testing Service, n.d.).
In the category of predictive validity studies, Burton's (2001) meta-analysis of studies of classes graduating between 1980 and 2000 found that combined SAT scores accurately predicted many measures of success in college. Additionally, Burton confirmed that high school GPA and SAT scores were consistent and equally precise predictors of college success for women, African-American students, and differently-abled students; although SAT scores under-predicted these groups' college GPAs, high school grades under-predicted their performance to the same extent.
Additional predictive validity studies focused on the SAT's ability to predict college success for specific populations. Ting (2000) found that SAT mathematics scores and students' realistic self-appraisal contributed most significantly to freshman GPA for Asian American students. SAT mathematics scores may have been useful predictors whereas verbal scores were not because Asian Americans tend to value applied science more highly than the humanities and to pursue technical degrees. In a study of the factors that enhance SAT prediction for African-American students, Fleming (2002) found that the SAT best predicted undergraduate GPA for black males at traditionally black colleges.
When African-Americans were studied in relation to attendance at black colleges, the predictive validity of the SAT for African-American students rose from .46 to .57, suggesting that the non-academic adjustment factors affecting black students at white colleges, rather than test bias, account for the SAT's under-prediction of minority students' GPAs (Fleming, 2002).
Other researchers have established the SAT's construct validity by studying the extent to which SAT scores vary according to theoretical predictions. For example, Everson and Millsap (2004) found that educational opportunities and experiences both inside and outside the classroom correlated with SAT scores; the fact that students who are better educated perform better on the SAT supports the claim that the test measures learned knowledge and skills rather than wealth or prowess with the multiple-choice format. Additionally, latent variable models indicated that educational experience inside the classroom, through extracurricular activities, and at home moderated the relationship between socioeconomic status and SAT achievement.
Over the years there has been much controversy over the relevance of the SAT. Since its inception it has been the single most important tool for college and university admissions. This has caused feelings to run deep concerning its dominance in the lives of so many individuals. But why is this? Research has shown that it is indeed a reliable tool for predicting student success and performance. It has been proven to have validity in terms of relevance. But the issue seems to center on whether this should be the primary factor in determining the future of young individuals who are attempting to meet standards for which they were not prepared.
The College Board has modified versions of the test but never its basic standards, and this has been a source of review for many decades. This is why the test is not without its detractors, some of whom regard it as a tool of the academic establishment. No attempt is made to adjust SAT scores for sex, socioeconomic status, race, or educational background, and in this writer's opinion none should be: such attempts would render the test ineffective. However, predictability is much more substantial when high school grades are also used to assist in placing the right student in the right program.
The SAT's statistical properties are remarkable, yet the test and its makers are frequently criticized for the SAT's biases. These biases, however, are not inherent in the test but result from the ways in which the test is used and from biases in American culture. It is not the SAT's fault that people track negligible differences in aggregated SAT scores as if they determined a municipality's or an ethnic group's worth. Nor is it the fault of the SAT that students who attend more crowded and less academically rigorous schools (and who often are members of minority groups) do not perform well on the test. There must be some changes to help our young people, but they must first start in the home.
References

Buchmann, C., Condron, D., and Roscigno, V. (2010). Shadow education, American style: Test preparation, the SAT and college enrollment. Social Forces, 88(2), 436-462.
Burton, N. (2001). Predicting success in college: SAT studies of classes graduating since 1980 (College Board Research Report No. 2001-2). Princeton, NJ: Educational Testing Service.
Camara, W. J. and Echternacht, G. (2000). The SAT I and high school grades: Utility in predicting success in college (College Board Research Notes RN-10). Princeton, NJ: Educational Testing Service.
College Board (2005). SAT FAQ. Retrieved June 31, 2011 from http://www.collegeboard.com/highered/ra/sat/sat.scorefaq.html.
Dorans, N. (2002). The recentering of SAT scores and its effects on score distributions and score interpretations (College Board Research Report No. 2002-11). Princeton, NJ: Educational Testing Service.
Educational Testing Service (n.d.). College Board Scholastic Aptitude Test. Retrieved June 31, 2011 from http://search.ebscohost.com.library.capella.edu/loginaspx.
Educational Testing Service (n.d.). College Entrance Examination Board Admission Testing Program. Retrieved June 31, 2011 from http://ezproxy.library.capella.edu/login.
Everson, H. T. and Millsap, R. E. (2004). Beyond individual differences: Exploring school effects on SAT scores. Educational Psychologist, 39, 157-172.
Ewing, M., Huff, K., Andrews, M. and King, K. (2005). Assessing the reliability of skills measured by the SAT. Office of Research and Analysis, Research Notes, 24, 1-8. New York: College Board.
Fleming, J. (2002). Who will succeed in college? When the SAT predicts black students' performance. The Review of Higher Education, 25, 281-296.
Freedle, R. O. (2003). Correcting the SAT's ethnic and social-class bias: A method for re-estimating SAT scoring. Harvard Educational Review, 73(1), 1-43.
Geiser, S. (2009). Back to basics: In defense of achievement (and achievement tests) in college admissions. Change: The Magazine of Higher Learning, 41(1), 16-23.
Geiser, S. and Studley, R. (2002). UC and the SAT: Predictive validity and differential impact of the SAT I and SAT II at the University of California. Educational Assessment, 8(1), 1-26.
Hale, G. A., et al. (1992). A comparison of the predictive validity of the current SAT and an experimental prototype (Research Report 92-32, pp. 1-61). Princeton, NJ: Educational Testing Service.
Nathan, J. S. and Camara, W. J. (1998). Score changes when retaking the SAT I: Reasoning Test (College Board Research Report No. 1998-05). Princeton, NJ: Educational Testing Service.
Rothstein, J. (n.d.). SAT scoring, high school, and collegiate performance predictions. Retrieved June 31, 2011 from http://wwwjrothat/princeton.edu.
Stickler, L. (2007). A critical review of the SAT: Menace or mild-mannered measure? TCNJ Journal of Student Scholarship, 9.
Ting, S. R. (2000). Predicting Asian Americans' academic performance in the first year of college: An approach combining SAT scores and non-cognitive variables. Journal of College Student Development, 41, 442-449.