How Do We Measure Intelligence?
A Background on Intelligence Tests
The earliest recorded intelligence test was devised by the Chinese emperor Ta Yu around 2200 BC. It was used to examine government officials and served as the basis for promoting or dismissing them. Today, intelligence testing falls within the scope of psychometrics, a branch of Psychology specialized in psychological testing. Psychometrists create, administer, and interpret tests and their results. They typically hold a master's degree in Psychology and have undergone extensive coursework in testing. They work in educational, business, and clinical settings.
Intelligence testing began to take concrete form in the late 19th century, when Sir Francis Galton, an English psychologist considered to be the father of mental tests, demonstrated that individuals systematically differ across the then-recognized key components of intelligence (sensory, perceptual, and motor processes). Although the tests he designed yielded very few significant findings, they nevertheless raised important questions about intelligence - questions that remain central to the refinement of intelligence testing even today. These questions revolve around the methods of measurement, the components of intelligence, and its heritability.
Popular Intelligence Tests
Intelligence tests are classified along two dimensions - aptitude/achievement and individual/group. Aptitude tests measure the testee's potential for development, serving as a tool for predicting future performance. Achievement tests, on the other hand, measure mastery of a specific domain, and thus evaluate current performance. Abstract reasoning tests are mainly aptitude tests, while quarterly examinations in schools are mainly achievement tests. Intelligence tests also differ in how they are administered, whether individually or in groups. Individual tests take into account the interaction between tester and testee, and are thus more customized and personal. Group tests, on the other hand, are more economical, saving time, money, and effort. Because group tests are more superficial than individual tests, they serve only as a supplemental basis for special placements. For example, the legal requirements for placing children in special education include both group and individual tests, plus additional information gathered outside the testing situation. Intelligence tests today, however, are more complicated than these basic classifications suggest: they can be mainly aptitude type, mainly achievement type, or both, and can be administered individually, in groups, or both.
Stanford-Binet Test. This test is given to individuals from as young as 2 years old up to adulthood. It consists of various items requiring verbal and nonverbal responses; for example, a 6-year-old child may be asked to define a list of words and also to trace a path through a maze. The test attempts to measure cognitive processes such as memory, imagery, comprehension, and judgment, and it emphasizes the importance of considering age when administering the test and interpreting results. After results were collected from a large number of testees through the years, scores were found to approximate a normal distribution, or bell curve, in which most testees fall in the middle range (84-116), while only about 2% score above 132 and about 2% score below 68. The Stanford-Binet test began in 1904, when Alfred Binet was asked to devise a method for separating students who would benefit from regular classroom instruction from those who needed to be placed in special schools. With the help of his student Théodore Simon, Binet came up with 30 items and administered the test to 50 nonretarded children aged 3 to 11 years old. Based on the results, Binet was able to identify norms for mental age. For example, 6-year-old Simon scored 20, which is the average score of 9-year-old children; thus, his mental age is 9. In 1912, William Stern formulated the notion of the intelligence quotient (IQ), obtained by dividing the mental age by the chronological age and multiplying by 100. Based on this formula, he classified testees as "average", "above average", or "below average". Finally, in 1916, Lewis Terman revised the original test at Stanford University, hence the name. His revisions included detailing and clarifying the instructions, adding other content areas (such as short-term memory and verbal, quantitative, and abstract reasoning), applying the concept of IQ in interpreting results, calculating norms extensively, and identifying a general score for intelligence.
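Stern's ratio formula can be written out directly. The sketch below (in Python, a presentation choice not made in the text, with a function name of my own) applies it to the worked example above, in which a 6-year-old has a mental age of 9.

```python
# A minimal sketch of Stern's ratio IQ, using only the formula and the
# worked example given in the text (mental age 9, chronological age 6).
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """IQ = (mental age / chronological age) * 100, as described by Stern (1912)."""
    return mental_age / chronological_age * 100

print(ratio_iq(9, 6))  # 150.0 - well above the middle band of 84-116 noted above
```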
Wechsler Scales. David Wechsler developed several scales to measure intelligence. The first was developed in 1939; today it is known as the Wechsler Adult Intelligence Scale - III (WAIS-III). Other scales he developed are the Wechsler Intelligence Scale for Children - III (WISC-III), for children and adolescents aged 6 to 16 years old, and the Wechsler Preschool and Primary Scale of Intelligence (WPPSI), for children aged 4 to 6.5 years old. All of Wechsler's scales are divided into 6 verbal subscales and 5 nonverbal subscales. Norms developed from the nonverbal subscales are markedly more representative than those derived from the largely abstract content of the Stanford-Binet test. The Wechsler scales are scored in two ways: a specific IQ score for each subscale and a general IQ score for the entire scale. Scoring is therefore more rigorous, and possibly more accurate, than in the Stanford-Binet test.
Army Alpha Test and Army Beta Test (1917). The Army tests were the first intelligence tests to be administered in groups. The Army Alpha test is a written exam, whereas the Army Beta test is a performance exam administered orally to illiterate recruits.
Scholastic Assessment Test (SAT). This test is taken by almost 1 million US high school seniors every year to gain admission to colleges and universities. The SAT is mostly an achievement test rather than an aptitude test, and is therefore often not viewed as an intelligence test. It can still be considered one, however, because it measures and scores the same domain abilities - verbal and mathematical proficiency - and, like the Stanford-Binet test, it was developed to predict school success. Although the SAT is administered in groups, its results are interpreted along with other individual measures, such as high school grades, the quality of the high school, recommendation letters, personal interviews, and special activities or circumstances that benefited or impeded the testee's intellectual ability (e.g., contests, projects, outside involvements). Although the SAT is widely used as an index for accepting high school students into colleges and universities, it has drawn many criticisms. First, research shows that short-term coaching alone can raise students' SAT scores by about 15 points. Second, SAT scores appear to show that males outscore females by 42 points on average; further review by the College Board shows that males tend to score 35 points higher on mathematical content and 8 points higher on verbal content than females. Yet according to the Educational Testing Service, the SAT aims to predict first-year grades, and actual data show that females tend to outperform males during the first year. This could mean that the SAT is not an equally valid predictor for male and female testees.
Criteria for Intelligence Tests
A good intelligence test must be valid, reliable, and standardized.
Validity refers to how well the test captures what it attempts to measure - for intelligence tests, "intelligence" itself. For example, a test measuring language proficiency cannot by itself be considered an intelligence test, because not all people proficient in a given language are "intelligent" in a broader sense. Similarly, a test measuring mathematical ability should not couch its instructions in cryptic English, or it ends up measuring verbal skill as well. Validity can be established in two ways. First, there should be a representative sample of items across the entire domain of intelligence (i.e., not just mathematical abilities, but verbal skills as well); this is where the Wechsler scales seem to fare better than the Stanford-Binet test. Second, the results should match an external criterion. Common external criteria are educational achievement, career success, and wealth; that is, intelligent people are often achievers, whether in school, work, or finances.
Reliability refers to the stability and consistency of the scores an intelligence test produces. For example, Peter took a random 50% sample of an intelligence test in his first year and answered 75% of the items correctly. Thereafter, Peter took the test year after year, and surprisingly the results were inconsistent: he correctly answered 90% of the items in his second year, 40% in his third year, and 60% in his fourth year. Meanwhile, Annie took the intelligence test every month of her first year, and her results made no sense either. When results vary significantly on every retake, the test loses its ability to predict what it attempts to measure.
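The text does not say how this consistency is quantified. One common approach (an assumption here, not something the author describes) is to correlate the scores from repeated administrations of the same test to the same people; a brief sketch with hypothetical numbers:

```python
# A minimal sketch of test-retest reliability: correlate scores from two
# administrations of the same test to the same group of testees. All numbers
# are hypothetical; the correlation technique itself is not from the text.
from statistics import correlation  # available in Python 3.10+

first_administration = [75, 62, 88, 54, 91]   # percent correct, five testees
second_administration = [78, 60, 85, 57, 93]  # scores barely move on retake

# A coefficient near 1.0 indicates stable, consistent (reliable) scores;
# results as erratic as Peter's would drive the coefficient toward 0.
print(round(correlation(first_administration, second_administration), 2))
```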
Standardization refers to the uniformity with which the test is administered and scored. An intelligence test does not consist only of test items; it includes the process by which the test is given and interpreted. For example, if the test requires an interview, all interviewers should ask the same questions in the same way. Perfect standardization is, of course, impossible, but the test should attempt to eliminate factors that can compromise its reliability.
Cultural Bias in Intelligence Tests
Intelligence tests have traditionally been biased toward the dominant culture. Early intelligence tests consistently showed that urban testees scored better than rural testees, that middle-income test-takers fared better than low-income test-takers, and that White Americans scored higher than African-Americans. For example, an early intelligence test asked children what would be the best thing to do upon finding a 3-year-old child alone in the street. The expected answer at the time was to call the police. This is where minority perceptions differ: most rural children have negative impressions of authority figures, including the police; most low-income children know where to find a security guard, but not a policeman; and African-American culture allows children as young as 3 years old to roam about and explore their environment alone, so most African-American children might not even understand what the question is about.
Intelligence tests today attempt to minimize cultural bias by administering the test to, and adjusting norms for, a larger and more representative sample of the population. Even so, intelligence tests continue to be biased toward the dominant group. Some test makers attempt to go beyond adjusting norms and actually modify or include test items drawn from minority groups' domains of expertise. These attempts make up the culture-fair tests - intelligence tests intended to be culturally unbiased. Two popular culture-fair tests are the Raven Progressive Matrices and the System of Multicultural Pluralistic Assessment (SOMPA). The Raven Progressive Matrices attempts to remove the language barrier and the cultural factors inherent in the language of a test by making the test entirely nonverbal; even so, results still show that educated testees consistently score better than illiterate ones. The SOMPA is considered the most comprehensive culture-fair test available today. It measures both verbal and nonverbal intelligence using the WISC-III; it takes into account the testee's social and economic background through a 1-hour interview with the parents; it factors in the testee's social adjustment in school through questionnaires administered to the parents; and it assesses the testee's physical health by means of a medical examination.
Culture-fair tests today seem to reveal that intelligence tests do not accurately capture the notion of intelligence; rather, they simply reflect the priorities of the dominant culture.
The Misuse and Abuse of Intelligence Tests
Intelligence tests attempt objectivity by ideally meeting certain criteria - validity, reliability, and standardization - but the classifications, categorizations, and placements born out of the test's norms reinforce the very bias the test attempted to eliminate in the first place. Anyone who has taken an IQ test knows how bad it feels to be sandwiched between those who fared better and those who scored worse; imagine how the lowest scorer must feel. No matter how interpreters attempt to objectify the meaning of the scores, we all have our own subjective understanding of how we did compared to others. And this has drastic repercussions. First, those who score lower feel down, get ridiculed, and feel even worse. Second, those who score higher feel elated, begin to ridicule others, and develop an exaggerated sense of themselves. A self-fulfilling prophecy kicks in, and the test, instead of being a form of measurement, becomes a variable that effects change in the test-takers. The basic example goes like this: people who scored low on the first test tend to become pessimistic and perform worse on the second test, whereas people who scored high on the first test tend to become optimistic and perform better on the second test.