Online psychology experiments: everything you need to know

Article by Ben Howell

Photo by rawpixel.com from Pexels

Achieve better data quality and statistical power, and reduce fraudulent behavior, in research surveys and psychology experiments by using proven techniques from the scientific literature.

Armed with the knowledge, strategies, methods and contingencies discussed in this article, researchers can boost the quality of their studies using the power of online experimentation and research.

This article explores the advantages, limitations, and other issues surrounding online experimentation and suggests an array of practical, actionable solutions to some tricky problems when conducting research online.


Executive summary

  • Conducting behavioral experiments online, rather than in the lab, can provide better external validity for two important reasons: the more ecologically valid context, and the more varied participant pool.
  • Experimental artefacts, such as the way participants perceive hypotheses and change their behavior because of observer effects, are still of concern in online experiments. However, research has shown that even revealing hypotheses and providing overt hints as to the researchers' preferred responses has little effect on participants’ responses. More investigation is still needed on demand characteristics in online studies.
  • Prescreen lying can be detected quite effectively and data quality improved by combining a) a low-probability screening question near the start of a study, and b) an instructional manipulation check within the study itself.
  • To ensure engagement and reduce random responses, instructional manipulation checks can be used. However, such checks may themselves act as a demand characteristic, prompting systematic thinking rather than natural reactions to stimuli. Such effects can lead to dramatically different experimental results, so the nature of the manipulation checks used should be considered with care.
  • Numerous studies show that participants from recruitment services produce very low rates of cheating, probably because of the associated high opportunity cost. Self-reporting of cheating is a useful aggregate measure of cheating behavior, and ‘commitment’-type questions have been shown to have the greatest impact in reducing rates of cheating. Context, motivation, setting, and opportunity are significant factors in cheating behavior that are not yet well understood.
  • Dropout rates can be reduced by making experiments more enjoyable, by offering feedback on performance and research findings, and by asking for personal information. However, offering feedback can compromise the accuracy of results where participants are interested in self-insight. Financial incentives have also been shown to reduce dropout rate. However, they also increase the chance of multiple submissions, random responses and prescreen lying. Further research is required to investigate the prevalence and effects of dropout in online studies.
  • Reaction time experiments are ideal to conduct online. They have proven to be very accurate when measuring variance in both within-subjects and between-subjects studies. Only in very rare cases is specialist timing hardware required, such as when millisecond timing (rather than variability) is the subject of research, or where short stimulus presentation cycles (less than 50ms) are essential.
  • Ethics, particularly issues concerning informed consent, debriefing, feedback and financial incentives, is perhaps the least well understood subject in online research. Although most studies conducted online present minimal risk, the ethical qualities of those studies should still be carefully considered before engaging participants.


Advantages of online psychology experiments

There are multiple advantages to conducting behavioral experiments online. First and foremost, online experiments free the researcher from the constraints of the lab, reducing staff time and cost, removing the need to schedule participants, and even removing the need for a lab and participant computer equipment altogether. Researchers can now run studies autonomously and concurrently, 24 hours a day, 7 days a week.

Online experimentation is one of the most promising avenues to advance replicable psychological science in the near future.

van Steenbergen & Bocanegra, 2015.

Automation via online experiments relieves the researcher from the role of interviewer, automates collection of data and delivers high statistical power by allowing recruitment of high numbers of participants from all over the world.

Ecological validity

Online experiments can usually be run on the devices that participants use in their own everyday lives. This means that online experiments can better reflect both the environmental and the technical conditions in the real world than experiments run in the lab. The higher ecological validity of online experiments leads to more confidence in the generalizability of findings across settings (Dandurand, Shultz, & Onishi, 2008; Logie & Maylor, 2009).

Population validity

Online experiments generally lead to wider variation in participant demographics, broader sampling (as participants self-enrol at their convenience) and larger sample sizes. In contrast, lab-based experiments often end up with a participant pool comprised of a small convenience sample (e.g., university students and/or local residents) with reduced demographic diversity. Researchers can therefore be more confident in the real-world generalizability of findings from online experiments than from lab-based experiments (Horswill & Coster, 2001).

Sample size and demographic diversity

Conducting experiments online allows access to vast numbers of participants from a wide variety of cultural, socio-economic and ethnic backgrounds from across the world. Specific populations can be targeted via online forums, newsgroups, social media, crowdsourcing and specialized participant recruitment services. Many participant recruitment services allow screening and customizable restrictions and eligibility criteria to provide accurate targeting, affording the researcher a homogeneous target population whilst maintaining heterogeneity in non-target demographics (Paolacci, Chandler, & Ipeirotis, 2010). However, it is important to keep in mind the higher chance of nonnaïveté when using participant recruitment services rather than recruiting participants by more direct means (Chandler, Mueller, & Paolacci, 2013).

Scalability

By running studies with an online experiment platform (e.g., Psychstudio, 2019), the researcher can sample multiple participants concurrently, 24/7, anywhere in the world. This frees the researcher from logistical constraints such as equipment, lab space and staff availability. Experiments may also be optimised to test one or a small number of trials or conditions across thousands of participants rather than a large number of trials across a small number of participants.

Fast results

Online experiments allow multiple participants to engage in the study concurrently without scheduling of lab equipment. Using online recruitment services, a researcher can gather results from hundreds of participants in as little as a few hours (Crump, McDonnell & Gureckis, 2013). This represents the potential to save days or weeks of testing time compared to running lab-based experiments.

Simplified replication

Online experiment platforms that encourage sharing and facilitate cloning of studies can greatly ease replication by fellow researchers. Lab-based experiments can suffer from homogeneity of both participants (i.e., lack of population validity) and setting, because of their highly constrained environment (i.e., lack of ecological validity) leading to a need to replicate projects across many labs (van Steenbergen & Bocanegra, 2015). Online experiments need not take those moderating factors into account because they facilitate increased variability in population and setting.

Voluntary participation

Both lab-based and online studies can suffer from a lack of truly voluntary participation because of situational demand. Students, for example, may feel pressure to participate for credit or credibility purposes, out of a sense of obligation, or because of various demand characteristics. Participants are also less likely to withdraw consent after reading a consent form, or to cease participation (thus withdrawing consent) partway through an experiment, when a researcher is present. The anonymity of online studies can help participants to exercise their free will, although this also leads to a greater dropout rate (Dandurand et al., 2008).


Limitations and how to address them

Along with the obvious advantages that come with conducting psychological experiments online, there are also various limitations. One potential disadvantage of online data collection is the experimenter’s lack of control over the setting in which participants provide their responses. This is important to consider in cases where noise, distraction, lighting and capability of computing equipment could affect the experimental outcomes.

In a web‐based, sexual behavior risk study using a rigorous response validation protocol, we identified 124 invalid responses out of 1,150 total (11% rejection). Nearly all of these (119) were due to repeat survey submissions from the same participants, and 65 of them came from a single participant.

Konstan, Simon Rosser, Ross, Stanton & Edwards, 2005.

Ethical issues, including difficulty in obtaining informed consent, providing feedback and debriefing, need careful consideration, and appropriate measures must be applied to counter fraudulent behavior, multiple submissions, self-selection and dropout.

Prescreen lies

With the rise in popularity of crowdsourcing and participant recruitment services, researchers can focus on specific traits, sub-populations and experiential factors when prescreening participants. Some participants, however, may deliberately enter false information in prescreening in order to gain access to a study when financial incentives are on offer. Evidence suggests that prescreen lying is not uncommon (Jones, House, & Gao, 2015) and that its incidence increases with the rarity of the population sought, because the small number of genuinely eligible participants means less competition for the financial reward on offer. Even minimal rates of prescreen fraud can have a significant effect on sampling error, particularly when fraudulent participants are truthful in the study proper (Chandler & Paolacci, 2017).

Teitcher et al. (2015) suggest a number of techniques to reduce the incidence of false prescreen responses:

  • State that participants will not be compensated if fraudulent behavior is detected.
  • Give compensation information only at the end of a study.
  • Inform participants of their eligibility for the study only after completion of a prescreening questionnaire (without revealing eligibility criteria).
  • Provide entry into a prize draw or lottery as an incentive instead of paying each participant.

Whilst these methods can deter fraudulent actors from taking part in a study, they may also deter legitimate participants from doing so. Other options for preventing fraudulent participation include repeating questions in prescreening (e.g., by including reverse-worded questions), using a recruitment service that performs prescreening and fraud detection, and using screening questions that would require research in order to provide a fraudulent answer (e.g., brand of eye-glasses frame and prescription strength), thus inducing unacceptable opportunity cost for participants.

In addition to prevention, Jones et al. (2015) suggest using the following techniques to help detect prescreening fraud:

  • Consistency check: Two or more questions that can reveal inconsistencies in participant responses. For example, a prescreening question that asks the age of the participant and another later in the study that asks for their year of birth.
  • Low-probability screening question: A question that contains response options that are likely false. Multiple choice questions that allow multiple selections can be used to estimate a range of fraud probabilities, where the probability increases with the number of response options selected. For example, Jones et al. (2015) posed a question asking about fruit purchases in the past year with the options of fresh muscadine grapes, fresh goji berries and fresh red currants. Since it is extremely unlikely that a participant would have actually purchased more than one of the options, the researchers applied a medium probability of fraud if respondents selected two of the three options, and a high probability if all three were selected.

Jones et al. (2015) tested these techniques with participants recruited from Amazon's Mechanical Turk. They showed that participants who selected the most options in the low-probability screening question performed more poorly in rationality and engagement metrics than participants who selected fewer options. Analysis suggested that some experienced respondents, in pursuit of financial reward, may have tried to maximize their chances of entering the study by selecting more options in prescreening.

The researchers concluded that the most effective strategy for detecting prescreen lying and improving data quality was to include a low-probability screening question at the start of a study and an instructional manipulation check within the study itself.
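
To make that combined strategy concrete, here is a minimal sketch (not taken from the studies cited) of how a low-probability screening item and an age/year-of-birth consistency check might be scored. The option labels, field names, tolerance and study year are all illustrative assumptions.

```python
from datetime import date

# Hypothetical low-probability options, modelled on the fruit-purchase example above.
LOW_PROB_OPTIONS = {"fresh muscadine grapes", "fresh goji berries", "fresh red currants"}

def low_probability_flag(selected: set) -> str:
    """Grade fraud risk from a low-probability multiple-selection item
    (after Jones, House, & Gao, 2015): the more unlikely options selected,
    the higher the assumed probability of prescreen fraud."""
    n = len(selected & LOW_PROB_OPTIONS)
    if n >= 3:
        return "high"
    if n == 2:
        return "medium"
    return "low"

def consistency_flag(prescreen_age: int, year_of_birth: int,
                     study_year: int = date.today().year,
                     tolerance: int = 1) -> bool:
    """Consistency check: does the reported age match the reported year of
    birth within a one-year tolerance (to allow for birthdays)? True = inconsistent."""
    implied_age = study_year - year_of_birth
    return abs(implied_age - prescreen_age) > tolerance

# Example: a respondent who ticked all three rare items and whose prescreen age
# does not match their later year-of-birth answer is flagged on both checks.
print(low_probability_flag({"fresh muscadine grapes", "fresh goji berries", "fresh red currants"}))
print(consistency_flag(prescreen_age=25, year_of_birth=1980, study_year=2019))
```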

Random responses

Online studies, particularly questionnaires, are vulnerable to random responses (Chandler & Paolacci, 2017; Clifford & Jerit, 2014), and the likelihood increases in line with financial rewards. In a language translation study with participants from a participant recruitment service, Callison-Burch (2009) discovered that responses from the most experienced participant were little better than chance, which was interpreted as a strong indication of random responding. It is speculated that low rates of cheating amongst participants from participant recruitment services (who are paid per task) are due to the opportunity cost of looking up answers (Clifford & Jerit, 2016). Plausibly, the opportunity cost hypothesis could be extended to cover random responses.

Another potential source of random responses is participation from non-humans, in the form of computer programs (aka bots or automated form fillers). Non-human participation is becoming more prevalent as the number of online studies offering financial incentives increases (Dennis, Goodson, & Pearson, 2018). These programs are plentiful, easy to use and freely available to download from the internet (Buchanan & Scofield, 2018). Even low numbers of random responses – whether from humans or bots – can lead to large distortions in results (Credé, 2010).

Dupuis, Meier, and Cuneo (2018) tested a number of techniques for detecting random response sets and recommend the following three indices based on ease of implementation and effectiveness:

  • Response coherence: Correlative index that indicates whether responses to a questionnaire are clear and understandable.
  • Mahalanobis distance: Outlier detection index that indicates the distance between one response set and the collection of all other response sets.
  • Person–total correlation: Correlative index computed as the correlation between an individual's item responses and the mean of the item responses across all other response sets.

However, as previously noted by Downs, Holbrook, Sheng, and Cranor (2010), the methods above are non-optimal when many participants supply random responses.
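
For readers who want to compute the outlier indices above, the following is a minimal NumPy sketch, assuming item responses are arranged in a participants × items matrix; the function names are mine, not from Dupuis et al. (2018).

```python
import numpy as np

def mahalanobis_distances(responses: np.ndarray) -> np.ndarray:
    """Distance of each response set from the centroid of all response sets.
    `responses` is an (n_participants, n_items) matrix of item scores."""
    centered = responses - responses.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(responses, rowvar=False))  # pseudo-inverse for stability
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

def person_total_correlations(responses: np.ndarray) -> np.ndarray:
    """Correlation between each participant's item responses and the mean
    item responses of all *other* participants (leave-one-out)."""
    n = responses.shape[0]
    totals = responses.sum(axis=0)
    out = np.empty(n)
    for i in range(n):
        others_mean = (totals - responses[i]) / (n - 1)
        out[i] = np.corrcoef(responses[i], others_mean)[0, 1]
    return out

# Unusually large distances or unusually low person-total correlations
# mark candidate random-response sets for closer inspection.
```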

The "honeypot" method is an effective approach for detecting automated form fillers in web pages. It is commonly used by forums and blogs for protection against spam bots. This method employs hidden form fields that are invisible to human participants but still visible to form fillers and bots (as they read the page code rather than the content rendered from that code). If such a field comes back non-empty, the submission was most likely made by a bot.
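
A server-side sketch of the honeypot check might look like the following; the hidden field name ("website") and the form-data structure are assumptions, and the corresponding input must be added to the experiment page and concealed from humans with CSS.

```python
def is_probable_bot(form_data: dict) -> bool:
    """Server-side half of the honeypot technique: the page contains a text
    input (here named 'website') hidden from human participants via CSS.
    Automated form fillers read the markup rather than the rendered page,
    so any non-empty value is treated as a likely bot submission."""
    return bool(form_data.get("website", "").strip())

# Example: flag (rather than silently drop) suspect submissions for review.
submission = {"age": "34", "consent": "yes", "website": "http://spam.example"}
if is_probable_bot(submission):
    print("Likely automated submission - exclude or review before analysis.")
```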

Prescreening using manipulation checks (Crump et al., 2013) and simple cognitive tasks such as CAPTCHA questions (e.g., von Ahn et al., 2003) can help detect and prevent random responders from participating in a study (Liu & Wronski, 2018). However, if random response participants are nonnaive to this strategy, they could simply focus on the prescreening checks before going on to randomly respond to the rest of the study. Including screening checks and tasks periodically throughout the study can help prevent this from occurring (Downs et al., 2010).

Author's note: CAPTCHA is considered an obsolete technology: it is annoying for users, it is hard or impossible for those with accessibility issues, and it is becoming much easier for bots to defeat as the underlying technology advances. Current best practice is to use reCAPTCHA instead.

Multiple submissions

Early studies suggested that multiple submissions by the same participant in online experiments were rare (e.g., Reips, 2002). However, later research shows that the problem is of much greater concern than first thought. In online experiments the possibility of multiple submissions can rise when there is entertainment value to participation (Ruppertsberg, Givaty, Van Veen, & Bülthoff, 2001), when the participant has a bias that is exercised by the experiment, or when some form of incentive is on offer, such as financial gain or entry in a prize draw (Konstan et al., 2005).

A study conducted by Bowen, Daniel, Williams and Baird (2008) concluded that participants offered financial incentives were six times more likely to submit multiple responses than those who were not offered compensation. A review by Teitcher et al. (2015) found the frequency of multiple submissions in five sexual health studies to be between 8% and 33%.

Konstan et al. (2005) suggest using the following identifiers to detect multiple submissions:

  • Duplicate IP address.
  • Duplicate email address.
  • Duplicate name.
  • Duplicate payment information.

Other options for prevention and detection include providing each participant with a unique login/password (Reips, 2002), providing each participant with a unique URL, and rejecting submissions that originate from the same geolocation (Kraut et al., 2004). The geolocation technique can be enhanced by comparing the difference between submission times. However, false positives can still occur using this technique, for example, when two family members sequentially complete the same study using the same device, or when students are participating in the study from a computer lab (Teitcher et al., 2015). Another option is to use participant recruitment services that provide participant trust metrics as eligibility criteria (Peer, Vosgerau, & Acquisti, 2013), multiple submission prevention and fraud detection.
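
As an illustration, a hedged sketch of duplicate detection over a batch of submissions follows; the record keys and the 30-minute geolocation window are invented for the example, and flagged records should be reviewed rather than discarded automatically, given the false positives noted above.

```python
from collections import defaultdict
from datetime import timedelta

def flag_possible_duplicates(submissions, time_window=timedelta(minutes=30)):
    """Flag submissions that share an identifier suggested by Konstan et al. (2005),
    here IP address or email, plus submissions from the same geolocation arriving
    within a short time window. Each submission is a dict with hypothetical keys
    'id', 'ip', 'email', 'geo' and 'submitted_at' (a datetime)."""
    flagged, seen = set(), defaultdict(set)
    for sub in submissions:
        for key in ("ip", "email"):
            value = sub.get(key)
            if value:
                if value in seen[key]:
                    flagged.add(sub["id"])
                seen[key].add(value)
    # Same geolocation within the time window: possibly the same person, but
    # remember the false positives discussed above (households, computer labs).
    by_geo = defaultdict(list)
    for sub in submissions:
        if sub.get("geo"):
            by_geo[sub["geo"]].append(sub)
    for subs in by_geo.values():
        subs.sort(key=lambda s: s["submitted_at"])
        for earlier, later in zip(subs, subs[1:]):
            if later["submitted_at"] - earlier["submitted_at"] <= time_window:
                flagged.add(later["id"])
    return flagged
```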

Cheating

Without supervision, there is more opportunity for participants to cheat. A participant might cheat by taking written notes in a prospective memory task, or by using web search to look up answers in a knowledge-based questionnaire. Some researchers suggest that cheating might not be a major concern in online research (Finley & Penningroth, 2015). However, the reality is a little more nuanced, as the propensity to cheat appears to depend more on factors such as context and motivation than simply on whether a study is conducted online or in the lab.

Clifford and Jerit (2014) conducted an experiment testing political knowledge of students using a between-subjects design with random allocation to either an online group or an in-lab group. Students in the online group performed significantly better in the knowledge-based tasks than students from the in-lab group. This result led the researchers to conclude that cheating was more prevalent in the online group.

In a subsequent study, Clifford and Jerit (2016) examined self-reported cheating in an online, between-groups experiment. Three groups were tested: a student group, a group from a participant recruitment service, and an online panel consisting of university campus and government staff (not discussed here because of its small sample size). Over four experiments, the researchers found that rates of self-reported cheating ranged from 4% to 7% for the recruitment service participants and from 24% to 41% for the student group. Participants who self-reported cheating were found to have spent much longer answering knowledge questions, which, together with the effect size observed, is indicative of having sought outside assistance.

Among the student samples, rates of self-reported cheating are relatively high, ranging from 24% to 41%

Clifford & Jerit, 2016.

With a control (no technique employed) self-reported cheating rate of 24% to 41% amongst the student group, Clifford and Jerit (2016) investigated the following techniques to reduce cheating rates:

  • Direct request: Telling participants not to cheat. For example, "Please give your best guess and do NOT use outside sources like the Internet to search for the correct answer."
  • Timers: Question timeouts that limit the time participants have to provide an answer. For example, "Please do NOT use outside sources like the Internet to search for the correct answer. You will have 30 seconds to answer each question."
  • Commitment: Asking the participant for a commitment not to cheat (requiring a yes or no response). For example, "It is important to us that you do NOT use outside sources like the Internet to search for the correct answer. Will you answer the following questions without help from outside sources?"

Of the methods listed above, direct request performed least well, resulting in a self-reported cheating rate of about 22%. Timers were significantly better, with a self-reported cheating rate of about 16%. However, the most effective method tested was the commitment technique, with a self-reported cheating rate of only 9.5%. Knowledge scores were significantly lower for all techniques used compared to the control. The recruitment service group, with a control self-reported cheating rate of 6.6%, showed a self-reported cheating rate of 1.5% for the commitment technique (direct request and timers were not tested), although no statistically significant decrease in knowledge scores was found. The low rates of cheating attributed to this group reinforce similar findings from other research (Berinsky, Huber, & Lenz, 2012; Chandler & Paolacci, 2017; Germine et al., 2012).

In sum, context, opportunity and motivation all contribute to different rates of cheating between groups and across settings. It is speculated that cheating is low in recruitment service participant pools because of the opportunity cost: participants are paid per experiment, so cheating is too costly. Students, on the other hand, face little opportunity cost and are more likely to be susceptible to self-deceptive enhancement in online settings, leading to higher rates of cheating. Providing participants with the ability to self-report cheating is recommended as a useful aggregate measure of cheating behavior (Clifford & Jerit, 2016).
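
If self-reports of cheating are collected, they can be summarized per recruitment group and cross-checked against question latencies, the pattern Clifford and Jerit observed. The sketch below is only illustrative: it uses pandas with invented column names and toy values.

```python
import pandas as pd

# Hypothetical data: one row per participant, with the recruitment group,
# whether they self-reported cheating, and their median time (in seconds)
# spent on knowledge questions.
df = pd.DataFrame({
    "group": ["student", "student", "recruitment_service", "recruitment_service"],
    "self_reported_cheating": [True, False, False, False],
    "median_rt_s": [41.2, 18.5, 16.9, 17.4],
})

# Aggregate self-reported cheating as a group-level measure, and compare
# question latencies of self-reported cheaters versus non-cheaters.
print(df.groupby("group")["self_reported_cheating"].mean())
print(df.groupby("self_reported_cheating")["median_rt_s"].median())
```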

Informed consent

Participants can be genuinely anonymous in online studies, which makes it difficult to establish whether or not they are truly informed. Even when online participation is not anonymous, it can be hard to ascertain that participants are who they say they are. Vulnerable groups, including children, do not always have the capacity to give consent. It is particularly difficult to confirm the age of participants who self-report their demographic information, and thus there can be ambiguity around whether someone really had the capacity to give informed consent (Kraut et al., 2004).

Self-selection

Self-selection is a problem for any study in which participation is voluntary and anonymous: participants may opt in because of personal interest in the topic, or for financial or other incentives. Bias may also be introduced by where, when and to whom a study is advertised (Khazaal et al., 2014).

Caution is needed in the interpretation of studies based on online surveys that used a self-selection recruitment procedure. Epidemiological evidence on the reduced representativeness of sample of online surveys is warranted.

Khazaal et al., 2014.

A less obvious issue is that of coverage bias, whereby there is a discrepancy between those who have access to the experiment and those who do not (e.g., the bias introduced between those with and without internet access). Bias may also be introduced when advertising on social media platforms because of the specific demographics of those who use such platforms and of those targeted by study advertisements on them (Fenner et al., 2012). The effect of self-selection bias on the recorded data tends to be hard to estimate for online experiments because of a lack of information about non-participants (Bethlehem, 2010). However, the multiple site entry technique, in which participants are recruited through several different channels and their point of entry is recorded and compared, is one method that can be used to gauge and reduce the bias potential of self-selection (Hiskey & Troop, 2002; Reips, 2002).
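
A rough sketch of how the multiple site entry technique might be implemented and summarized follows; the "source" query parameter and the channel labels are arbitrary choices, not a prescribed convention.

```python
from urllib.parse import urlparse, parse_qs
from collections import Counter

def referral_source(study_url: str, default: str = "unknown") -> str:
    """Extract a recruitment-source tag from the study URL, e.g.
    https://example.org/study?source=forum versus ...?source=social.
    The parameter name 'source' is an arbitrary choice."""
    params = parse_qs(urlparse(study_url).query)
    return params.get("source", [default])[0]

# Comparing key outcomes (completion rate, demographics, scores) across entry
# points gives a rough check on how strongly self-selection differs by where
# the study was advertised.
urls = [
    "https://example.org/study?source=forum",
    "https://example.org/study?source=social",
    "https://example.org/study?source=forum",
]
print(Counter(referral_source(u) for u in urls))
```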

Dropout

Due to the nature of people’s use of the web (including casual browsing, split attention, and anonymity), dropout rates tend to be much larger in online than in lab-based experiments. Studies with higher levels of difficulty and/or significant time commitments are prone to increased dropout rates (Dandurand et al., 2008). The likelihood of participant dropout may be reduced via a number of techniques. These include making experiments more visually appealing and more enjoyable (Crump et al., 2013), offering timely feedback on individual performance and overall research findings (Michalak & Szabo, 1998), and perhaps surprisingly, by asking for personal information (e.g., age, gender, address) before the experiment begins (Frick, Bächtiger, & Reips, 2001). Financial incentives have also been shown to reduce dropout rate (Crump et al., 2013) with meta-analysis showing such incentives can lead to improvements in completion rates of web surveys of up to 27% (Göritz, 2006).

However, financial incentives have also been shown to increase the chance of multiple submissions (Konstan et al., 2005), random responses and prescreen lying (Chandler & Paolacci, 2017). Similarly, offering feedback can compromise the accuracy of results where participants are interested in self-insight, because such participants become sensitive to their own responses (Clifford & Jerit, 2015). Dropout is also problematic in longitudinal studies. However, the literature is lacking in investigation of the prevalence and effects of dropout in online studies.
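
Where dropout cannot be avoided, one common general-purpose check (not specific to the studies cited here) is to test whether completion rates differ between conditions, since differential dropout can bias comparisons even when overall attrition looks acceptable. A minimal sketch using a chi-square test follows; the counts are invented.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are conditions, columns are completed vs dropped out.
#                  completed  dropped
counts = np.array([[180,       45],    # condition A
                   [150,       78]])   # condition B

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, p={p:.4f}")
# A small p-value suggests differential dropout between conditions, which is
# worth reporting alongside the overall dropout rate.
```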

Engagement and distraction

For lengthy and/or complex tasks, data loss attributable to participant distraction, forgetfulness and inattention may occur to a greater degree in online experiments than in lab-based experiments (Finley & Penningroth, 2015).

For the complex task of remembering to perform a prospective memory response during an ongoing task, data loss was significantly greater for online participants than for lab-based participants.

Finley & Penningroth, 2015.

In order to reduce cognitive effort, some participants may satisfice (Krosnick, 1991). For example, participants may skim questions or instructions, or not read them at all, leading to sub-optimal, best-guess or untruthful responses. Such responses may be harder to detect than those from participants who are clear outliers (Oppenheimer, Meyvis, & Davidenko, 2009).

A number of strategies can be employed to test engagement and comprehension. These include giving feedback on performance after each trial or block of trials, conducting manipulation and comprehension checks of instructions, and increasing incentives (e.g., financial). In some cases, comprehension checks of instructions are much more effective than increases in payment for improving data quality. Greater comprehension and engagement, and thus better data quality, can also be attained by employing manipulation checks within instructions and only allowing the experiment to progress after correct answers are given (Crump et al., 2013). Where comprehension is not likely to be a concern, such as in simple experiments and questionnaires, instructional manipulation checks combined with a priori exclusion criteria can be used to gauge participant engagement. For example, a researcher might deliberately include a survey question that contains its own answer, or provide an explicit instruction on how to answer the question (despite the premise of the question). If the participant still fails to answer the question correctly, they can be excluded from the experiment or their data can be discarded (Oppenheimer et al., 2009), an approach described by Crump et al. (2013) as an insidious gotcha question.
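
A sketch of such an a priori exclusion rule based on an instructional manipulation check is shown below; the item, the expected answer and the field names are hypothetical.

```python
# Hypothetical IMC item embedded in a questionnaire: the instruction text asks
# participants to ignore the question's premise and select a specific option.
IMC_EXPECTED = "none_of_the_above"

def passes_imc(response: dict) -> bool:
    """A priori exclusion rule: keep only participants who followed the
    explicit instruction embedded in the check item."""
    return response.get("imc_item") == IMC_EXPECTED

responses = [
    {"participant": "p01", "imc_item": "none_of_the_above"},
    {"participant": "p02", "imc_item": "strongly_agree"},
]
retained = [r for r in responses if passes_imc(r)]
excluded = [r["participant"] for r in responses if not passes_imc(r)]
print(f"retained {len(retained)}, excluded {excluded}")
```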

Use of instructional manipulation checks can also lead to a demand characteristic effect if participants feel they need to tread carefully (i.e. "It's a trap!"). This effect is manifested as participants engaging in systematic thinking rather than providing natural reactions to stimuli, because participants are more likely to analyze the purpose of a task after being subjected to instructional manipulation checks. Hauser and Schwarz (2015) showed this to be the case in a cognitive reflection test (e.g., Frederick, 2005). Task scores were higher and response times longer (leading to better performance) for participants whose instructional manipulation checks were conducted before the cognitive reflection test, compared to participants who did the instructional manipulation checks afterwards.

Supervised assessment

A possible source of problems in online studies is that there is no researcher present to clarify the task for participants who misunderstand or misinterpret the instructions, or who want to ask questions before they begin. These problems can be alleviated somewhat by pre-testing or by offering some form of two-way dialogue between researcher and participant for clarification (Michalak & Szabo, 1998). However, this may prove logistically difficult in an online setting.

Debriefing and feedback

Without a researcher present, debriefing in an online setting presents a challenge. Presentation of debriefing notes and feedback at the end of an experiment initially seems adequate, but these measures are only effective where the participant understands the debriefing and feedback, and has not been adversely affected by the experimental process. Unless the researcher specifically contacts the participant (or vice versa) as part of a routine follow-up, there is no way for the researcher to identify or act upon any signs of experiment-related distress, confusion or misunderstanding experienced by the participant. Furthermore, participants who abandon an online study before completion may never reach the debriefing and feedback stage. If the early abandonment is due to an adverse effect of participation, personalized debriefing and feedback are all the more important.

Some methods for providing debriefing notes and feedback are suggested by Nosek, Banaji, and Greenwald (2002):

  • Include an ever-present "Abandon study" button or link that can provide debriefing material and feedback when clicked.
  • Detect when a participant tries to close the browser tab and provide debriefing and feedback at that point.
  • Require participants to enter an email address before starting the experiment so that debriefing documentation and feedback can be sent to them.

Of the methods listed above, the Abandon study button is the most flexible: it provides the participant with an easy and recognized way to leave the study, it maintains participant anonymity, and it allows for personalization of feedback and follow-up questions to better ascertain the state of the participant for debriefing.


Demand characteristics

Experimental artefacts which cause participants to alter their behavior can change the results of an experiment. These artefacts can include subtle cues and subconscious influence by the experimenter (called the observer-expectancy effect), or instructions, procedures or trials that lead the participant to interpret the purpose or hypothesis of the experiment. Observer expectancy may occur even without the presence of the experimenter, for example, in the written form of the instructions, or the progression of the experiment itself. Demand characteristics may also be produced by rumor and hearsay about the experiment, especially in class groups. However, the likelihood of verbal and physical communication or influence is reduced when no experimenter is physically present.

It has been widely cited in the literature (e.g., Reips, 2002) that online studies decrease the likelihood of demand characteristics because of uniformity of execution compared to lab-based studies. However, to date there appears to be little or no evidence for this claim.

It is clear that dedicated studies of non-laboratory applications of demand characteristics have not produced a body of work that can contribute substantially to the wider study of research participation effects... Perhaps the time is ripe or overdue for genuinely multi-disciplinary studies of these phenomena.

McCambridge, de Bruin & Witton, 2012.

Experimental artefacts such as perceived hypotheses and behavioural change attributable to observer effects are still concerns in online studies (McCambridge et al., 2012). However, experiments that have tested such effects have shown little evidence of their impact (White, Strezhnev, Lucas, Kruszewska, & Huff, 2018), even when hypotheses were exposed and overt hints given as to the researchers' preferred responses (Mummolo & Peterson, 2018).

Despite the promising data thus far, there is still a large gap in the literature concerning demand characteristics in online studies.


Timing precision and accuracy

The precision of clocks in software-driven experiments, both in-lab and online, is frequently discussed. Clocks vary from machine to machine, and timing accuracy in experiments can be affected by both the hardware and software environments in which those experiments are run. A typical computer keyboard and mouse have a polling rate of 125 Hz (~8 ms per poll), and computer monitors can run as low as 60 Hz (~16.667 ms per frame) and as high as 240 Hz (~4.167 ms per frame). Additionally, there are many other hardware and software factors that can introduce inaccuracy and variability into stimulus presentation and response time measurements.
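
To put those numbers in context, here is a small sketch of the arithmetic: converting polling and refresh rates to per-event intervals, and rounding an intended stimulus duration to whole display frames (the rates mirror the examples above).

```python
def interval_ms(rate_hz: float) -> float:
    """Interval between successive frames or input polls, in milliseconds."""
    return 1000.0 / rate_hz

def achievable_duration_ms(desired_ms: float, refresh_hz: float = 60.0) -> float:
    """Stimulus durations are quantized to whole display frames, so an
    intended duration is effectively rounded to a multiple of the frame time."""
    frame = interval_ms(refresh_hz)
    return round(desired_ms / frame) * frame

print(f"{interval_ms(125):.1f} ms per poll at 125 Hz")    # ~8 ms keyboard/mouse
print(f"{interval_ms(60):.1f} ms per frame at 60 Hz")     # ~16.7 ms
print(f"{interval_ms(240):.1f} ms per frame at 240 Hz")   # ~4.2 ms
print(f"40 ms requested -> {achievable_duration_ms(40):.1f} ms shown at 60 Hz")
```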

On the face of it, these various sources of inaccuracy may appear to be an issue of concern when measuring reaction times (Plant, 2015). It is true that when absolute values of those measurements are the subject of investigation, the concern may be valid (Chetverikov & Upravitelev, 2015). Inaccuracies introduced by variability of clocks and polling rates on hardware devices (such as keyboards) may also have an effect where there is a lack of statistical power (Damian, 2010). However, in the vast majority of cases, where the variability of response time between and/or within participants is important, the variability and inaccuracy of timing in software-based experiments is negligible when compared to the variability in response timing between and within participants (Ulrich & Giray, 1989).

The time resolution of a reaction time clock has almost no effect on detecting mean reaction time differences even if the time resolution is about 30ms or worse.

Ulrich & Giray, 1989.
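
This point can be illustrated with a small simulation (all distribution parameters invented for illustration): quantizing simulated reaction times to a coarse 30 ms clock barely changes the estimated between-condition difference, because the quantization error is small and similar in both conditions relative to trial-to-trial variability.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rts(n, mean_ms, sd_ms):
    """Crude RT model (normal, truncated at 150 ms); parameters are illustrative."""
    return np.clip(rng.normal(mean_ms, sd_ms, n), 150, None)

def quantize(rts, resolution_ms):
    """Simulate a clock that only ticks every `resolution_ms` milliseconds."""
    return np.floor(rts / resolution_ms) * resolution_ms

a = simulate_rts(5000, mean_ms=500, sd_ms=100)   # condition A
b = simulate_rts(5000, mean_ms=520, sd_ms=100)   # condition B: true effect ~20 ms

ideal_diff = b.mean() - a.mean()
coarse_diff = quantize(b, 30).mean() - quantize(a, 30).mean()
print(f"estimated difference, ideal clock: {ideal_diff:.1f} ms")
print(f"estimated difference, 30 ms clock: {coarse_diff:.1f} ms")
# The coarse clock adds roughly uniform quantization error to every trial,
# which mostly cancels out when comparing condition means.
```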

Where reaction times between lab-based and online participants have been compared, the differences in timing had no effect on the standard deviation between reaction times (de Leeuw & Motz, 2015; Reimers & Stewart, 2007).

Online studies are an adequate source of response times data, comparable to a common keyboard-based setup in the laboratory. Higher sample sizes and ecological validity could even make such studies preferable.

Chetverikov & Upravitelev, 2015.

In an online visual cuing experiment, small reaction time effects (~20ms) were accurately measured despite unknown variation between actual and intended timing of stimulus presentation, stimulus onset asynchrony and response-times (Crump et al., 2013). Crump et al. concluded that only in very specific and rare cases, in which it was important to ensure stimulus presentation duration be kept below 50ms, would specialist hardware be required.


Ethics

The subject of ethics is a crucial one for all human research, both online and offline. However, the reach of online technology means that ethical issues are particularly important to consider for online studies, especially when secondary data are used (e.g., data collected by internet companies). As touched on earlier, it is difficult to be entirely confident about informed consent provided by an otherwise anonymous participant. However, of potentially greater concern is the type of informed consent known as effective consent. Effective consent is given when a participant voluntarily and autonomously engages in an activity, is competent to make that decision and understands the consequences of that engagement (Faden & Beauchamp, 1986). In the case of technology, informed consent is commonly buried in a terms of service (TOS) agreement: by using the technology (software, website, service, etc.), the user is deemed to have consented. This implicit agreement and assumption of effective consent raises some serious ethical questions.

A case in point is the now infamous Facebook and Cornell University massive-scale emotional contagion experiment (Kramer, Guillory, & Hancock, 2014). The researchers manipulated the news feeds of almost 700,000 unsuspecting users to display differing ratios of positive and negative stories. They then analysed those users’ ongoing behavior to determine whether their activity on the social network was affected by the manipulation. The backlash and criticism of the ethical standards used in the research following publication were such that a defence of the researchers’ position on the absence of informed consent was published in Nature, arguing that researchers should not be held to higher ethical standards than commercial businesses (Meyer, 2014). However, it is widely argued that this experiment did not have adequate ethical oversight. Flick (2015) provides an analysis of the ethical problems with this study, as well as recommendations for internet research ethics and informed consent.

Although approval by an institutional review board was not legally required for this study, it would have been better for everyone involved had the researchers sought ethics review and debriefed participants afterwards.

Meyer, 2014.

Some researchers suggest that online experiments carry a lower risk of harm than lab experiments because the social pressure against withdrawal is reduced. However, this is a small advantage in the overall context of harm. As is the case with in-lab experiments, online participants may be reminded of traumatic events, be subjected to unpleasant or disturbing issues, or learn something undesirable about themselves, all of which can lead to mental distress. The researcher should debrief the participant and provide feedback as soon as practicable, and if deception was used, should explain its nature and purpose. Further, if researchers become aware that a participant has been harmed in any way during the course of the experiment, they must take reasonable action to remediate (Kraut et al., 2004). With these issues in mind, debriefing and feedback become a complex ethical problem: in situations where a participant is harmed or deceived, the debriefing process may be inadequate, as it is inherently impersonal and unsupervised.

In sum, many ethical considerations must be taken into account when conducting online experiments, including consent, harm, deception, disclosure, financial incentives, debriefing and feedback.

Author's note: Fair compensation for participants engaged via recruitment services is emerging as a hot topic in ethical debate. This debate looks likely to lead to widespread change in how employment, participation, incentive, reward and compensation are defined when applied to participation in online experiments.


Experiment design

Not all studies in psychology need be truly experimental. Quasi-experiments, phenomenology, case studies, field studies, qualitative research, and many more study designs can be conducted online and most of the discussion in this article still applies to those designs. To better determine the design of a study, or to explain a study to a colleague or class, high-level designs can be produced simply with pen and paper using research design notation.


Conclusion

As this article shows, many advantages of online research can also be limitations (and vice versa), depending on the experimental context. It is important to remember that many of the problems discussed can be avoided by knowing what the pitfalls are and applying the techniques outlined in this article and in the references cited.

Behavioral science is on the verge of a revolution in open experimentation, with large participant pools, wide demographic diversity, access to fast results, automation and replication never possible before the rise of the ubiquitous internet. Researchers armed with the right knowledge, strategies and contingencies can take their studies to the next level using the power of online experimentation and research.


References

  1. Berinsky, A., Huber, G., & Lenz, G. (2012). Using Mechanical Turk as a subject recruitment tool for experimental research. Political Analysis 20(3), 351–368.
  2. Bethlehem, J. (2010). Selection bias in Web surveys. International Statistical Review, 78(2), 161–188. doi: 10.1111/j.1751-5823.2010.00112.x
  3. Bowen, A., Daniel, C., Williams, M., & Baird, G. (2008). Identifying multiple submissions in internet research: Preserving data integrity. AIDS and Behavior, 12(6), 964–973. doi: 10.1007/s10461-007-9352-2
  4. Buchanan, E., & Scofield, J. (2018). Methods to detect low quality data and its implication for psychological research. Behavior Research Methods, 50(6), 2586–2596. doi: 10.3758/s13428-018-1035-6
  5. Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of EMNLP 2009, ACL and AFNLP (2009), 286–295.
  6. Chandler, J., Mueller, P., & Paolacci, G. (2013). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. doi: 10.3758/s13428-013-0365-7
  7. Chandler, J., & Paolacci, G. (2017). Lie for a dime: When most prescreening responses are honest but most study participants are impostors. Social Psychological and Personality Science, 8(5), 500–508. doi: 10.1177/1948550617698203
  8. Chetverikov, A., & Upravitelev, P. (2015). Online versus offline: The Web as a medium for response time data collection. Behavior Research Methods, 48(3), 1086–1099. doi: 10.3758/s13428-015-0632-x
  9. Clifford, S., & Jerit, J. (2014). Is there a cost to convenience? An experimental comparison of data quality in laboratory and online studies. Journal of Experimental Political Science, 1(2), 120–131. doi: 10.1017/xps.2014.5
  10. Clifford, S., & Jerit, J. (2015). Do attempts to improve respondent attention increase social desirability bias? Public Opinion Quarterly, 79(3), 790–802. doi: 10.1093/poq/nfv027
  11. Clifford, S., & Jerit, J. (2016). Cheating on political knowledge questions in online surveys. Public Opinion Quarterly, 80(4), 858–887. doi: 10.1093/poq/nfw030
  12. Credé, M. (2010). Random responding as a threat to the validity of effect size estimates in correlational research. Educational and Psychological Measurement, 70, 596–612. doi: 10.1177/0013164410366686
  13. Crump, M., McDonnell, J., & Gureckis, T. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410. doi: 10.1371/journal.pone.0057410
  14. Damian, M. (2010). Does variability in human performance outweigh imprecision in response devices such as computer keyboards? Behavior Research Methods, 42(1), 205–211. doi: 10.3758/BRM.42.1.205
  15. Dandurand, F., Shultz, T., & Onishi, K. (2008). Comparing online and lab methods in a problem-solving experiment. Behavior Research Methods, 40(2), 428–434. doi: 10.3758/brm.40.2.428
  16. Dennis, S., Goodson, B., & Pearson, C. (2018). MTurk workers’ use of low-cost 'virtual private servers' to circumvent screening methods: A research note. SSRN Electronic Journal. doi: 10.2139/ssrn.3233954
  17. de Leeuw, J., & Motz, B. (2015). Psychophysics in a Web browser? Comparing response times collected with JavaScript and Psychophysics Toolbox in a visual search task. Behavior Research Methods, 48(1), 1–12. doi: 10.3758/s13428-015-0567-2
  18. Downs, J., Holbrook, M., Sheng, S., & Cranor, L. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the 28th international conference on human factors in computing systems. 2399–2402. doi: 10.1145/1753326.1753688
  19. Dupuis, M., Meier, E., & Cuneo, F. (2018). Detecting computer-generated random responding in questionnaire-based data: A comparison of seven indices. Behavior Research Methods, 1–10. doi: 10.3758/s13428-018-1103-y
  20. Faden, R., Beauchamp, T., & King, N. (1986). History and theory of informed consent. Oxford University Press, Incorporated.
  21. Fenner, Y., Garland, S., Moore, E., Jayasinghe, Y., Fletcher, A., & Tabrizi, S. et al. (2012). Web-based recruiting for health research using a social networking site: An exploratory study. Journal of Medical Internet Research, 14(1), e20. doi: 10.2196/jmir.1978
  22. Finley, A., & Penningroth, S. (2015). Online versus in-lab: Pros and cons of an online prospective memory experiment. In A. M. Columbus (Ed.), Advances in psychology research (pp. 135–161). New York, NY: Nova.
  23. Flick, C. (2015). Informed consent and the Facebook emotional manipulation study. Research Ethics, 12(1), 14–28. doi: 10.1177/1747016115599568
  24. Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25–42. doi: 10.1257/089533005775196732
  25. Frick, A., Bächtiger, M., & Reips, U. (2001). Financial incentives, personal information and drop-out in online studies. In U.-D. Reips & M. Bosnjak (Eds.), Dimensions of internet science (pp. 209–219). Lengerich, Germany: Pabst.
  26. Germine, L., Nakayama, K., Duchaine, B., Chabris, C., Chatterjee, G., & Wilmer, J. (2012). Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19(5), 847–857. doi: 10.3758/s13423-012-0296-9
  27. Göritz, A. (2006). Incentives in Web studies: Methodological issues and a review. International Journal of Internet Science, 1, 58–70.
  28. Hauser, D., & Schwarz, N. (2015). It’s a trap! Instructional manipulation checks prompt systematic thinking on “tricky” tasks. SAGE Open, 5(2), 215824401558461. doi: 10.1177/2158244015584617
  29. Hiskey, S., & Troop, N. (2002). Online longitudinal survey research: Viability and participation. Social Science Computer Review, 20(3), 250–259. doi: 10.1177/08939302020003003
  30. Horswill, M., & Coster, M. (2001). User-controlled photographic animations, photograph-based questions, and questionnaires: Three Internet-based instruments for measuring drivers’ risk-taking behavior. Behavior Research Methods, Instruments, & Computers, 33(1), 46–58. doi: 10.3758/bf03195346
  31. Jones, M., House, L., & Gao, Z. (2015). Respondent screening and revealed preference axioms: Testing quarantining methods for enhanced data quality in Web panel surveys. Public Opinion Quarterly, 79(3), 687–709. doi: 10.1093/poq/nfv015
  32. Khazaal, Y., van Singer, M., Chatton, A., Achab, S., Zullino, D., & Rothen, S. et al. (2014). Does self-selection affect samples’ representativeness in online surveys? An investigation in online video game research. Journal of Medical Internet Research, 16(7), e164. doi: 10.2196/jmir.2759
  33. Konstan, J., Simon Rosser, B., Ross, M., Stanton, J., & Edwards, W. (2005). The story of subject naught: A cautionary but optimistic tale of Internet survey research. Journal of Computer-Mediated Communication, 10. doi: 10.1111/j.1083-6101.2005.tb00248.x
  34. Kramer, A., Guillory, J., & Hancock, J. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788-8790. doi: 10.1073/pnas.1320040111
  35. Kraut, R., Olson, J., Banaji, M., Bruckman, A., Cohen, J., & Couper, M. (2004). Psychological research online: Report of board of scientific affairs' advisory group on the conduct of research on the Internet. American Psychologist, 59(2), 105–117. doi: 10.1037/0003-066x.59.2.105
  36. Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology, 5, 213–236.
  37. Liu, M., & Wronski, L. (2018). Trap questions in online surveys: Results from three web survey experiments. International Journal of Market Research, 60(1), 32–49. doi: 10.1177/1470785317744856
  38. Logie, R., & Maylor, E. (2009). An Internet study of prospective memory across adulthood. Psychology and Aging, 24(3), 767–774. doi: 10.1037/a0015479
  39. McCambridge, J., de Bruin, M., & Witton, J. (2012). The effects of demand characteristics on research participant behaviours in non-laboratory settings: A systematic review. PLoS ONE, 7(6), e39116. doi: 10.1371/journal.pone.0039116
  40. Meyer, M. (2014). Misjudgements will drive social trials underground. Nature, 511(7509), 265–265. doi: 10.1038/511265a
  41. Michalak, E., & Szabo, A. (1998). Guidelines for Internet research. European Psychologist, 3(1), 70–75. doi: 10.1027/1016-9040.3.1.70
  42. Mummolo, J., & Peterson, E. (2018). Demand effects in survey experiments: An empirical assessment. American Political Science Review, 113(2), 517–529. doi: 10.1017/s0003055418000837
  43. Nosek, B., Banaji, M., & Greenwald, A. (2002). E-research: Ethics, security, design, and control in psychological research on the Internet. Journal of Social Issues, 58(1), 161–176. doi: 10.1111/1540-4560.00254
  44. Oppenheimer, D., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45(4), 867–872. doi: 10.1016/j.jesp.2009.03.009
  45. Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419.
  46. Peer, E., Vosgerau, J., & Acquisti, A. (2013). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods, 46(4), 1023-1031. doi: 10.3758/s13428-013-0434-y
  47. Plant, R. (2015). A reminder on millisecond timing accuracy and potential replication failure in computer-based psychology experiments: An open letter. Behavior Research Methods, 48(1), 408–411. doi: 10.3758/s13428-015-0577-0
  48. Psychstudio (2019). Psychstudio (Version 2019) [Computer software]. Australia. Retrieved from https://www.psychstudio.com
  49. Reimers, S., & Stewart, N. (2007). Adobe Flash as a medium for online experimentation: a test of reaction time measurement capabilities. Behavior Research Methods, 39(3), 365–70. doi: 10.3758/bf03193004
  50. Reips, U. (2002). Standards for Internet-based experimenting. Experimental Psychology, 49(4), 243–256. doi: 10.1026//1618-3169.49.4.243
  51. Ruppertsberg, A., Givaty, G., Van Veen, H., & Bülthoff, H. (2001). Games as research tools for visual perception over the Internet. In U.-D. Reips & M. Bosnjak (Eds.), Dimensions of Internet science (pp. 147–158). Lengerich, Germany: Pabst.
  52. Teitcher, J., Bockting, W., Bauermeister, J., Hoefer, C., Miner, M., & Klitzman, R. (2015). Detecting, preventing, and responding to “fraudsters” in Internet research: Ethics and tradeoffs. The Journal of Law, Medicine & Ethics, 43(1), 116–133. doi: 10.1111/jlme.12200
  53. Ulrich, R., & Giray, M. (1989). Time resolution of clocks: Effects on reaction time measurement—Good news for bad clocks. British Journal of Mathematical and Statistical Psychology, 42(1), 1–12. doi: 10.1111/j.2044-8317.1989.tb01111.x
  54. van Steenbergen, H., & Bocanegra, B. (2015). Promises and pitfalls of Web-based experimentation in the advance of replicable psychological science: A reply to Plant (2015). Behavior Research Methods. 48(4), 1713–1717. doi: 10.3758/s13428-015-0677-x
  55. von Ahn, L., Blum, M., Hopper, N., & Langford, J. (2003). CAPTCHA: Using hard AI problems for security. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2656, 294–311.
  56. White, A., Strezhnev, A., Lucas, C., Kruszewska, D., & Huff, C. (2018). Investigator characteristics and respondent behavior in online surveys. Journal of Experimental Political Science, 5(1), 56–67. doi: 10.1017/xps.2017.2

Ready to start using the world's easiest online experiment builder?

Conduct simple psychology tests and surveys, or complex factorial experiments. Increase your sample size and automate your data collection with experiment software that does the programming for you.

Behavioral experiments. Superior stimulus design. No code.

Ben Howell
Founder, Psychstudio