Georgetown Law Library

Statistics and Empirical Legal Studies Research Guide

Data Collection

Data collection involves identifying the "population" about which you hope to make inferences, then figuring out what data you need based on your research questions, theories, and the observable implications of those theories. Sometimes you will need to collect or generate data from scratch, while other times you may be able to reuse existing data. If you cannot reuse existing data, you will need to make a plan for collecting or generating new data in an unbiased manner.

Identifying the Target Population

Before you collect any data, you should define your target population. The target population, or population of interest, is the population about which you want to make inferences. It consists of all the units (people, cases, countries, or whatever you are studying) about which you would collect data if your resources were unlimited, and it depends on your research question. Epstein & Martin 64-65 (2014). A population may be delimited by geographic factors (e.g., people in Texas), social factors (e.g., law students), and timeframe factors (e.g., all the Texas law students who took at least one class between 2012 and 2018), among others. Ryan 155 (2015). Defining your target population is essential to planning your data collection process, because you need to collect data from a representative sample of the population about which you want to make inferences.

Locating, Gathering or Generating Data

Once you define your target population, you should be able to determine what data you need based on that population and the measures you adopted during the research design process. Epstein & Martin 65 (2014). Before you attempt to gather or generate your own data, you should look into whether someone else has already produced data that you can reuse. See the Existing Datasets and Statistics portion of this guide for more information. If you decide to rely on existing data, you should first be sure you understand the process that was used to generate or gather it, including what population it comes from. Id. at 69. If you have checked for existing data sources and determined that they either do not exist or are of inadequate quality, you will have to gather or generate your own data.

In empirical legal research, there are four ways researchers commonly generate their own data: performing experiments, surveying, observing, and analyzing text. Epstein & Martin 70 (2014). True experiments are rare, especially in legal research. A true experiment requires both the random selection of units from the studied population and the random assignment of those units to either a control group or a treatment group, something that is difficult to do in a legal setting. Id. at 71. Surveys are more common, and involve asking people about their attitudes, opinions, or behavior. However, people selected for a survey may not always respond truthfully, accurately, or at all. Id. at 74-75. Observation involves watching your units of study as they engage in real-world activities. Some observational studies will involve subjects who know they are being observed and may alter their behavior accordingly. All observational studies require researchers to document the observed behavior, possibly introducing bias into the process, as when the researcher notes behavior that supports his or her hypothesis and fails to note behavior that does not. Id. at 80-81. Analyzing text can involve the simple extraction of facts (such as the race of a suspect described in a criminal probable cause affidavit) or the more complex and subjective extraction of "sentiment" (such as determining the ideology of different judges by analyzing their opinions). Id. at 81-82.

Analyzing legal texts such as judicial opinions and statutes is an extremely common activity in empirical legal research, and therefore deserves more attention here than the other three methods of gathering or generating data. Text analysis is one example of the broader technique known as "content analysis," and involves "collect[ing] a set of documents, such as judicial opinions on a particular subject, and systematically read[ing] them, recording consistent features of each and drawing inferences about their use and meaning." Hall & Wright 64 (2008). The process of "recording consistent features" of each text is also known as coding, and allows for translating properties of the text into variables that are suitable for systematic analysis. Epstein & Martin 95 (2014). For much more information about the empirical analysis of legal texts, see Hall & Wright (2008) and Epstein & Martin ch. 5 (2014). 
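
As a rough illustration of how coding turns features of a text into variables, the sketch below checks a human coder's entries for each opinion against a small codebook and then tallies one variable. The codebook, variable names, case identifiers, and values are all hypothetical and are not drawn from Hall & Wright or Epstein & Martin; a real project would define its own variables and categories.

```python
from collections import Counter

# Hypothetical codebook: each variable and the values a coder may record.
CODEBOOK = {
    "disposition": {"affirmed", "reversed", "mixed"},
    "dissent_filed": {"yes", "no"},
}

def validate_coding(record: dict) -> dict:
    """Check one coded opinion against the codebook and return it unchanged."""
    if "case_id" not in record:
        raise ValueError("every coded opinion needs a case_id")
    for variable, allowed in CODEBOOK.items():
        if variable not in record:
            raise ValueError(f"{record['case_id']}: missing variable {variable!r}")
        if record[variable] not in allowed:
            raise ValueError(
                f"{record['case_id']}: {record[variable]!r} is not an allowed "
                f"value for {variable!r}"
            )
    return record

# Hypothetical entries recorded by a coder after reading three opinions.
coded_opinions = [
    validate_coding({"case_id": "op-1", "disposition": "affirmed", "dissent_filed": "no"}),
    validate_coding({"case_id": "op-2", "disposition": "reversed", "dissent_filed": "yes"}),
    validate_coding({"case_id": "op-3", "disposition": "affirmed", "dissent_filed": "no"}),
]

# Once coded, the variables can be counted or analyzed like any other data.
disposition_counts = Counter(r["disposition"] for r in coded_opinions)
print(dict(disposition_counts))
```

Restricting each variable to a fixed set of values is one way to keep coding consistent across documents and across coders; it does not replace the judgment involved in reading each text.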

Because different types of data collection are subject to different types of measurement error, it may be useful to collect data using more than one method. Epstein & King 102 (2002).

Deciding How Much Data to Collect

Collect as much data as you can given the available resources. It is better to collect too much data than too little. Id. For example, if you are analyzing the outcome of federal appellate cases and your research question requires you to know which president appointed each deciding judge but you are uncertain whether you need to know the judges' ages, you should probably err on the side of recording their ages. The same process that allows you to determine the appointing presidents (reviewing judicial biographies) may also allow you to determine their ages. If you don't record all potentially relevant information the first time through and decide that you need it later, you will end up duplicating effort. If you record data that you don't end up using for the current project, it may still prove useful for a later project.

Another factor in deciding how much data to collect is how much uncertainty you are willing (or able) to tolerate in your research results. In general, the more data your conclusions are based on, the more certain those conclusions are. If you are reporting inferential statistics, you should include the margin of error, which is a measure of uncertainty about the statistics. Mathematically, a larger sample size results in a smaller margin of error. Epstein & Martin 85 (2014).
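
The relationship between sample size and margin of error can be sketched with a short calculation. The example below uses the standard normal-approximation formula for the margin of error of a sample proportion, z * sqrt(p * (1 - p) / n), and assumes a simple random sample from a large population. The function name and the example figures are illustrative assumptions, not taken from Epstein & Martin.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which assumes
    a simple random sample drawn from a large population. The default
    z = 1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(p * (1.0 - p) / n)

# Hypothetical example: half of the sampled cases had a given outcome.
for n in (100, 400, 1600):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>4}: margin of error = {moe:.4f}")
```

Under this formula the margin of error shrinks with the square root of the sample size, so quadrupling the sample only halves the margin of error. That diminishing return is worth weighing against the cost of collecting more data.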

Avoiding Bias in the Selection of Observations

It is often not possible to study all members of a target population. In such situations, you will want to use a method of selecting observations that won't bias your sample either for or against your theory. Id. at 86-87. If you have the resources to collect data from a fairly large sample, then the least biased way to select the sample is through random probability sampling. "A random probability sample is a sample in which each element in the total population has a known (and preferably the same) probability of being selected." Id. at 87. If you do not have the resources to draw a large sample, other selection mechanisms may be better for avoiding bias.
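
A minimal sketch of one kind of random probability sample, a simple random sample in which every unit has the same chance of selection, is shown below. It assumes you already have a complete list of the population (a sampling frame); the docket-style identifiers, sample size, and seed are hypothetical.

```python
import random

def simple_random_sample(frame, k, seed):
    """Draw k units from the sampling frame without replacement.

    Every unit has the same selection probability, k / len(frame).
    Passing a fixed seed lets the same draw be repeated later.
    """
    units = list(frame)
    if k > len(units):
        raise ValueError("sample size exceeds the size of the sampling frame")
    rng = random.Random(seed)
    return rng.sample(units, k)

# Hypothetical sampling frame: identifiers for 500 cases.
frame = [f"case-{i:03d}" for i in range(1, 501)]
sample = simple_random_sample(frame, k=50, seed=2024)

print(len(sample), "cases selected")
print("selection probability for each case:", 50 / len(frame))
```

Because selection is left to the random number generator rather than the researcher, the choice of cases cannot be influenced by which ones seem likely to support the theory.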

A great deal more information about sampling techniques is available in Johnnie Daniel, Sampling Essentials: Practical Guidelines for Making Sampling Choices (2012).

Documenting Collection Methods

Regardless of which methods are used to collect data, the collection process should be thoroughly documented. Documentation is adequate when it provides enough detail so that other researchers could replicate the data collection process independently. Epstein & King 38 (2002). For example, consider the statement that jurors were "randomly sampled" for interviews. Id. at 39. Based on this statement alone, could you repeat the original author's method of selecting a group of jurors to interview? Documenting your data collection methods makes your ultimate conclusions more credible to other researchers, because it permits them to evaluate your methods for flaws and biases.
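
One way to make a statement like "randomly sampled" replicable is to save a record of exactly how the sample was drawn alongside the data. The sketch below is an assumption about what such a record might contain, not a format prescribed by Epstein & King; the juror identifiers, field names, and seed are hypothetical. It records the Python version because the output of the standard library's sampling functions for a given seed is not guaranteed to be identical across versions.

```python
import json
import platform
import random
from dataclasses import asdict, dataclass

@dataclass
class SamplingRecord:
    """Details another researcher would need to repeat the selection."""
    frame_description: str
    frame_size: int
    sample_size: int
    method: str
    seed: int
    python_version: str

def draw_and_document(frame, k, seed, frame_description):
    """Draw a simple random sample and return it with a record of the draw."""
    ordered = sorted(frame)  # fixed ordering, so the draw ignores input order
    sample = random.Random(seed).sample(ordered, k)
    record = SamplingRecord(
        frame_description=frame_description,
        frame_size=len(ordered),
        sample_size=k,
        method="simple random sample without replacement (random.Random.sample)",
        seed=seed,
        python_version=platform.python_version(),
    )
    return sample, record

# Hypothetical juror identifiers standing in for a real jury list.
jurors = [f"juror-{i:03d}" for i in range(1, 201)]
selected, record = draw_and_document(
    jurors, k=20, seed=7,
    frame_description="Hypothetical list of 200 jurors from one court in one year",
)

# The JSON text can be stored with the dataset or reproduced in an appendix.
print(json.dumps(asdict(record), indent=2))
```

With the sampling frame, the seed, and the software version in hand, another researcher could rerun the selection and confirm that it yields the same group of jurors.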