1  What are we doing?

It is always good to start a course by thinking about exactly what it is we are doing. Toward that end, we start with a question: What is the goal of doing (biological) experiments? There are many answers you may have for this, some more obnoxious than others.

This question might be better addressed if we zoom out a bit and think about the scientific process as a whole. Below, we have a sketch of the scientific process. This cycle repeats itself as we explore nature and learn more. In the boxes are milestones, and along the arrows, in orange text, are the tasks that get us to these milestones.

Figure 1.1: The cycle of science, adapted from Fig. 1.1 of P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005.

We start to the left with a hypothesis of how a biological system works, often expressed as a sketch, or cartoon, which is something we can have intuition about. Indeed, it is often a cartoon of how a system functions that we have in mind. From these cartoons, we can use deductive inference to move our model from a cartoon to a prediction. We then perform experiments to test our predictions, that is, to assault our hypotheses, and the output of experiments is data. Once we have our data, we make inferences from them that enable us to update our hypotheses, thereby pushing human knowledge forward.

Let’s consider the tasks and their milestones in a little more depth. We start in the lower left.

1.1 Data analysis pipelines

Let’s consider the lower right arrow in the figure above, going from a data set to updated hypotheses and parameter estimates; this is the heart of this course. Zooming in on that arrow, we find several steps, which we will call a data analysis pipeline, depicted below.

Figure 1.2: Data analysis pipeline. Each block represents a milestone along the pipeline and each arrow a task to bring you there.

Let’s look at each step.

  • Data validation. We always have assumptions about data coming from a data source into the pipeline. We have assumptions about file formats, data organization, etc. We also have assumptions about the data themselves. For example, fluorescence measurements must all be nonnegative. Before proceeding through the pipeline, we need to make sure the data comply with all of these assumptions, lest we have issues down the pipeline. This process is called data validation. A minimal sketch of such checks follows this list.
  • Data wrangling. This is the process of converting the format of the validated data from the source to a format that is easier to work with. This could involve restructuring tabular data or extracting useful information out of images, etc. There are countless examples.
  • Exploratory data analysis. Once data sets are tidy and easy to work with, we can start exploring them. Generally, this involves making informative graphics that look at the data set from different angles. Sometimes this is sufficient to learn from the data, and we can proceed directly to updated hypotheses and publication, but we often also need to perform statistical inference, which comprises the next two steps of the pipeline.
  • Parameter estimation. This is the practice of quantifying the observed effects in the data, and in particular quantifying the uncertainty in our estimates of the effect sizes. This falls under the umbrella of statistical inference. A small sketch using the bootstrap also follows this list.
  • Hypothesis testing. Also a technique of statistical inference, hypothesis testing involves evaluating hypotheses about how the data were generated. This usually gives some quantitative assessment of the hypotheses, but there are many issues with quantifying the “truth” of a hypothesis.
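
To make the first two steps above more concrete, here is a minimal sketch in Python of the kinds of validation checks and wrangling involved. The file name, column names, and the specific checks are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Load a hypothetical wide-format CSV of fluorescence measurements,
# one column per sample. (The file and column names are made up.)
df = pd.read_csv("fluorescence.csv")

# Data validation: check that the data comply with our assumptions.
# Here, measurements must be present and nonnegative.
if df.isna().any().any():
    raise ValueError("Data set contains missing values.")
if (df.select_dtypes("number") < 0).any().any():
    raise ValueError("Fluorescence measurements must be nonnegative.")

# Data wrangling: reshape from wide format to tidy (long) format,
# with one measurement per row, which is easier to work with downstream.
df_tidy = df.melt(var_name="sample", value_name="fluorescence")
```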

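In the same spirit, here is a hedged sketch of parameter estimation, using a bootstrap confidence interval for a mean. The measurements below are invented solely to show the mechanics; the reasoning behind this and other inference approaches comes later in the course.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Invented measurements, for illustration only.
data = np.array([9.8, 10.1, 10.4, 9.6, 10.9, 10.2, 9.9, 10.5])

# Plug-in estimate of the parameter of interest (here, the mean).
mean_estimate = np.mean(data)

# Bootstrap: resample with replacement many times and recompute the
# estimate to quantify the uncertainty in it.
bs_reps = np.array(
    [np.mean(rng.choice(data, size=len(data))) for _ in range(10_000)]
)

# Report the estimate with a 95% bootstrap confidence interval.
ci_low, ci_high = np.percentile(bs_reps, [2.5, 97.5])
print(f"mean = {mean_estimate:.2f}, 95% conf int = [{ci_low:.2f}, {ci_high:.2f}]")
```
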
In this course, we will learn about all of these steps. We will spend roughly a third of the class on data validation, wrangling, and exploratory data analysis, and the rest of the class on statistical inference.