Before you Begin

Formally map every piece of data you collect to your study plan and/or reporting requirements, and be sure that the data collection will supply everything you need for analysis. Avoid collecting data you do not need and identify data critical to your research question.

Think about your data in terms of the number of variables you are trying to capture. You should, of course, try to get as much data as you need to prove or disprove your hypothesis. However, if you set out to collect too much data, you not only risk wasting resources and participants' time, but you might also find your analysis is irrelevant or uninformative. Trying to manage too many data elements also makes it easy to overlook errors in the data you care most about.

It is critical to consider your statistical analysis before you collect data to ensure nothing has been overlooked in the data collection design. Independent statistical advice is often useful, as alternative approaches may vastly improve the power and quality of your analysis.

Data Design

Provide as much information as possible to assist in data entry. Use the Field Label and Field Notes to describe exactly what kind of data you intend to capture in a given data entry field. Do not assume the data entry person knows the expected data, units, or formats for each field.

Use Field Notes mainly to supplement the Field Label information. For example, use Field Notes to specify the expected format of a validated field or the expected units of measurement.

Create a Codebook

A codebook describes each variable by name, giving the type of data (numeric, date/time, character), the units of measurement (e.g. grams, micrograms per deciliter), the purpose of collecting it, and its relationship to other data. The codebook is a human-readable, read-only version of the project’s Data Dictionary and in REDCap both are found on your project’s Home

Minimise use of free-text fields

Minimise the use of free-text fields because these complicate both processing and analysis of data. Typically, spelling errors and variations in spelling require substantial data cleaning before the responses can be used. Use categorical response field types (i.e. dropdown, radio or checkbox) instead of free-text fields (i.e. text box and notes box). Using multiple choice field types makes later data analysis easier. You can augment a categorical response with a text box or notes box to capture additional information if required. You can also minimise erroneous data entry and range errors by constraining choices to pre-existing values.
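To illustrate the cleaning burden, the short sketch below counts how many distinct values appear when the same answer is typed freely versus picked from a coded list. The responses are invented purely for illustration, and `count_distinct` is a hypothetical helper, not a REDCap feature.

```python
from typing import Hashable, Iterable


def count_distinct(values: Iterable[Hashable]) -> int:
    """Number of distinct raw values an analyst would have to reconcile."""
    return len(set(values))


# Invented free-text answers that all describe the same condition.
free_text = ["Diabetes", "diabetes", "Diabetis", "DIABETES ", "diabetes mellitus"]

# The same five answers captured with a categorical field (1 = Diabetes).
coded = [1, 1, 1, 1, 1]

print(count_distinct(free_text))  # 5
print(count_distinct(coded))      # 1
```

Every extra distinct spelling is a value that has to be mapped by hand before analysis, which is the work a categorical field avoids.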

Do not mix data types

It is possible to mix data types in data entry fields. For example, a researcher might enter a numerical code followed by a comment on the code such as “147 Patient had a cold.” Both the code and the comment might be informative but they should be placed in separate data entry fields (with data validation where applicable).
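If mixed entries already exist in a legacy data set, they can often be split before import. The following is a minimal sketch, assuming every mixed entry begins with an integer code followed by an optional comment. The function name `split_code_and_comment` is hypothetical and not part of REDCap.

```python
import re
from typing import Optional, Tuple

# Assumption: a mixed entry starts with an integer code, optionally followed
# by whitespace and a free-text comment.
_MIXED = re.compile(r"^\s*(\d+)\s*(.*)$", re.DOTALL)


def split_code_and_comment(entry: str) -> Tuple[Optional[int], str]:
    """Return (code, comment); code is None when no leading number is found."""
    match = _MIXED.match(entry)
    if match is None:
        return None, entry.strip()
    return int(match.group(1)), match.group(2).strip()


code, comment = split_code_and_comment("147 Patient had a cold.")
print(code)     # 147
print(comment)  # Patient had a cold.
```

The code can then go into a validated numeric field and the comment into a separate notes field.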

Use data validation

When using text box fields, use validation types and set minimum and/or maximum values wherever possible for better data accuracy. In particular, always use data validation for fields with a known format. For example, dates and email addresses should be validated at data entry.
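The kinds of checks a validated field performs can be sketched in a few lines. This is an illustration of the idea only, not how REDCap implements validation; the function names, the date format and the deliberately simple email pattern are all assumptions.

```python
import math
import re
from datetime import datetime
from typing import Optional

# Assumption: a deliberately simple pattern, good enough to catch obvious typos.
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_date(text: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def is_valid_email(text: str) -> bool:
    """True if text looks like name@domain.tld."""
    return _EMAIL.match(text) is not None


def is_in_range(text: str, minimum: Optional[float] = None,
                maximum: Optional[float] = None) -> bool:
    """True if text is a number within the optional minimum and maximum."""
    try:
        value = float(text)
    except ValueError:
        return False
    if math.isnan(value):
        return False
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


print(is_valid_date("2021-02-30"))    # False: February has no 30th day
print(is_in_range("37.2", 30, 45))    # True
```

Rejecting an impossible value at entry time is far cheaper than discovering it during analysis.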

For inexact dates enter day, month and year separately

Most people will be able to tell you their date of birth. Very few, however, will be able to tell you the exact date they first noticed symptoms of a particular disease, although they might be able to tell you the month and year. For this type of data, consider entering the day, month and year in separate fields. For example, a patient might not be able to tell you exactly when they first came down with measles. However, they might be able to limit the range of dates to, say, March of 2021. You could enter this into a database as:

Day      Month    Year
(blank)  3        2021
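Storing the parts separately also makes it possible to work out the range of dates an inexact entry could refer to. The sketch below is one possible approach; the `PartialDate` class and its `bounds` method are hypothetical names introduced for illustration.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class PartialDate:
    """A date whose day, or day and month, may be unknown (None)."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def bounds(self) -> Tuple[date, date]:
        """Earliest and latest calendar dates consistent with the known parts."""
        if self.month is not None and self.day is not None:
            exact = date(self.year, self.month, self.day)
            return exact, exact
        # Assumption: a day recorded without a month is treated as unknown.
        first_month = self.month if self.month is not None else 1
        last_month = self.month if self.month is not None else 12
        last_day = calendar.monthrange(self.year, last_month)[1]
        return date(self.year, first_month, 1), date(self.year, last_month, last_day)


earliest, latest = PartialDate(year=2021, month=3).bounds()
print(earliest, latest)  # 2021-03-01 2021-03-31
```

The analyst can then decide explicitly how to treat the uncertainty (for example, using the midpoint of the range) instead of guessing at entry time.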

Be consistent when using numerical codes

For example, if “unknown” is coded as 99 in one response, it should be coded as 99 wherever it appears in the database (although 99 would not be a good choice if age were also a field, as an age of 99 is a valid value). Similarly, if Yes = 1 and No = 0, use those codes across the entire database rather than changing them between fields. The numerical codes do not affect the order in which choices are displayed in the REDCap data entry form.
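A consistency check of this kind can be automated. The sketch below assumes the choice lists have already been loaded into a dictionary that maps each field name to its code-to-label pairs; the field names and the helper `find_inconsistent_labels` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def find_inconsistent_labels(fields: Dict[str, Dict[int, str]]) -> Dict[str, Set[int]]:
    """Return labels (lower-cased) that are given more than one code across fields."""
    codes_by_label: Dict[str, Set[int]] = defaultdict(set)
    for choices in fields.values():
        for code, label in choices.items():
            codes_by_label[label.strip().lower()].add(code)
    return {label: codes for label, codes in codes_by_label.items() if len(codes) > 1}


# Invented choice lists: "No" is coded 0 in one field and 2 in another.
fields = {
    "smoker":   {1: "Yes", 0: "No", 99: "Unknown"},
    "diabetic": {1: "Yes", 2: "No", 99: "Unknown"},
}

print(find_inconsistent_labels(fields))  # {'no': {0, 2}}
```

Running a check like this before data collection starts is much easier than recoding responses afterwards.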

Handle missing values consistently

Does a blank value mean that you still need to collect the value, that the value is not available, or that the value is not applicable?

Missing values might be missing for different reasons and the reasons they are missing might be relevant to your study. If the researcher’s approach to missing values is just to leave a blank in a data set when there is no value, then that researcher unknowingly might be throwing away useful data.

For example, a person might have a missing value for a medical test. It might be missing because the person still needs to take the test, because the person forgot to take the test, because the person does not recall the results, because the value was never recorded, or (for a thyroid test) because the person does not have a thyroid gland. Also consider whether yes and no are really the only choices for a given question. For example, a simple yes/no answer may not suffice for questions such as “Have you ever had disease X?” because patients might answer not only yes or no but also “I don’t know” or “I was tested for that once but I do not remember the results” or even “I do not want to answer that question.”

If data are missing or unknown, you can include reasons in your categorical responses or mark the data as missing and include a text box to record the reason for the missing value.
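One way to keep a value and the reason it is missing together is sketched below. The reason categories and the `Observation` class are assumptions chosen for illustration; the right list of reasons depends on your study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MissingReason(Enum):
    """Hypothetical reasons a value may be absent."""
    NOT_YET_COLLECTED = "not yet collected"
    NOT_AVAILABLE = "not available"
    NOT_APPLICABLE = "not applicable"
    DECLINED = "declined to answer"


@dataclass
class Observation:
    """Holds either a measured value or the reason it is missing, never both."""
    value: Optional[float] = None
    missing_reason: Optional[MissingReason] = None

    def __post_init__(self) -> None:
        # A bare blank is rejected: exactly one of the two must be supplied.
        if (self.value is None) == (self.missing_reason is None):
            raise ValueError("provide either a value or a missing reason")


measured = Observation(value=2.1)
no_gland = Observation(missing_reason=MissingReason.NOT_APPLICABLE)
print(no_gland.missing_reason.value)  # not applicable
```

Because an unexplained blank is rejected, every gap in the data set carries information that can be used in the analysis.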