Data Parsing and Metadata • fpcountr

The first step in any data analysis workflow is converting raw data produced by a data source (in our case, a plate reader) into a format that may be manipulated effectively for analysis. The parse() symbols in the overview figure below indicate that this is required for data output from each of the types of raw data analysed by this package: the spectrum data from the absorbance assay, the endpoint data from the fluorescence assay, and the endpoint/timecourse data from the experimental assay.

Data parsing functions read raw data to (1) extract data from the raw data file (removing unnecessary information), (2) tidy data into a so-called ‘tidy data’* format and (3) join data to appropriate metadata necessary for their analysis (such as information about the details of the samples in each well, the identity of the plate reader used, etc.).

*Tidy data is of the form where every column is a variable, every row is an observation, and every cell contains a single value.

To appreciate what parsing does to data, consider the differences between the following raw data:

..and parsed data:

In order to parse your data, you will need to create a metadata file, and identify a parsing method for it.

Metadata Requirements

What do we mean by metadata and why is it necessary? Some metadata is essential for calculations, such as the ability to group samples by a variable, and take a mean or standard deviation for each level (e.g. according to which dilution we measured in a calibration, or the level of inducer_concentration that was used in an experiment). Others are required to specify axes in plots, e.g. dilution for calibrations or time for a timecourse experiment. Still others are used to verify calculations: calibration values will differ for different instruments, so instrument is a crucial bit of metadata for every calibration.

Metadata files should be prepared in ‘tidy data’ format, and may look something like this, although the exact variables will differ for each assay type.

For calibrations: absorbance spectra

Function	Metadata Expected	Required For
parse_magellan_spectrum()	well	generating 'row' and 'column' variables
process_absorbance_spectrum()	instrument	reproducibility
	plate	reproducibility
	seal	reproducibility
	media	reproducibility
	calibrant	checking number of calibrants
	protein	removing unused wells
		finding blank wells
	dilution	separating plots
		identifying blanks
		plotting as x axis
	replicate	colouring plots
	volume	finding path lengths
	well	finding measurement columns
		path length calculations
	row	finding measurement columns
		finding rows to keep
	column	finding columns to keep
get_conc_ecmax()	instrument	reproducibility
		grouping
	plate	reproducibility
	seal	reproducibility
	media	reproducibility
		grouping
	calibrant	grouping
	protein	finding blank wells
		grouping
	dilution	calculating concentration
		plotting
		grouping
	well	removing wells

For calibrations: fluorescence data

Function	Metadata Expected	Required For
parse_magellan()	well	generating 'row' and 'column' variables
generate_cfs()	instrument	reproducibility
	plate	reproducibility
	seal	reproducibility
	channel_name	verifying cfs file matches channel used
		extracting gain
	channel_ex	reproducibility
	channel_em	reproducibility
	media	reproducibility
	calibrant	verifying unique calibrants
		adding calibrant names to cfs files
	protein	removing unused wells
		finding blank wells
	mw_gmol1	calculating molecule number
	concentration_ngul	calculating molecule number
		calculating dilution ratios
	replicate	identifying saturated values
	volume	calculating molecule number
	well	finding measurement columns
	row	finding measurement columns

For experimental data

Function	Metadata Expected	Required For
parse_magellan()	well	generating 'row' and 'column' variables
process_plate()	volume	converting 'OD' to 'OD (cm-1)'
	row	plotting data in 'microplate' format
	column	plotting data in 'microplate' format

Notes:

Variables that are not required for calculations are included for the sake of reproducibility.
channel_name/ex/em refers to the fluorescence ‘channel’, i.e. filter set. The one most commonly used for GFPs includes excitation filter at 485nm with 20nm bandwidth and emission at 535nm with 25nm bandwidth. For this, channel_name may be ‘green’ or ‘blue_ex_green_em’ or anything that is concise and meaningful for your calibrations. If you are taking calibrations at several gains (as is recommended), the names given to your fluorescence measurements/readings must match the value in channel_name exactly (e.g. channel_name = ‘green’ and measurements taken are called ‘green_40’, ‘green_50’, .. ‘green_120’). channel_ex could be “485/20”.
protein is expected to equal “none” for the blank/buffer wells and to be empty for wells that were not used in the assay.
dilution should be specified as 1 for the undiluted sample, 0.5 for a 2-fold diluted sample, etc. dilution must be left blank for blank and buffer wells.
mw_gmol1 is molecular weight in g/mol.
concentration_ngul is concentration in ng/ul.
row and column are also expected in every function above, but these should be auto-generated by the parser function from the metadata well column.
time should be auto-generated by the parser function for timecourse data.

Save metadata templates

For a shortcut to creating your own metadata templates, run:

save_metadata_template(
  data_type = "absspectrum", # OPTIONS: "absspectrum", "fluordata", "exptdata"
  plate_type = 96, # 96 for 96-well plate, etc
  outfolder = "templates" # where to save files
)

Methods for Parsing Data

Using Tecan Spark plate readers running Magellan software

As fpcountr was developed in a laboratory that used Tecan Spark instruments, we have bundled two functions that can parse data exported from the Spark/Magellan platform into the package.

parse_magellan() handles standard/endpoint and timecourse/kinetic data. To parse a data file, the minimum information you need is the location of your data and metadata files, and whether/not your data is a timeseries (i.e. timecourse/kinetic data). For fluorescence data from endpoint/standard assays, such as calibrations, you might use:

parsed_data <- parse_magellan(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  timeseries = FALSE
)

This function can work out the number of readings with no extra information, assuming that the only exported information is data, and not metadata. If the raw data export file includes lines above and below the fluorescence or absorbance data occupied by metadata such as ‘Well Positions’, ‘Layout’ or ‘Replicate Info’, this needs indicating with metadata_above (to indicate lines above the data taken up by such metadata), or metadata_below.

parsed_data <- parse_magellan(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  timeseries = FALSE,
  metadata_above = 1,
  metadata_below = 0
)

Further, if you exported temperature data with the fluorescence data, causing the data to be right-shifted from the otherwise default second column, use custom = TRUE and specify startcol and endcol.

parsed_data <- parse_magellan(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  timeseries = FALSE,
  custom = TRUE, startcol = 3, endcol = 98
)

If the data does not include all 96 wells expected, e.g. it is truncated at H11 instead of H12 due to export options, you can use insert_wells_below = 1 which will add a blank line at the end of the data. This can help with matching metadata files containing entries for 96 wells with data containing fewer than 96 entries.

parsed_data <- parse_magellan(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  timeseries = FALSE,
  custom = TRUE,
  startcol = 3,
  endcol = 97,
  insert_wells_above = 0,
  insert_wells_below = 1
)

For timecourse/kinetic assays, such as from microbial growth curve experiments, choose timeseries = TRUE and specify:

timestart: the text in column 1 with which to identify the first row of data. This is usually “0s”.
interval: the number of minutes between any 2 readings
mode: which specifies whether an interval starts with readings (“read_first”) or an incubation (“incubate_first”). This will normally be “read_first”.

parsed_data <- parse_magellan(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  timeseries = TRUE,
  timestart = "0s",
  interval = 10, # in minutes
  mode = "read_first", # mode can only be "read_first" or "incubate_first"
  metadata_above = 1,
  metadata_below = 0,
  custom = TRUE,
  startcol = 3,
  endcol = 97,
  insert_wells_above = 0,
  insert_wells_below = 1
)

A separate function, parse_magellan_spectrum(), should be used to parse spectrum data. It works similarly.

parsed_data_spectrum <- parse_magellan_spectrum(
  data_csv = "path/to/data.csv",
  metadata_csv = "path/to/metadata.csv",
  wellstart = "A1", wellend = "H12"
)

It is worth sanity checking outputs to make sure that the data has been extracted correctly, and adjusting the customisation parameters if it isn’t.

Examples

library(fpcountr)

1. Absorbance spectrum data for a calibration of mTagBFP2

Raw data:

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
Spectrum data	NA	NA	NA	NA	NA	NA	NA	NA		NA
[nm]	200.0000	201.0000	202.0000	203.0000	204.0000	205.0000	206.0000	207.0000	208	209.0000
A1	3.2934	3.2799	3.3390	3.3396	3.3812	3.3875	3.4014	3.3519	3.3743	3.3190
B1	3.2622	3.3121	3.3903	3.6836	3.4257	3.3808	3.4499	3.4804	3.4558	3.3016
A2	3.2612	3.3153	3.3262	3.3553	3.4475	3.3497	3.3685	3.3631	3.389	3.3843
B2	3.2615	3.2786	3.3289	3.5397	3.4848	3.3831	3.4237	3.4103	3.4671	3.3550
A3	3.2943	3.3587	3.3668	3.3376	3.3925	3.3521	3.3763	3.3388	3.3662	3.3503
B3	3.2896	3.3469	3.3185	3.3641	3.3797	3.2989	3.3664	3.3426	3.3452	3.3142
A4	3.2608	3.3900	3.3016	3.3737	3.4401	3.3511	3.4485	3.3411	3.5792	3.4078

Code:

parsed_data <- parse_magellan_spectrum(
  data_csv = "data/example_absorbance.csv",
  metadata_csv = "data/example_absorbance_meta.csv",
  wellstart = "A1", wellend = "B12"
)

## 24 wells identified.

Parsed data:

instrument	calibrant	dilution	well	200	201	202	203	204	205	206	207	208
spark1	mTagBFP2	1.000000000	A1	3.2934	3.2799	3.3390	3.3396	3.3812	3.3875	3.4014	3.3519	3.3743
spark1	mTagBFP2	0.500000000	A2	3.2612	3.3153	3.3262	3.3553	3.4475	3.3497	3.3685	3.3631	3.3890
spark1	mTagBFP2	0.250000000	A3	3.2943	3.3587	3.3668	3.3376	3.3925	3.3521	3.3763	3.3388	3.3662
spark1	mTagBFP2	0.125000000	A4	3.2608	3.3900	3.3016	3.3737	3.4401	3.3511	3.4485	3.3411	3.5792
spark1	mTagBFP2	0.062500000	A5	3.2948	3.3343	3.3076	3.5574	3.7229	3.8790	3.4160	3.3230	3.8098
spark1	mTagBFP2	0.031250000	A6	3.2894	3.3136	3.3454	3.4588	3.3556	3.3564	3.3910	3.4368	3.2939
spark1	mTagBFP2	0.015625000	A7	3.2930	3.3226	3.3383	3.3680	3.3786	3.4243	3.4188	3.2832	3.3297
spark1	mTagBFP2	0.007812500	A8	3.2936	3.3589	3.3350	3.3625	3.4282	3.3435	3.3975	3.3195	3.4037
spark1	mTagBFP2	0.003906250	A9	3.2909	3.2733	3.3648	3.3259	3.4149	3.3244	3.4223	3.3273	3.3610
spark1	mTagBFP2	0.001953125	A10	3.2934	3.2913	3.3022	3.4590	3.4969	3.3739	3.3996	3.4085	3.3520
spark1	mTagBFP2	0.000976563	A11	3.2608	3.2641	3.3324	3.3941	3.4011	3.3272	3.4094	3.3708	3.3890
spark1	mTagBFP2	NA	A12	3.2570	3.3090	3.3415	3.3510	3.3476	3.3447	3.4181	3.3793	3.3385

2. Fluorescence data for a calibration of mTagBFP2

Raw data:

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18
blueblue040					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue050					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue060					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue070					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue080					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue090					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue100					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue110					NA	NA	NA	NA	NA	NA	NA	NA					NA
blueblue120					NA	NA	NA	NA	NA	NA	NA	NA					NA
0s	133	68	36	19	11	7	4	4	3	3	3	3	134	74	35	19	11
43s	737	394	219	131	88	65	54	49	46	45	45	43	751	431	219	132	93
80s	2960	1588	886	540	370	278	233	213	200	198	194	191	2986	1720	889	546	385
118s	9097	4875	2739	1662	1153	861	730	663	631	625	608	595	9180	5292	2718	1676	1199

Code:

parsed_data <- parse_magellan(
  data_csv = "data/example_fluorescence.csv",
  metadata_csv = "data/example_fluorescence_meta.csv",
  timeseries = FALSE,
  custom = TRUE,
  startcol = 2,
  endcol = 25,
  insert_wells_above = 0,
  insert_wells_below = 72
)

Parsed data:

instrument	channel_name	channel_ex	channel_em	calibrant	well	blueblue040	blueblue050	blueblue060	blueblue070	blueblue080	blueblue090	blueblue100	blueblue110	blueblue120
spark1	blueblue	400/20	465/35	mTagBFP2	A1	133	737	2960	9097	22595	50522	NA	NA	NA
spark1	blueblue	400/20	465/35	mTagBFP2	A2	68	394	1588	4875	12140	27526	57864	NA	NA
spark1	blueblue	400/20	465/35	mTagBFP2	A3	36	219	886	2739	6849	15485	32882	64351	NA
spark1	blueblue	400/20	465/35	mTagBFP2	A4	19	131	540	1662	4154	9504	20037	39740	NA
spark1	blueblue	400/20	465/35	mTagBFP2	A5	11	88	370	1153	2868	6530	13895	27655	51126
spark1	blueblue	400/20	465/35	mTagBFP2	A6	7	65	278	861	2168	4945	10525	20781	38599
spark1	blueblue	400/20	465/35	mTagBFP2	A7	4	54	233	730	1830	4166	8822	17547	32679
spark1	blueblue	400/20	465/35	mTagBFP2	A8	4	49	213	663	1668	3787	8036	15969	29849
spark1	blueblue	400/20	465/35	mTagBFP2	A9	3	46	200	631	1582	3604	7658	15187	28256
spark1	blueblue	400/20	465/35	mTagBFP2	A10	3	45	198	625	1555	3547	7516	14953	27956
spark1	blueblue	400/20	465/35	mTagBFP2	A11	3	45	194	608	1523	3502	7404	14721	27357
spark1	blueblue	400/20	465/35	mTagBFP2	A12	3	43	191	595	1504	3400	7237	14436	26968

3. Experimental data

Raw data:

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16
Well positions		NA											NA	NA
OD600		NA											NA	NA
OD700		NA											NA	NA
blue		NA											NA	NA
bluelow		NA											NA	NA
		NA	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	NA	NA	B2
0s	30.2 °C	NA	0.1255	0.1276	0.1323	0.1219	0.1294	0.129	0.1282	0.1228	0.1284	0.0969	NA	NA	0.1283
599s	29.9 °C	NA	0.1298	0.1318	0.1351	0.1244	0.1293	0.1297	0.1287	0.1216	0.1256	0.095	NA	NA	0.1281
1199s	29.9 °C	NA	0.1333	0.1355	0.1384	0.1272	0.1325	0.1323	0.1315	0.1246	0.1279	0.0942	NA	NA	0.1307
1799s	29.9 °C	NA	0.1383	0.1407	0.1436	0.131	0.1363	0.1359	0.1347	0.1278	0.1311	0.0937	NA	NA	0.1342
2399s	29.8 °C	NA	0.1429	0.1454	0.1484	0.1357	0.1409	0.1402	0.1391	0.1319	0.1348	0.0935	NA	NA	0.1388

Code:

parsed_data <- parse_magellan(
  data_csv = "data/example_experiment.csv",
  metadata_csv = "data/example_experiment_meta.csv",
  timeseries = TRUE,
  timestart = "0s",
  interval = 10, # minutes.
  mode = "read_first", # mode can only be "read_first" or "incubate_first"
  metadata_above = 1,
  metadata_below = 0,
  custom = TRUE,
  startcol = 3,
  endcol = 97,
  insert_wells_above = 0,
  insert_wells_below = 1
)

Parsed data:

media	strain	plasmid	inducer	ara_pc	volume	well	OD600	OD700	blue	bluelow
					NA	A1	NA	NA	NA	NA
M9	DH10B	pS361	ara	0	200	A2	0.1255	0.1137	471	17
M9	DH10B	pS361	ara	0	200	A3	0.1276	0.1121	472	17
M9	DH10B	pS361	ara	0	200	A4	0.1323	0.1173	481	18
M9	DH10B	pS361	ara	0.3	200	A5	0.1219	0.1088	479	17
M9	DH10B	pS361	ara	0.3	200	A6	0.1294	0.1154	475	17
M9	DH10B	pS361	ara	0.3	200	A7	0.1290	0.1152	472	17
M9	DH10B	pS361_ara_mTagBFP2	ara	0	200	A8	0.1282	0.1139	468	17
M9	DH10B	pS361_ara_mTagBFP2	ara	0	200	A9	0.1228	0.1090	466	17
M9	DH10B	pS361_ara_mTagBFP2	ara	0	200	A10	0.1284	0.1144	469	17
M9	none	none	none	none	200	A11	0.0969	0.0910	473	17
					NA	A12	NA	NA	NA	NA

Using other plate readers

The above functions use clues from Magellan’s standardised data formats to extract the data from raw files correctly, and therefore will not work for data from other plate readers or file types.

For other plate readers, including Tecan Spark plate readers running SparkControl software (which exports data in a different format), we recommend two options:

Writing your own functions

For more advanced users or when working with a large number of similar data files, writing your own functions will be the best option to automate parsing of a large number of files. Note that:

the output needs to be ‘tidy data’
fpcountr functions additionally expect that the metadata is joined to the left of the data, and that the well column of the metadata has been used to create two further columns: row, and column, which are placed to the right of the data. The resultant structure of required parsed data files are, left to right: [all metadata columns], [all data columns], ‘row’, ‘column’.

Using Parsley

For users who do not have time to do this or do not need to automate parsing, our recommended route is to use Parsley. Parsley is a web app we developed to allow you to parse data of any type into ‘tidy data’ format. The graphical user interface (GUI) lets you select the location of the data and all its characteristics within your raw data file using a point and click interface, and generates both a parsed data spreadsheet for you, as well as a ‘parser function’ file, that can be reused for parsing multiple data files. The software is free, available on the web and does not require the download of any software or the creation of any account. (However, you can install it from Github and run it locally if you would like to. Running it locally speeds it up.)