The isatabr package is developed as a easy-to-use package for reading, modifying and writing files in the Investigation/Study/Assay (ISA) Abstract Model of the metadata framework using the ISA tab-delimited (TAB) format.
ISA is a metadata framework to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies. Built around the Investigation (the project context), Study (a unit of research) and Assay (analytical measurements) concepts, ISA helps you to provide rich descriptions of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.
The ISA-Tab structure is described in full detail on the ISA-tab website. The description below is mostly taken from there and slightly condensed when appropriate.
ISA-Tab uses three types of file to capture the experimental metadata:
The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files.
In order to facilitate identification of ISA-Tab component files, specific naming patterns should be followed:
The Investigation file fulfills four needs:
An Investigation file is structured as a table with vertical headings along the first column, and corresponding values in the subsequent columns. The following section headings must appear in the Investigation file (in order), and the study block (headings from STUDY to STUDY CONTACTS) can be repeated, one block per study associated with the investigation.
For a full description of all sections see the aforementioned site.
The Study file contains contextualizing information for one or more assays, for example; the subjects studied; their source(s); the sampling methodology; their characteristics; and any treatments or manipulations performed to prepare the specimens.
The Assay file represents a portion of the experimental graph (i.e., one part of the overall structure of the workflow); each Assay file must contain assays of the same type, defined by the type of measurement (e.g. gene expression) and the technology employed (e.g. DNA microarray). Assay-related information includes protocols, additional information relating to the execution of those protocols and references to data files (whether raw or derived).
As an example for working with the isatabr
package we will use the data set that accompanies Atwell et al. (2010). The associated files are included in the package.
ISA-Tab files can be stored in two different ways, either as separate files in a directory, or as .zip file containing the files. The example data is included in both ways in the package. Both formats can be read into R
using the readISATab
function.
When reading ISA-Tab files from a directory, only the name of the directory, where the ISA-TAB files are located, needs to be specified.
## Read ISA-Tab files from directory.
<- readISATab(path = file.path(system.file("extdata/Atwell", package = "isatabr"))) isaObject1
When reading zipped files, both the directory, where the zip-file is located, and the name of the file need to be specified.
## Read ISA-Tab files from directory.
<- readISATab(path = file.path(system.file("extdata", package = "isatabr")),
isaObject2 zipfile = "Atwell.zip")
In both cases readISATab
will automatically detect the Investigation, Study and Assay files assuming the naming conventions described in the previous section are followed. If this is not the case, the function will give an error indicating the problem. The imported ISA-Tab files are stored in an object of the S4 class ISA
. Since the information is almost identical for reading files from a directory and zipped-files, the following sections will show the example for the files read from a directory only.
All information from the ISA-Tab files is stored within slots in the ISA
object. The table below gives an overview of the different slots and a brief description of the information stored in the slot. For a more exhaustive description see help("ISA")
. Note that an investigation may have multiple studies. Therefore, data concerning studies is stored in a list
object, where one element in the list
corresponds to one study. Likewise a study may consist of multiple assays and assay data is stored in a list
object, where one element in the list
corresponds to one assay.
Slot | Type | Description |
---|---|---|
path | character |
path to the ISA-Tab files |
iFileName | character |
name of the investigation file |
oSR | data.frame |
ONTOLOGY SOURCE REFERENCE section of investigation file |
invest | data.frame |
INVESTIGATION section of investigation file |
iPubs | data.frame |
INVESTIGATION PUBLICATIONS section of investigation file |
iContacts | data.frame |
INVESTIGATION CONTACTS section of investigation file |
study | list of data.frames |
STUDY sections of investigation file |
sDD | list of data.frames |
STUDY DESIGN DESCRIPTORS sections of investigation file |
sPubs | list of data.frames |
STUDY PUBLICATIONS sections of investigation file |
sFacts | list of data.frames |
STUDY FACTORS sections of investigation file |
sAssays | list of data.frames |
STUDY ASSAYS sections of investigation file |
sProts | list of data.frames |
STUDY PROTOCOLS sections of investigation file |
sContacts | list of data.frames |
STUDY CONTACTS sections of investigation file |
sFiles | list of data.frames |
content of study files |
aFiles | list of data.frames |
content of assay files |
All slots have corresponding functions for accessing and modifying information. The names of these access functions are the same as the slots they refer to, e.g. accessing the iFileName slot in an ISA
object can be done using the iFileName()
function. There is one exception to this. To prevent problems with the path()
function, that already exists in quite some other packages, the path slot in an ISA
object should be accessed using the isaPath()
function.
## Access path for isaObjects
isaPath(isaObject1)
#> [1] "C:\\Users\\rossu027\\AppData\\Local\\Temp\\RtmpygYYN5\\Rinst2394d6a49a1\\isatabr\\extdata\\Atwell"
isaPath(isaObject2)
#> [1] "C:\\Users\\rossu027\\AppData\\Local\\Temp\\Rtmp6ZQbzc"
The path for isaObject1
shows the directory from which the files were read. As isaObject2
was read directly for a zipped archive, the files were first extracted into a temporary folder and subsequently read from there. This temporary folder is shown as the path.
The other slots are accessible in a similar way. Some more examples are shown below.
## Access studies.
<- study(isaObject1)
isaStudies
## Print study names.
names(isaStudies)
#> [1] "GMI_Atwell_study"
## Access study descriptors.
<- sDD(isaObject1)
isaSDD
## Shows study descriptors for study GMI_Atwell_study.
$GMI_Atwell_study
isaSDD#> Study Design Type
#> 1 GWAS of 107 phenotypes in Arabidopsis thaliana inbred lines using ~250k SNPs in 199 accessions
#> Study Design Type Term Accession Number Study Design Type Term Source REF
#> 1 <NA> <NA>
It is not only possible to access the different slots in an ISA
object, the slots can also be updated. As the access function, the update functions have the same name as the slots they refer to. As an example, let’s assume an error sneaked into the ONTOLOGY SOURCE REFERENCE section and we want to update one of the source versions.
First have a look at the current content of the ONTOLOGY SOURCE REFERENCE section.
<- oSR(isaObject1))
(isaOSR #> Term Source Name Term Source File Term Source Version
#> 1 OBI http://data.bioontology.org/ontologies/OBI 23
#> 2 EFO http://data.bioontology.org/ontologies/EFO 118
#> 3 UO http://purl.obolibrary.org/obo/UO <NA>
#> 4 NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON 6
#> 5 PO http://data.bioontology.org/ontologies/PO 10
#> 6 GMI http://gwas.gmi.oeaw.ac.at/ <NA>
#> Term Source Description
#> 1 Ontology for Biomedical Investigations
#> 2 Experimental Factor Ontology
#> 3 Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5 Plant Ontology
#> 6 Cataloque of Arabidopsis accessions at GMI
Now we update the version of the OBI ontology source from 23 to 24. Then we update the modified ontology source data.frame
in the ISA
object.
## Update version number.
1, "Term Source Version"] <- 24
isaOSR[
## Update oSR in ISA object.
oSR(isaObject1) <- isaOSR
## Check the updated oSR.
oSR(isaObject1)
#> Term Source Name Term Source File Term Source Version
#> 1 OBI http://data.bioontology.org/ontologies/OBI 24
#> 2 EFO http://data.bioontology.org/ontologies/EFO 118
#> 3 UO http://purl.obolibrary.org/obo/UO <NA>
#> 4 NCBITaxon http://data.bioontology.org/ontologies/NCBITAXON 6
#> 5 PO http://data.bioontology.org/ontologies/PO 10
#> 6 GMI http://gwas.gmi.oeaw.ac.at/ <NA>
#> Term Source Description
#> 1 Ontology for Biomedical Investigations
#> 2 Experimental Factor Ontology
#> 3 Unit Ontology
#> 4 National Center for Biotechnology Information (NCBI) Organismal Classification
#> 5 Plant Ontology
#> 6 Cataloque of Arabidopsis accessions at GMI
In a similar way all slots in an ISA
object can be accessed and updated.
The assay files may contain information about the files used to store the actual data for the assay. Per assay file two types of data files may be referred to: 1) the file(s) containing the raw data, and 2) the file(s) containing derived data.
Looking at the assay tab file in our example data, we see that the Raw Data File column is empty, no raw data files are available. However, the Derived Data File shows the file d_data.txt
.
## Inspect assay tab.
<- aFiles(isaObject1)
isaAFile head(isaAFile$a_study1.txt)
#> Sample Name Protocol REF Parameter Value[Organism part] Term Source REF Term Accession Number
#> 1 sample1 Phenotyping NA NA NA
#> 2 sample2 Phenotyping NA NA NA
#> 3 sample3 Phenotyping NA NA NA
#> 4 sample4 Phenotyping NA NA NA
#> 5 sample5 Phenotyping NA NA NA
#> 6 sample6 Phenotyping NA NA NA
#> Parameter Value[Trait Definition File] Assay Name Raw Data File Protocol REF
#> 1 tdf.txt assay1020 NA Data transformation
#> 2 tdf.txt assay1 NA Data transformation
#> 3 tdf.txt assay1131 NA Data transformation
#> 4 tdf.txt assay569 NA Data transformation
#> 5 tdf.txt assay293 NA Data transformation
#> 6 tdf.txt assay388 NA Data transformation
#> Derived Data File
#> 1 d_data.txt
#> 2 d_data.txt
#> 3 d_data.txt
#> 4 d_data.txt
#> 5 d_data.txt
#> 6 d_data.txt
To read the contents of the data files, either raw or derived, in the assay tab file, we can use the processAssay()
function. The exact working of this function depends on the technology type of the assay. For most technology types the data files are read as plain .txt
files assuming a tab-delimited format. Only for mass spectrometry and microarray data the files are read differently (see the sections below). As the output above shows, the assay file in the example has a Data Transformation technology and is therefore read as tab-delimited file.
Before being able to process the assay file, i.e. read the data, we first have to extract the assay tabs using the getAssayTabs()
function. This function extracts all the assay files from an ISA
object and stores them as assayTab
objects. These assayTab
objects contain not only the content of the assay tab file, but also extra information, e.g. technology type.
## Get assay tabs for isaObject1.
<- getAssayTabs(isaObject1)
aTabObjects
## Process assay data.
<- processAssay(isaObject = isaObject1,
isaDat aTabObject = aTabObjects$s_study1.txt$a_study1.txt,
type = "derived")
## Display first rows and columns.
head(isaDat[, 1:10])
#> Assay Name LD LDV SD SDV FT10 FT16 FT22 Seed Dormancy Emco5
#> 1 assay152 6.84105 32.6 93.0417 4.97494 74 87 89 NA NA
#> 2 assay279 NA NA NA NA NA NA NA NA NA
#> 3 assay211 NA NA NA NA NA NA NA NA NA
#> 4 assay256 NA NA NA NA NA NA NA NA NA
#> 5 assay907 NA NA NA NA NA NA NA NA NA
#> 6 assay948 NA NA NA NA NA NA NA NA NA
The data is now stored in isaDat
and can be used for further analysis within R
.
Mass spectrometry data is often stored in Network Common Data Form (NetCDF) files, i.e. in .CDF files. Assay data containing these data will be processed in a different way than regular assay data. To be able to do this the xcms package is required. This package is available from Bioconductor.
As an example for the processing of mass spectrometry files we will use a subset of the quantitated LC/MS peaks from the spinal cords of 6 wild-type and 6 fatty acid amide hydrolase (FAAH) knockout mice described in Saghatelian et al. (2004). A more extensive version of this data set is available in the faahKO data package on Bioconductor.
## Read ISA-Tab files for faahKO.
<- readISATab(path = file.path(system.file("extdata/faahKO", package = "isatabr"))) isaObject3
After reading the ISA-Tab files, we can now process the mass spectrometry assay data. In this example the raw data is available, so when processing the assay we specify type = "raw"
. The rest of the code is similar to the previous section.
## Get assay tabs for isaObject3.
<- getAssayTabs(isaObject3)
aTabObjects3
## Process assay data.
<- processAssay(isaObject = isaObject3,
isaDat3 aTabObject = aTabObjects3$s_Proteomic_profiling_of_yeast.txt$a_metabolite.txt,
type = "raw")
## Display output.
isaDat3#> An "xcmsSet" object with 1 samples
#>
#> Time range: 2506.1-4132.1 seconds (41.8-68.9 minutes)
#> Mass range: 200.1-599.3129 m/z
#> Peaks: 470 (about 470 per sample)
#> Peak Groups: 0
#> Sample classes:
#>
#> Feature detection:
#> o Peak picking performed on MS1.
#> Profile settings: method = bin
#> step = 0.1
#>
#> Memory usage: 0.0804 MB
As the output shows, processing the mass spectrometry data gives an object of class xcmsSet
from the xcms
package. This object contains all available information from the .CDF file that was read and can be used for further analysis.
Microarray data is often stored in an Affymetrix Probe Results file. These .CEL
files contain information on the probe set’s intensity values, and a probe set represents a gene. Assay data containing these data will be processed in a different way than regular assay data. To be able to do this the affy package is required. This package is available from Bioconductor.
Processing microarray data is done in a very similar way as processing mass spectrometry data, as described in the previous section. The main difference is that the resulting object will in this case be an object object of class ExpressionSet
, which is used as input in many Bioconductor packages.
After updating an ISA
object, it can be written back to a directory using the writeISAtab()
function. All content of the ISA
object will be written to investigation, study and assay files following the ISA-Tab standard for file specification. By default the files are written to the current working directory, but the directory can be specified using the path
argument.
## Write content of ISA object to a temporary directory.
writeISAtab(isaObject = isaObject1,
path = tempdir())
Note that existing files are always overwritten. Therefore, writing files to the same directory, from where the original files were read, will result in the original files being overwritten.
Besides writing the full ISA
object it is also possible to write only the investigation file , one or more study files or one or more assay files.
## Write investigation file.
writeInvestigationFile(isaObject = isaObject1,
path = tempdir())
## Write study file.
writeStudyFiles(isaObject = isaObject1,
studyFilenames = "s_study1.txt",
path = tempdir())
## Write assay file.
writeAssayFiles(isaObject = isaObject1,
assayFilenames = "a_study1.txt",
path = tempdir())