Contents

1 Introduction

This project is a parser, validator and normalizer implementation for shorthand lipid nomenclatures, base on the Grammar of Succinct Lipid Nomenclatures project.

Goslin defines multiple grammars for different sources of shorthand lipid nomenclature. This allows to generate parsers based on the defined grammars, which provide immediate feedback whether a processed lipid shorthand notation string is compliant with a particular grammar, or not.

NOTE: Please report any issues you might find to help improve it!

Here, rgoslin 2.0 uses the Goslin grammars and the cppgoslin parser to support the following general tasks:

  1. Facilitate the parsing of shorthand lipid names dialects.
  2. Provide a structural representation of the shorthand lipid after parsing.
  3. Use the structural representation to generate normalized names, following the latest shorthand nomenclature.

1.2 Supported nomenclatures

1.3 Changes from Version 1

  • The fragment grammar parser has been removed but may be re-added in the future. Please use our GitHub Issues if you need this parser.
  • The column names of the R implementation have been changed to contain “.” instead of ” ” (space). This makes handling and access easier, since quoting is no longer necessary.

2 Installation

The package can be installed with the BiocManager package as follows:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rgoslin")

3 Example use cases

In order to use the provided translation functions of rgoslin, you first need to load the library. If rgoslin is not yet available, please follow the instructions in the previous section on “Installation”.

library(rgoslin)

If you want to check, which grammars are supported, use the following command:

listAvailableGrammars()
#> [1] "Shorthand2020" "Goslin"        "FattyAcids"    "LipidMaps"    
#> [5] "SwissLipids"   "HMDB"

To check, whether a given lipid name can be parsed by any of the parsers, you can use the isValidLipidName method. It will return TRUE if the given name can be parsed by any of the available parsers and FALSE if the name was not parseable.

isValidLipidName("PC 32:1")
#> [1] TRUE

3.1 Parsing a single lipid name

Using parseLipidNames with a lipid name returns a data frame of properties of the parsed lipid name as columns.

df <- parseLipidNames("PC 32:1")
Table 1: Table 2: Lipid name parsing results for PC 32:1, FA and LCB columns omitted, since they are unpopulated (NA) on the lipid species level.
Normalized.Name Original.Name Grammar Message Adduct Adduct.Charge Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula
PC 32:1 PC 32:1 Shorthand2020 NA NA 0 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 731.5465 C40H78NO8P

If you want to set the grammar to parse against manually, this is also possible as the second argument:

originalName <- "TG(16:1(5E)/18:0/20:2(3Z,6Z))"
tagDf <- parseLipidNames(originalName, grammar = "LipidMaps")
Table 3: Table 4: Lipid name parsing results for TG isomeric subspecies, FA and LCB columns omitted for brevity.
Normalized.Name Original.Name Grammar Message Adduct Adduct.Charge Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula
TG 16:1(5E)/18:0/20:2(3Z,6Z) TG(16:1(5E)/18:0/20:2(3Z,6Z)) LipidMaps NA NA 0 GL TG TG 54:3 TG TG 16:1_18:0_20:2 TG 16:1/18:0/20:2 TG 16:1(5)/18:0/20:2(3,6) TG 16:1(5E)/18:0/20:2(3Z,6Z) [TG] [TG, TAG] FULL_STRUCTURE 54 0 0 3 884.7833 C57H104O6
Table 5: Table 6: Lipid name parsing results for TG isomeric subspecies with FA and LCB columns.
Normalized.Name FA1.Position FA1.C FA1.OH FA1.O FA1.DB FA1.Bond.Type FA1.DB.Positions FA2.Position FA2.C FA2.OH FA2.O FA2.DB FA2.Bond.Type FA2.DB.Positions FA3.Position FA3.C FA3.OH FA3.O FA3.DB FA3.Bond.Type FA3.DB.Positions FA4.Position FA4.C FA4.OH FA4.O FA4.DB FA4.Bond.Type FA4.DB.Positions LCB.Position LCB.C LCB.OH LCB.O LCB.DB LCB.Bond.Type LCB.DB.Positions
TG 16:1(5E)/18:0/20:2(3Z,6Z) 1 16 0 0 1 ESTER [5E] 2 18 0 0 0 ESTER [] 3 20 0 0 2 ESTER [3Z, 6Z] NA NA NA NA NA NA NA NA NA NA NA NA NA NA

3.2 Parsing multiple lipid names

If you want to parse multiple lipid names, use the parseLipidNames method with a vector of lipid names. This returns a data frame of properties of the parsed lipid names with one row per lipid.

NOTE: Omitting the grammar argument will test all available parsers, the first one to successfully parse the name wins. This will consequently take longer than explicitly setting the grammar to select the parser for.

multipleLipidNamesDf <- parseLipidNames(c("PC 32:1","LPC 34:1","TG(18:1_18:0_16:1)"))
Table 7: Table 8: Lipid name parsing results for PC 32:1, LPC 34:1, TG(18:1_18:0_16:1), FA and LCB columns omitted for brevity.
Normalized.Name Original.Name Grammar Message Adduct Adduct.Charge Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula
PC 32:1 PC 32:1 Shorthand2020 NA NA 0 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 731.5465 C40H78NO8P
LPC 34:1 LPC 34:1 Shorthand2020 NA NA 0 GP LPC LPC 34:1 LPC LPC 34:1 NA NA NA [LPC] [LPC, LysoPC, lysoPC] MOLECULAR_SPECIES 34 0 0 1 745.5985 C42H84NO7P
TG 18:1_18:0_16:1 TG(18:1_18:0_16:1) LipidMaps NA NA 0 GL TG TG 52:2 TG TG 18:1_18:0_16:1 NA NA NA [TG] [TG, TAG] MOLECULAR_SPECIES 52 0 0 2 858.7676 C55H102O6
Table 9: Table 10: Lipid name parsing results for PC 32:1, LPC 34:1, TG(18:1_18:0_16:1) with FA and LCB columns.
Normalized.Name FA1.Position FA1.C FA1.OH FA1.O FA1.DB FA1.Bond.Type FA1.DB.Positions FA2.Position FA2.C FA2.OH FA2.O FA2.DB FA2.Bond.Type FA2.DB.Positions FA3.Position FA3.C FA3.OH FA3.O FA3.DB FA3.Bond.Type FA3.DB.Positions FA4.Position FA4.C FA4.OH FA4.O FA4.DB FA4.Bond.Type FA4.DB.Positions LCB.Position LCB.C LCB.OH LCB.O LCB.DB LCB.Bond.Type LCB.DB.Positions
PC 32:1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
LPC 34:1 -1 34 0 0 1 ESTER [] -1 0 0 0 0 ESTER [] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
TG 18:1_18:0_16:1 -1 18 0 0 1 ESTER [] -1 18 0 0 0 ESTER [] -1 16 0 0 1 ESTER [] NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Finally, if you want to parse multiple lipid names and want to use one particular grammar, simply add its name as the “grammar” argument.

originalNames <- c("PC 32:1","LPC 34:1","TAG 18:1_18:0_16:1")
multipleLipidNamesWithGrammar <- parseLipidNames(originalNames, grammar = "Goslin")
Table 11: Table 12: Lipid name parsing results for Goslin grammar and lipids PC 32:1, LPC 34:1, TG(18:1_18:0_16:1), FA and LCB columns omitted for brevity.
Normalized.Name Original.Name Grammar Message Adduct Adduct.Charge Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula
PC 32:1 PC 32:1 Goslin NA NA 0 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 731.5465 C40H78NO8P
LPC 34:1 LPC 34:1 Goslin NA NA 0 GP LPC LPC 34:1 LPC LPC 34:1 NA NA NA [LPC] [LPC, LysoPC, lysoPC] MOLECULAR_SPECIES 34 0 0 1 745.5985 C42H84NO7P
TG 18:1_18:0_16:1 TAG 18:1_18:0_16:1 Goslin NA NA 0 GL TG TG 52:2 TG TG 18:1_18:0_16:1 NA NA NA [TAG] [TG, TAG] MOLECULAR_SPECIES 52 0 0 2 858.7676 C55H102O6
Table 13: Table 14: Lipid name parsing results for Goslin grammar and lipids PC 32:1, LPC 34:1, TG(18:1_18:0_16:1) with FA and LCB columns.
Normalized.Name FA1.Position FA1.C FA1.OH FA1.O FA1.DB FA1.Bond.Type FA1.DB.Positions FA2.Position FA2.C FA2.OH FA2.O FA2.DB FA2.Bond.Type FA2.DB.Positions FA3.Position FA3.C FA3.OH FA3.O FA3.DB FA3.Bond.Type FA3.DB.Positions FA4.Position FA4.C FA4.OH FA4.O FA4.DB FA4.Bond.Type FA4.DB.Positions LCB.Position LCB.C LCB.OH LCB.O LCB.DB LCB.Bond.Type LCB.DB.Positions
PC 32:1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
LPC 34:1 -1 34 0 0 1 ESTER [] -1 0 0 0 0 ESTER [] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
TG 18:1_18:0_16:1 -1 18 0 0 1 ESTER [] -1 18 0 0 0 ESTER [] -1 16 0 0 1 ESTER [] NA NA NA NA NA NA NA NA NA NA NA NA NA NA

3.3 Parsing IUPAC-compliant Fatty Acid Names

LIPID MAPS has a number of fatty acids that use names following the IUPAC-IUB Fatty Acids nomenclature. These can now also be parsed and converted to the updated lipid shorthand nomenclature. We are using LMFA01020216 and LMFA08040030 as examples here:

originalNames <- c("LMFA01020216"="5-methyl-octadecanoic acid", "LMFA08040030"="N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine")
normalizedFattyAcidsNames <- parseLipidNames(originalNames, "FattyAcids")
Table 15: Table 16: Lipid name parsing results for Goslin grammar and fatty acids 5-methyl-octadecanoic acid N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine, some columns omitted for brevity.
Normalized.Name Original.Name Grammar Message Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula FA1.Position FA1.C FA1.OH FA1.O FA1.DB FA1.Bond.Type FA1.DB.Positions
FA 18:0;5Me 5-methyl-octadecanoic acid FattyAcids NA FA FA FA 19:0 FA FA 19:0 FA 19:0 FA 18:0;Me FA 18:0;5Me [FA] [FA, fatty acid] FULL_STRUCTURE 19 0 0 0 298.2872 C19H38O2 1 18 0 0 0 ESTER []
NAE 20:3(5Z,11Z,14Z);8OH,9OH N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine FattyAcids NA FA NAE NAE 20:3;O2 NAE NAE 20:3;O2 NAE 20:3;O2 NAE 20:3(5,11,14);(OH)2 NAE 20:3(5Z,11Z,14Z);8OH,9OH [NAE] [NAE] FULL_STRUCTURE 20 2 2 3 381.2879 C22H39NO4 1 20 2 2 3 ESTER [5Z, 11Z, 14Z]

3.4 Parsing Adducts

The Goslin parser also support reading of lipid shorthand names with adducts:

originalNames <- c("PC 32:1[M+H]1+", "PC 32:1 [M+H]+","PC 32:1")
lipidNamesWithAdduct <- parseLipidNames(originalNames, "Goslin")

This will populate the columns “Adduct” and “AdductCharge” with the respective values. Please note that we recommend to use the adduct and its charge in full IUPAC recommended nomenclature:

Table 17: Table 18: Lipid name parsing results for Goslin grammar and lipids PC 32:1[M+H]1+, PC 32:1 [M+H]+ and PC 32:1, FA and LCB columns omitted for brevity.
Normalized.Name Original.Name Grammar Message Adduct Adduct.Charge Lipid.Maps.Category Lipid.Maps.Main.Class Species.Name Extended.Species.Name Molecular.Species.Name Sn.Position.Name Structure.Defined.Name Full.Structure.Name Functional.Class.Abbr Functional.Class.Synonyms Level Total.C Total.OH Total.O Total.DB Mass Sum.Formula
PC 32:1 PC 32:1[M+H]1+ Goslin NA [M+H]1+ 1 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 732.5538 C40H79NO8P
PC 32:1 PC 32:1 [M+H]+ Goslin NA [M+H]1+ 1 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 732.5538 C40H79NO8P
PC 32:1 PC 32:1 Goslin NA NA 0 GP PC PC 32:1 PC NA NA NA NA [PC] [PC, GPC, GPCho, PlsCho] SPECIES 32 0 0 1 731.5465 C40H78NO8P

3.5 Using rgoslin with lipidr

lipidr is a Bioconductor package with specific support for QC checking, uni- and multivariate analysis and visualization of lipidomics data acquired with Skyline or from metabolomics workbench. It uses a custom implementation for lipid name handling that does not yet support the updated shorthand nomenclature.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("lipidr")

After installation of lipidr, we need to load the following libraries for this example:

library(rgoslin)
library(lipidr)
library(stringr)
library(ggplot2)

We will use lipidr’s workflow example to illustrate how to apply rgoslin in such a use-case. We read in an example dataset exported from Skyline based on a targeted lipidomics experiment:

datadir = system.file("extdata", package="lipidr")
filelist = list.files(datadir, "data.csv", full.names = TRUE) # all csv files
d = read_skyline(filelist)
#> Successfully read 3 methods.
#> Your data contain 58 samples, 10 lipid classes, 277 lipid molecules.
clinical_file = system.file("extdata", "clin.csv", package="lipidr")
d = add_sample_annotation(d, clinical_file)

This dataset contains the lipid names in the Molecule column and the name preprocessed with lipidr in the clean_name column. In this excerpt, you can see PE 34:1 NEG in row 5, which indicates measurement of this lipid in negative mode. Please note that this is specific to this example and not a generally applied naming convention.

Table 19: Table 20: Subset of first ten rows of row data.
filename Molecule Precursor.Mz Precursor.Charge Product.Mz Product.Charge clean_name ambig not_matched istd class_stub chain1 l_1 s_1 chain2 l_2 s_2 chain3 l_3 s_3 chain4 l_4 s_4 total_cl total_cs Class
A1_data.csv PE 32:0 692.5 1 551.5 1 PE 32:0 FALSE FALSE FALSE PE 32:0 32 0 NA NA NA NA NA NA 32 0 PE
A1_data.csv PE 32:1 690.5 1 549.5 1 PE 32:1 FALSE FALSE FALSE PE 32:1 32 1 NA NA NA NA NA NA 32 1 PE
A1_data.csv PE 32:2 688.5 1 547.5 1 PE 32:2 FALSE FALSE FALSE PE 32:2 32 2 NA NA NA NA NA NA 32 2 PE
A1_data.csv PE 34:1 718.5 1 577.5 1 PE 34:1 FALSE FALSE FALSE PE 34:1 34 1 NA NA NA NA NA NA 34 1 PE
A1_data.csv PE 34:1 NEG 716.5 1 196.0 1 PE 34:1 FALSE FALSE FALSE PE 34:1 34 1 NA NA NA NA NA NA 34 1 PE
A1_data.csv PE 34:2 716.5 1 575.5 1 PE 34:2 FALSE FALSE FALSE PE 34:2 34 2 NA NA NA NA NA NA 34 2 PE
A1_data.csv PE 34:3 714.5 1 573.5 1 PE 34:3 FALSE FALSE FALSE PE 34:3 34 3 NA NA NA NA NA NA 34 3 PE
A1_data.csv PE 36:0 748.6 1 607.6 1 PE 36:0 FALSE FALSE FALSE PE 36:0 36 0 NA NA NA NA NA NA 36 0 PE
A1_data.csv PE 36:1 746.6 1 605.6 1 PE 36:1 FALSE FALSE FALSE PE 36:1 36 1 NA NA NA NA NA NA 36 1 PE
A1_data.csv PE 36:1 NEG 744.6 1 196.0 1 PE 36:1 FALSE FALSE FALSE PE 36:1 36 1 NA NA NA NA NA NA 36 1 PE

Now, let’s try to parse the clean lipid names to enrich the data table.

Note: In this case, we expect to see error messages, since some lipid names use a) unsupported head group names or b) unsupported suffixes to indicate isotopically labeled lipids.

lipidNames <- parseLipidNames(rowData(d)$clean_name)
#> Encountered an error while parsing 'So1P 17:1': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'So1P 18:1': Expecting a single string value: [type=character; extent=4].

We see that lipidr’s example dataset uses the (d9) suffix to indicate isotopically labeled lipids (d=Deuterium) that are used as internal standards for quantification. We need to convert Sa1P, sphinganine-1-phosphate, to SPBP, as well as So1P, sphingosine-1-phosphate to align with recent nomenclature updates.

# lipidr stores original lipid names in the Molecule column
old_names <- rowData(d)$Molecule
# split lipid prefix from potential (d7) suffix for labeled internal standards
new_names <- rowData(d)$clean_name %>% str_match(pattern="([\\w :-]+)(\\(\\w+\\))?")
# extract the first match group (the original word is at column index 1)
normalized_new_names <- new_names[,2] %>% str_replace_all(c("Sa1P"="SPBP","So1P"="SPBP")) %>% parseLipidNames(.)

We will receive a number of warnings in the next step, since lipidr currently checks lipid class names against a predefined, internal list that does not contain the updated class names according to the new shorthand nomenclature. The updated Molecule names will however appear in downstream visualizations.

updated <- update_molecule_names(d, old_names, normalized_new_names$Normalized.Name)
#> Joining with `by = join_by(Molecule)`
#> Warning in annotate_lipids(updated_names$Molecule): Some lipid names couldn't be parsed because they don't follow the pattern 'CLS xx:x/yy:y' 
#>     PE O-34:0, PE O-34:1, PE O-34:2, PE O-36:0, PE O-36:1, PE O-36:2, PE O-36:3, PE O-36:4, PE O-36:5, PE O-38:0, PE O-38:1, PE O-38:2, PE O-38:3, PE O-38:4, PE O-38:5, PE O-40:5, PE O-40:6, PE O-40:7, PE O-32:1, PE O-34:3, PE O-38:6, PE O-38:7, PI O-34:1, SPBP 18:0;OH, SPBP 17:1;O2, SPBP 18:1;O2, SM 18:1;O2, SM 18:0;O2, Cer 18:0;O2, Cer 18:1;O2, Cer 18:2;O2, PC O-32:0, PC O-34:1, PC O-34:2, PC O-34:3, PC O-36:2, PC O-36:3, PC O-36:4, PC O-36:5, PC O-38:0, PC O-38:2, PC O-38:3, PC O-38:4, PC O-38:5, PC O-40:0, PC O-40:1, PC O-40:2, PC O-40:3, PC O-40:4, PC O-40:5, PC O-40:6, PC O-40:7, PC O-32:1, PC O-34:4, PC O-36:6, PC O-38:1, PC O-38:6, PC O-38:7
#> Joining with `by = join_by(Molecule)`

We can augment the class column using rgoslin’s LipidMaps main class, since row order in rgoslin’s output is the same as in its input. Additionally, further information may be interesting to include, such as the mass and sum formula of the uncharged lipids. We will select the same lipid classes as in the lipidr targeted lipidomics vignette: Ceramides (Cer), Lyso-Phosphatidylcholines (LPC) and Phosphatidylcholines (PC).

rowData(updated)$Class <- normalized_new_names$Lipid.Maps.Main.Class
rowData(updated)$Category <- normalized_new_names$Lipid.Maps.Category
rowData(updated)$Molecule <- normalized_new_names$Normalized.Name
rowData(updated)$LipidSpecies <- normalized_new_names$Species.Name
rowData(updated)$Mass <- normalized_new_names$Mass
rowData(updated)$SumFormula <- normalized_new_names$Sum.Formula
# select Ceramides, Lyso-Phosphatidylcholines and Phosphatidylcholines (includes plasmanyls and plasmenyls)
lipid_classes <- rowData(updated)$Class %in% c("Cer","LPC", "PC")
d <- updated[lipid_classes,]

In the next step, we use a non-exported method provided by lipidr to convert the row data into a format more suitable for plotting with ggplot. We will plot the area distribution of lipid species as boxplots, colored by lipid class and facetted by filename, similar to some plot examples in lipidr’s vignette.

ddf <- lipidr:::to_long_format(d)
ggplot(data=ddf, mapping=aes(x=Molecule, y=Area, fill=Class)) + geom_boxplot() + facet_wrap(~filename, scales = "free_y") + scale_y_log10() + coord_flip()
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
#> Warning: Removed 15 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).