text mining

Knowledge Graphs, Semantic Web and Drug Safety

July 12, 2019 by Jose Rossello

Second part of: Mining PubMed for Drug Induced Acute Kidney Injury

When I wrote “Mining PubMed for Drug Induced Acute Kidney Injury”, my intention was to start exploring the use of PubMed for knowledge discovery in the fields of drug safety and pharmacovigilance. But to discover new knowledge, you first need to know what is already known.

Using our example of drug induced acute kidney injury (AKI), if we want to discover new associations, we should be aware of which drugs are known to increase the risk of renal damage, or to worsen renal function on an already impaired kidney.

For marketed, prescription drugs, we can use the FDA labels as a reference of what adverse reactions are already known for a specific product, and check them against our PubMed search, for knowledge discovery.
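As a rough sketch of that comparison, assume we already hold two character vectors: one with the drugs whose label lists acute kidney injury, and one with the drugs mentioned in our PubMed results. Both vectors below are made-up placeholders, not real label or literature data. Base R set operations are then enough to separate what is already known from what deserves a closer look:

```r
# Hypothetical placeholder names; real lists would come from the FDA label
# files and from the mined PubMed abstracts.
labeled_drugs <- c("drug_a", "drug_b", "drug_c")
pubmed_drugs  <- c("drug_b", "drug_c", "drug_d", "drug_e")

# Associations that the labels already describe
known_in_both <- intersect(pubmed_drugs, labeled_drugs)

# Drugs mentioned in the literature but absent from the labels:
# candidates for closer review, not confirmed signals
candidates <- setdiff(pubmed_drugs, labeled_drugs)

print(known_in_both)
print(candidates)
```

The difficult part in practice is building the two lists with consistent drug names, which is where the shared vocabularies discussed later come in.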

How can we reach that goal? First, it is helpful to know that the FDA provides us with the labels of all approved products, in xml format. To download FDA labels click here.

To understand this approach, we need to talk about a variety of concepts and how they can help us to reach our objectives:

Semantic Web

The Semantic Web is a Web 3.0 technology. It is a way of connecting data between entities or systems that allows for rich, self-describing interactions of data available worldwide across the Internet. Nowadays, the majority of information provided by the Internet is delivered in the form of web pages. These documents are linked to each other through hyperlinks. Humans or machines can read these documents. But machines, other than finding keywords on a page, have difficulty extracting any meaning from them.

The semantic web will open the web of data to artificial intelligence processes. It seeks to encourage people to publish their data in an open standard format, and at the same time it encourages Internet users to analyze these data and gain knowledge.

The Graph Database

The graph database is the way the semantic web stores data. The Resource Description Framework, or RDF, provides the building blocks for forming the web of semantic data, and a database that stores RDF data (a triple-store) is a type of graph database.

Data can be stored in the form of triples. A triple is an RDF statement broken into its three constituent parts: the subject, the predicate (or property), and the object. For example, suppose we want to define the color of a capsule for a medicinal product:

In terms of this simple graph, the subject is the capsule; the predicate (or property) is color; and the object is red. That’s why this is called a “triple”, and the information is stored in triples.
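To make the idea concrete, here is a minimal sketch in base R that holds a few triples in a data frame, using the same “s”, “p”, “o” column names that appear later in this post. The identifiers (capsule_1, capsule_2) and the extra “shape” statement are invented for illustration and do not come from any real vocabulary or triple-store:

```r
# Each row is one triple: subject, predicate, object.
# All values are illustrative only.
triples <- data.frame(
  s = c("capsule_1", "capsule_1", "capsule_2"),
  p = c("color", "shape", "color"),
  o = c("red", "oblong", "blue"),
  stringsAsFactors = FALSE
)

# Pattern query: which subjects have the predicate "color" with object "red"?
red_subjects <- triples$s[triples$p == "color" & triples$o == "red"]
print(red_subjects)

# All statements about one subject
print(triples[triples$s == "capsule_1", ])
```

A real triple-store uses globally unique identifiers instead of short strings and a query language such as SPARQL instead of row filtering, but the shape of the data is the same.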

Semantic Modeling

RDF offers a flexible, graph-based model for recording data that is interchangeable globally, and this is the beauty of it. However, it does not offer any means of recording semantics, or meaning.

We want to include semantics in data, for the purpose of knowledge integration. One of the most important benefits of adding semantic meaning to our data is that it can be bridged across domains of knowledge automatically. For example, suppose we have two websites: one stores information about product labels, including all adverse reactions, and the other stores information about treatments given to a specific group of patients. Although these two sites have been created independently, the information they provide is complementary.

In principle, data cannot be shared between the two sites simply by joining tables in their databases. This is because they have been designed independently, and because they use different database server systems, which are not cross-compatible. This type of information interchange across incompatible, independently defined data systems takes time, money, and human contextual interpretation of the different sources of data. It is also limited to these two websites / datasets. Any further additions to their knowledge from elsewhere would require a similar effort.

With the introduction of semantics and RDF, all this is much easier to do. How do we model the two-site scenario using semantic modeling? To begin with, the two sites need to apply a common, standard vocabulary (a collection of terms with a well-defined meaning that is consistent across contexts). This can be done if the two sites adopt the same ontology (which defines the contextual relationships behind a defined vocabulary) to express the meaning behind the data they expose, and publish the data on an endpoint which can be queried, so that the sites can communicate with each other across the web.
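The two-site scenario can be sketched in a few lines of base R. The assumption here is that both sites already publish triples and name drugs with the same identifiers. The predicate names (hasAdverseReaction, treatedWith) and all the data are hypothetical placeholders, not terms from a real ontology:

```r
# Site 1: product labels, expressed as triples (placeholder data)
labels <- data.frame(
  s = c("drug_a", "drug_b"),
  p = "hasAdverseReaction",
  o = c("acute kidney injury", "rash"),
  stringsAsFactors = FALSE
)

# Site 2: patient treatments, using the same drug identifiers (placeholder data)
treatments <- data.frame(
  s = c("patient_1", "patient_2"),
  p = "treatedWith",
  o = c("drug_a", "drug_c"),
  stringsAsFactors = FALSE
)

# Step 1: drugs whose label lists acute kidney injury
aki_drugs <- labels$s[labels$p == "hasAdverseReaction" &
                        labels$o == "acute kidney injury"]

# Step 2: patients treated with any of those drugs. This works only
# because both sites refer to each drug by the same identifier.
exposed <- treatments$s[treatments$p == "treatedWith" &
                          treatments$o %in% aki_drugs]
print(exposed)
```

If the two sites used different names for the same drug, step 2 would silently return nothing, which is exactly the problem a shared vocabulary is meant to solve.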

NOTE: Currently you can download from the Web thousands of databases encoded as triples. Among the largest ones we highlight DBpedia, which is the triple-store version of Wikipedia.

Example Applied to Drug Safety – Drug Induced Acute Kidney Injury

In this example we can see how different databases, each containing partial but conceptually interconnected health-related information, can be linked for knowledge discovery.

Data from SNOMED (global common language for health terms), MeSH (Medical Subject Headings, a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences), SIDER (Side Effect Resource), DailyMed (drug brand names and FDA product labels), ClinicalTrials.gov (web data source for clinical trials), DrugBank (comprehensive data about medicinal products), and the Diseasome (integrated database of genes, genetic variation, and diseases), along with any patient record data, or even PubMed data can be interlinked and queried. It opens a myriad of opportunities. And this is just a small example of what we can do.

Graphic-Based, Triple-Store Browser

We are going to use a tool to display visual graphs of subsets of a store’s nodes and their links. It is an interactive tool for browsing, querying, and editing triple-stores, also known as graph databases.

In the previous post of this series, we found 8,916 PubMed abstracts for the search of drug induced acute kidney injury. We downloaded all the abstracts as an XML file. Some applications are able to extract triples from such a file in a way that allows us to analyze them graphically. In this case, we got 2,705,300 triples from the mentioned PubMed results.

A simple example is shown in the next picture. We wanted to know how many abstracts were talking about “acute kidney injury”. When we searched for that keyword, the tool delivered 599 nodes (abstracts) and 300 links:

Abstracts are represented by yellow boxes

If we zoom in, we will see this:

Some abstracts using the keyword “acute kidney injury”

Let’s see what the triples look like in this graph database. Remember that we have converted the XML file into a triple-store, and that triples consist of Subject, Predicate, Object. Below is the list of the 83 predicates extracted from the PubMed XML file we are using for this example, in alphabetical order:

Follow this link if you want to learn more on PubMed XML Element Descriptions and their attributes.

These predicates are the properties of each one of the articles we retrieved. The subject is a unique identifier for each article, the predicate is one from the list above, and the object is the value of that predicate for that specific article.

Some sample triples from the dataset are shown here:

Next, we can see the first triples of the dataset, where column “s” is for subjects, “p” is for predicates, and “o” is for objects:

The first subject is _:bE83C8647x3432, which corresponds to a specific article. The corresponding predicate (property) is UI, and the object is D016428. In case you want to know, this element is used to identify the type of article indexed by MEDLINE. There is a code for each type of article. “D016428” is the code for the publication type “Journal Article”. Records may contain more than one publication type. In our case, this record contains just one publication type. In XML, it looks like this:

<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
….
</PublicationTypeList>
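To show how such a line maps onto a triple, here is a small sketch in base R. It pulls the attribute name and value out of that single line with a regular expression and pairs them with the subject identifier mentioned above. This is only an illustration of the mapping: the tool used for this post did the conversion itself, and a real conversion should rely on a proper XML parser, not on a regular expression.

```r
xml_line <- '<PublicationType UI="D016428">Journal Article</PublicationType>'

# Capture the attribute name, the attribute value and the element text.
pattern <- '<PublicationType ([A-Za-z]+)="([^"]+)">([^<]+)</PublicationType>'
parts <- regmatches(xml_line, regexec(pattern, xml_line))[[1]]

# parts[1] is the whole match; the captured groups follow it.
triple <- data.frame(
  s = "_:bE83C8647x3432",  # subject: the article identifier quoted in the text
  p = parts[2],            # predicate: the attribute name
  o = parts[3],            # object: the attribute value
  stringsAsFactors = FALSE
)
type_label <- parts[4]     # human-readable label of the publication type

print(triple)
print(type_label)
```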

When we click on “_:bE83C8647x3432”, we get all the statements with that code as the subject. The view shows the predicates associated with it, and the objects associated with those predicates:

In this post, we have talked about knowledge graphs, semantic web, triples, and have shown some of them, applied directly to our PubMed search on drug induced acute kidney injury. The next post will show more about it, and more results from including other, completely different, sources of data.

Jose Rossello

Mining PubMed for Drug Induced Acute Kidney Injury

March 11, 2019 by Jose Rossello

Enhancing signal detection capabilities beyond regular literature search

Methods and tools for data mining and all its variants, namely text mining and web mining, are emerging at cosmic speed. But their implementation in pharmacovigilance and pharmacoepidemiology is still in its early stages.

The aim of this post is to explore and apply some of the current methods and tools using PubMed as the primary source for text mining. For this exercise I have chosen to mine PubMed abstracts for drug-induced acute kidney injury.

Searching for abstracts in PubMed

For this purpose, I used the PubMed Advanced Search Builder, which generated this search string: “(drug induced) AND acute kidney injury”, as shown here:

If you want to go directly to the results from that search, you can use https://www.ncbi.nlm.nih.gov/pubmed?term=(drug%20induced)%20AND%20acute%20kidney%20injury

At the time of writing this post, there were 8,916 results from that search. The next step was to download all the abstracts into a text file, as shown in this screenshot:

Mining Abstracts with pubmed.mineR

Obviously, nobody has the time to read nearly nine thousand abstracts. And even if we had the time, we would not have the ability, as human beings, to digest and integrate all this knowledge.

To help us with the task of knowledge discovery, we are going to use some R applications to mine the text we have extracted. And this is when the fun begins.

The R package we will use here is pubmed.mineR. The latest information on this package can be found here. To run the code I have used RStudio.

The pubmed.mineR package has many capabilities, most of which are not shown here. I have picked the ones that seem most interesting for pharmacovigilance mining.

The initial code is shown below. In this post, code has a gray background, and the output a light blue background.

It starts by installing the package, and setting up the directory on your computer for input-output. I have used mine, but you will have to change it for your own path. The next step is to call the library.

# Install package (only needed once):
install.packages("pubmed.mineR")
# Set directory:
setwd("D:/PharmacovigilanceAnalytics.com/pubmed.mineR")
# Call library(ies)
library(pubmed.mineR)
library(data.table)
# readabs will automatically read the abstracts from the PubMed file (pubmed_result.txt) and will return an S4 object which I named 'akidrug'
akidrug <- readabs("pubmed_result.txt")
# printing first and last abstracts from akidrug:
printabs(akidrug)

The output resulting from ‘printabs(akidrug)’ is here, showing the first and the last abstracts:

Number of Abstracts 8916
Starts with
Renal Damaging Effect Elicited by Bicalutamide Therapy Uncovered Multiple Action Mechanisms As Evidenced by the Cell Model. Peng CC(1), Chen CY(2), Chen CR(3), Chen CJ(2), Shen KH(4)(5), Chen KC(6)(7)(8), Peng RY(9). Author information: (1)Graduate Institute of Clinical Medicine, School of Medicine, College of Medicine, Taipei Medical University, 250 Wu-Hsing Street, Taipei, 11031, Taiwan. (2)Wayland Academy, 101 North University Avenue, Beaver Dam, WI, 53916, USA. (3)International Medical Doctor Program, The Vita-Salute San Raffaele University, Via Olgettina 58, 20132, Milano, Italy. (4)Division of Urology, Department of Surgery, Chi Mei Medical Center, Tainan, 710, Taiwan. (5)Department of Optometry, College of Medicine and Life Science, Chung Hwa University of Medical Technology, Tainan, 717, Taiwan. (6)Graduate Institute of Clinical Medicine, School of Medicine, College of Medicine, Taipei Medical University, 250 Wu-Hsing Street, Taipei, 11031, Taiwan. kuanchou@tmu.edu.tw. (…
ADT-induced hypogonadism was reported to have the potential to lead to acute kidney injury (AKI).
ADT was also shown to induce bladder fibrosis via induction of the transforming growth factor (TGF)-β level.

Ends with
[APROPOS OF 8 CASES OF CARBON TETRACHLORIDE POISONING]. [Article in French] VEREERSTRAETEN P, VERNIORY A, VEREERSTRAETEN J, TOUSSAINT C, VERBANCK M, LAMBERT PP. NA NA

Word atomization

One thing we can do is determine word frequencies. For this purpose, pubmed.mineR provides the function “word_atomizations”:

akidrug_words <- word_atomizations(akidrug)
# Print the first 10 words by frequency
akidrug_words[1:10,]

The following table shows the ten most frequent words. As expected, they refer to the acute kidney injury aspect of our PubMed search. Keep in mind that word counting is one of the fundamental building blocks of text mining, and it still offers important research opportunities. I suggest analyzing, from the list generated by this example, word counts that are not as obvious as “renal”, “kidney”, or “patient” for this specific type of search.

ID Number	Word	Frequency
53805	renal	19824
18468	acute	9478
40387	kidney	8584
38691	injury	8236
49268	patients	7712
53138	rats	5519
32372	failure	5451
60217	treatment	4861
34861	group	4004
60509	tubular	3701
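For readers who want to see what word counting involves, here is a minimal base R sketch on two made-up sentences. It is not how pubmed.mineR implements word_atomizations, and the sample text is invented, so the counts have nothing to do with the table above:

```r
# Two short made-up sentences standing in for abstracts
texts <- c("Acute kidney injury after drug exposure.",
           "Drug induced renal injury in rats.")

# Lower-case, split on anything that is not a letter, drop empty strings
words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each word and sort by decreasing frequency
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

A real analysis would also remove very common words (“in”, “after”) before counting, since otherwise they crowd out the informative terms.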

Gene atomization

Gene atomization will automatically fetch the genes (HGNC approved Symbol) from the text and report their frequencies.

# If you remember, akidrug is the S4 object holding the collection of abstracts. akidrug_gene will be the collection of genes found in those abstracts
akidrug_gene <- gene_atomization(akidrug)
# Next, we will obtain a subset of akidrug_gene containing 2 variables, one for the gene symbol and the other for the frequency
genes_table <- subset(akidrug_gene, select = c("Gene_symbol", "Freq"))
# Next, we prepare the whole gene database. The complete set can be obtained from the HGNC site.
hgnc <- read.delim("D:/PharmacovigilanceAnalytics.com/pubmed.mineR/hgnc_complete_set.txt",
                   header = TRUE, stringsAsFactors = FALSE)

We want to extract, from the PubMed abstracts, the sentences that contain aliases of human genes:

alias_fn(genes_table, hgnc, akidrug, "output", c("drug induced", "acute kidney injury", "adverse event"))

A sample from the results (saved to “outputalias”) is shown here:

TNF TNF-alpha
C3 C3b
PAH PH
PARP1 PARP
26184635
However, it is still unclear whether PARP overactivation happens during acute kidney injury (AKI) caused by endotoxic shock (ES).
¹

And another one:

BAK1 BAK
CD5 T1
CR1 KN
ICAM1 CD54
IL18 IL-18
30531196
Other biomarkers of drug-induced kidney toxicity that have been detected in the urine of rodents or patients include IL-18 (interleukin-18), NGAL (neutrophil gelatinase-associated lipocalin), Netrin-1, liver type fatty acid binding protein (L-FABP), urinary exosomes, and TIMP2 (insulin-like growth factor -binding protein 7)/IGFBP7 (insulin-like growth factor binding protein 7), also known as NephroCheck®, the first FDA-approved biomarker testing platform to detect acute kidney injury (AKI) in patients.
²

1. Liu S, Liu J, Liu D, Wang X, Yang R. Inhibition of Poly-(ADP-Ribose) Polymerase Protects the Kidney in a Canine Model of Endotoxic Shock. Nephron. 2015;130(4):281-292. https://www.ncbi.nlm.nih.gov/pubmed/26184635.
2. Griffin B, Faubel S, Edelstein C. Biomarkers of drug-induced kidney toxicity. Ther Drug Monit. December 2018. https://www.ncbi.nlm.nih.gov/pubmed/30531196.

Literature Curation with PubTator Functionality

PubTator is a Web-based tool for accelerating manual literature curation (e.g. annotating biological entities and their relationships) through the use of advanced text-mining techniques. As an all-in-one system, PubTator provides one-stop service for annotating PubMed citations.

pubmed.mineR has a PubTator function. It takes a PMID as input and returns results regarding chemicals, diseases, genes, and mutations, if they are referenced in the article. We are going to use the article by Griffin (see reference 2 above, PMID: 30531196). Let’s try it and see what happens:

# Run PubTator function on PMID 30531196 and save results on pubtator_output:
pubtator_output <- pubtator_function(30531196)
# Print PubTator output for chemicals, diseases, genes, and mutations:
pubtator_output$Chemicals
pubtator_output$Diseases
pubtator_output$Genes
pubtator_output$Mutations

Results are here:

Literature Curation with PubTator Functionality

There are many other pubmed.mineR functionalities. I encourage the reader to explore them and share findings in the comments section of this post.

Exploration of other R packages.
Articles Published by Year and Word Cloud

This section is inspired by the code presented here.

library(RISmed)
library(dplyr)
library(ggplot2)
library(tidytext)
library(wordcloud)

# Search PubMed and fetch the matching records
result <- EUtilsSummary("(drug induced) AND acute kidney injury",
                        type = "esearch",
                        db = "pubmed",
                        datetype = "pdat",
                        retmax = 30000,
                        mindate = 1960,
                        maxdate = 2019)
fetch <- EUtilsGet(result, type = "efetch", db = "pubmed")

# One row per article, identified by its PMID
abstracts <- data.frame(title = fetch@ArticleTitle,
                        abstract = fetch@AbstractText,
                        journal = fetch@Title,
                        pmid = fetch@PMID,
                        year = fetch@YearPubmed)
abstracts <- abstracts %>% mutate(abstract = as.character(abstract))
abstracts %>%
  head()

# Articles published per year
abstracts %>%
  group_by(year) %>%
  count() %>%
  filter(year > 1959) %>%
  ggplot(aes(year, n)) +
  geom_point() +
  geom_line() +
  labs(title = "Pubmed articles with search terms (drug induced) AND acute kidney injury \n1960-2019",
       y = "Articles") +
  theme(plot.title = element_text(hjust = 0.5))

# Word frequencies across all abstracts, excluding common stop words
cloud <- abstracts %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Draw the word cloud; scale sets the size range and rot.per the share of rotated words
cloud %>%
  with(wordcloud(word, n, min.freq = 15, max.words = 500,
                 colors = brewer.pal(8, "Dark2"),
                 scale = c(8, .3), rot.per = 0.4))

word cloud for drug-induced acute kidney injury

This is the first of a series of posts analyzing text mining applications for PubMed. The second one explores knowledge graphs and semantic analytics.