What are the built-in data and how are they used?

Notes:
  • All results are based on dnet (version 1.0.7).
  • R scripts (i.e. R expressions) and their accompanying comments are highlighted with a light-cyan background; everything else is screen output.
  • Images displayed below may appear distorted here, but should display normally on your screen.
  • Functions contained in dnet 1.0.7 are hyperlinked in-place and also listed on the right side.
  • Key texts are underlined, in bold and in pumpkin-orange color.
    # These built-in data are the backend of the analytical utilities in the dnet package, and span a wide range of known gene-centric knowledge across well-studied organisms. They are provided as RData-formatted files that are regularly updated, and new knowledge is added over time, for example upon user request. The built-in RData are summarised on the RData page.
    # Usually, users do not need to download these files themselves. Instead, users are encouraged to find what they need by looking up keywords on the Documentations page. The package has functions to import the data or to work with them directly.
    # The function dRDataLoader imports the data of interest.
    # For ease of use, organism-specific data names start with 'org', followed by the organism code ('Hs' for human) and then the data content: 'eg' alone means information about Entrez Genes, and a further suffix (for example, 'GOBP') means annotations of those genes by Gene Ontology Biological Process (GOBP).
    ## load human Entrez Genes (EG), and list the first 3 genes
    org.Hs.eg <- dRDataLoader(RData='org.Hs.eg')
    'org.Hs.eg' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.eg.RData) has been loaded into the working environment
    org.Hs.eg$gene_info[1:3,]
      GeneID Symbol                        description chromosome map_location
    1      1   A1BG             alpha-1-B glycoprotein         19      19q13.4
    2      2    A2M              alpha-2-macroglobulin         12     12p13.31
    3      3  A2MP1 alpha-2-macroglobulin pseudogene 1         12     12p13.31
                       Synonyms
    1      A1B|ABG|GAB|HYST2477
    2 A2MD|CPAMD5|FWP007|S863-7
    3                      A2MP
                                                                                dbXrefs
    1 MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|HPRD:00726|Vega:OTTHUMG00000183507
    2 MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|HPRD:00072|Vega:OTTHUMG00000150267
    3                                               HGNC:HGNC:8|Ensembl:ENSG00000256069
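The naming convention described above can be sketched in base R. Both helpers below are hypothetical and are not part of dnet; they only assemble a name and a download URL following the pattern visible in the loader messages (assumed here to hold for version 1.0.7), and they do not check that the file exists.

```r
# Hypothetical helper (not part of dnet): assemble a built-in RData name
# from the organism code and an optional annotation suffix.
build_rdata_name <- function(organism = "Hs", content = "") {
  paste0("org.", organism, ".eg", content)
}

# Assumption: URL pattern copied from the loader messages for version 1.0.7.
build_rdata_url <- function(rdata, version = "1.0.7") {
  paste0("http://supfam.org/dnet/RData/", version, "/", rdata, ".RData")
}

genes_name <- build_rdata_name("Hs")          # Entrez Genes only
gobp_name  <- build_rdata_name("Hs", "GOBP")  # genes annotated by GOBP
gobp_url   <- build_rdata_url(gobp_name)
print(c(genes_name, gobp_name, gobp_url))
```

This sketch covers only the 'eg' family of names; data such as 'org.Hs.string' follow a different suffix and would need their own handling.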
    ## load annotations of human Entrez Genes by Gene Ontology Biological Process (GOBP), inspect the content and list the first 3 terms
    org.Hs.egGOBP <- dRDataLoader(RData='org.Hs.egGOBP')
    'org.Hs.egGOBP' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egGOBP.RData) has been loaded into the working environment
    names(org.Hs.egGOBP)
    [1] "gs" "set_info"
    org.Hs.egGOBP$set_info[1:3,]
                    setID                             name namespace distance
    GO:0000002 GO:0000002 mitochondrial genome maintenance   Process        6
    GO:0000003 GO:0000003                     reproduction   Process        2
    GO:0000011 GO:0000011              vacuole inheritance   Process        6
    ## load annotations of human Entrez Genes by Disease Ontology (DO), inspect the content and list the first 3 terms
    org.Hs.egDO <- dRDataLoader(RData='org.Hs.egDO')
    'org.Hs.egDO' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egDO.RData) has been loaded into the working environment
    org.Hs.egDO$set_info[1:3,]
                        setID                  name        namespace distance
    DOID:0001816 DOID:0001816          angiosarcoma Disease_Ontology        5
    DOID:0002116 DOID:0002116             pterygium Disease_Ontology        7
    DOID:0014667 DOID:0014667 disease of metabolism Disease_Ontology        1
    ## load phylostratific age (PS) information of human Entrez Genes, inspect the content and list all our ancestors
    org.Hs.egPS <- dRDataLoader(RData='org.Hs.egPS')
    'org.Hs.egPS' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egPS.RData) has been loaded into the working environment
    org.Hs.egPS$set_info
       setID                  name    namespace   distance
    3      3        2759:Eukaryota superkingdom 0.00000000
    4      4    33154:Opisthokonta      no rank 0.02227541
    5      5    33154:Opisthokonta      no rank 0.02677301
    6      6    33154:Opisthokonta      no rank 0.03026936
    7      7    33154:Opisthokonta      no rank 0.03573534
    8      8    33154:Opisthokonta      no rank 0.03880849
    9      9         33208:Metazoa      kingdom 0.04949159
    10    10         33208:Metazoa      kingdom 0.06686750
    11    11         33208:Metazoa      kingdom 0.09260898
    12    12        6072:Eumetazoa      no rank 0.10459007
    13    13        6072:Eumetazoa      no rank 0.11176118
    14    14       33213:Bilateria      no rank 0.12058364
    15    15       33213:Bilateria      no rank 0.12660301
    16    16   33511:Deuterostomia      no rank 0.13884801
    17    17   33511:Deuterostomia      no rank 0.14852778
    18    18         7711:Chordata       phylum 0.15759842
    19    19       7742:Vertebrata      no rank 0.16953129
    20    20   117571:Euteleostomi      no rank 0.18295445
    21    21    8287:Sarcopterygii      no rank 0.18554672
    22    22       32523:Tetrapoda      no rank 0.18855901
    23    23         32524:Amniota      no rank 0.19241034
    24    24        40674:Mammalia        class 0.19552877
    25    25          32525:Theria      no rank 0.19917128
    26    26         9347:Eutheria      no rank 0.20262687
    27    27 1437010:Boreoeutheria      no rank 0.20409224
    29    29         9443:Primates        order 0.20521882
    30    30         9443:Primates        order 0.20708817
    32    32    314293:Simiiformes   infraorder 0.21351030
    33    33       9526:Catarrhini    parvorder 0.21636349
    34    34     314295:Hominoidea  superfamily 0.21875281
    35    35        9604:Hominidae       family 0.22019688
    36    36      207598:Homininae    subfamily 0.22313298
    37    37     9606:Homo sapiens      species 0.23340461
    ## load domain superfamily (SF) information of human Entrez Genes, inspect the content and list the first 3 superfamilies
    org.Hs.egSF <- dRDataLoader(RData='org.Hs.egSF')
    'org.Hs.egSF' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egSF.RData) has been loaded into the working environment
    org.Hs.egSF$set_info[1:3,]
          setID                         name namespace distance
    46458 46458                  Globin-like        sf    a.1.1
    46548 46548     alpha-helical ferredoxin        sf    a.1.2
    46561 46561 Ribosomal protein L29 (L29p)        sf    a.2.2
    ## load KEGG pathways for human Entrez Genes, inspect the content and list the first 3 pathways
    org.Hs.egMsigdbC2KEGG <- dRDataLoader(RData='org.Hs.egMsigdbC2KEGG')
    'org.Hs.egMsigdbC2KEGG' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egMsigdbC2KEGG.RData) has been loaded into the working environment
    org.Hs.egMsigdbC2KEGG$set_info[1:3,]
            setID                                 name namespace
    M10462 M10462 KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY        C2
    M1053   M1053      KEGG_HEDGEHOG_SIGNALING_PATHWAY        C2
    M10680 M10680                      KEGG_PROTEASOME        C2
                                  distance
    M10462 Adipocytokine signaling pathway
    M1053       Hedgehog signaling pathway
    M10680                      Proteasome
    ## load the network for human Entrez Genes as an 'igraph' object
    org.Hs.string <- dRDataLoader(RData='org.Hs.string')
    'org.Hs.string' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.string.RData) has been loaded into the working environment
    org.Hs.string
    IGRAPH UN-- 18492 728141 --
    + attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
    | description (v/c), neighborhood_score (e/n), fusion_score (e/n),
    | cooccurence_score (e/n), coexpression_score (e/n), experimental_score
    | (e/n), database_score (e/n), textmining_score (e/n), combined_score
    | (e/n)
    + edges (vertex names):
     [1] 3025671--3031737 3021358--3027795 3021358--3027929 3021358--3027741
     [5] 3024278--3029186 3029373--3031543 3031543--3034385 3019006--3030823
     [9] 3015391--3030823 3021191--3028634 3021191--3024550 3021191--3033402
    [13] 3031876--3031959 3015324--3031959 3023108--3033954 3016546--3020273
    + ... omitted several edges
    ## This network is extracted from the STRING database. Only associations with at least medium confidence (score>=400) are retained. Users can further restrict the network to edges with high confidence (score>=700, for example)
    network <- igraph::subgraph.edges(org.Hs.string, eids=E(org.Hs.string)[combined_score>=700])
    network
    IGRAPH UN-- 15341 316170 --
    + attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
    | description (v/c), neighborhood_score (e/n), fusion_score (e/n),
    | cooccurence_score (e/n), coexpression_score (e/n), experimental_score
    | (e/n), database_score (e/n), textmining_score (e/n), combined_score
    | (e/n)
    + edges (vertex names):
     [1] 3017550--3023854 3023931--3028317 3019304--3028317 3028317--3033319
     [5] 3023602--3028317 3014709--3028317 3024678--3026825 3023468--3030905
     [9] 3026117--3030905 3026845--3029085 3017265--3027473 3015527--3033837
    [13] 3019960--3033973 3021862--3033174 3015979--3025568 3015355--3025568
    + ... omitted several edges
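The drop in both edge and vertex counts after thresholding follows from how an edge-induced subgraph works: by default, igraph's subgraph.edges also removes vertices left without any retained edge. A minimal base R sketch of the same idea, using a made-up toy edge list rather than the STRING data:

```r
# Toy edge list standing in for the STRING network (all values are made up).
edges <- data.frame(
  from = c("A", "A", "B", "C", "D"),
  to   = c("B", "C", "C", "D", "E"),
  combined_score = c(950, 420, 710, 650, 400),
  stringsAsFactors = FALSE
)

# Keep only high-confidence edges (score >= 700).
high <- edges[edges$combined_score >= 700, ]

# Nodes that still have at least one edge; the rest drop out of the subgraph.
nodes_all  <- unique(c(edges$from, edges$to))
nodes_kept <- unique(c(high$from, high$to))

cat(nrow(high), "of", nrow(edges), "edges kept;",
    length(nodes_kept), "of", length(nodes_all), "nodes kept\n")
```

Here two of five edges survive the cut-off, and nodes D and E disappear because none of their edges reach the threshold.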
    # In addition to data import, the package also has functions (see below) that work with these data directly. In these functions, users only need to specify which genome/organism and which ontology to use.
    # Here, we use the human TCGA mutation dataset as an example
    data(TCGA_mutations)
    symbols <- as.character(fData(TCGA_mutations)$Symbol)
    ## Enrichment analysis using Disease Ontology (DO)
    data <- symbols[1:100] # select the first 100 human genes
    eTerm <- dEnricher(data, identity="symbol", genome="Hs", ontology="DO")
    Start at 2015-07-21 17:49:38
    First, load the ontology DO and its gene associations in the genome Hs (2015-07-21 17:49:38) ...
    'org.Hs.eg' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.eg.RData) has been loaded into the working environment
    'org.Hs.egDO' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egDO.RData) has been loaded into the working environment
    Then, do mapping based on symbol (2015-07-21 17:49:38) ...
    Among 100 symbols of input data, there are 100 mappable via official gene symbols but 0 left unmappable
    Third, perform enrichment analysis using HypergeoTest (2015-07-21 17:49:39) ...
    There are 917 terms being used, each restricted within [10,1000] annotations
    Last, adjust the p-values using the BH method (2015-07-21 17:49:41) ...
    End at 2015-07-21 17:49:42
    Runtime in total is: 4 secs
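The log above names two statistical steps: a hypergeometric test per term, followed by BH adjustment of the p-values. The base R sketch below illustrates those two steps on made-up counts. It is not dEnricher's implementation, and the background size, term names and overlaps are all assumptions chosen for illustration.

```r
# Toy inputs (all numbers are made up for illustration).
N <- 20000   # genes in the background
n <- 100     # genes in the input list
terms <- data.frame(
  term    = c("termA", "termB", "termC"),
  K       = c(200, 500, 50),   # background genes annotated to the term
  overlap = c(8, 3, 0),        # input genes annotated to the term
  stringsAsFactors = FALSE
)

# One-sided hypergeometric test: P(X >= overlap) when drawing n genes
# from a background containing K annotated and N - K unannotated genes.
terms$pvalue <- phyper(terms$overlap - 1, terms$K, N - terms$K, n,
                       lower.tail = FALSE)

# Benjamini-Hochberg adjustment across all tested terms.
terms$adjp <- p.adjust(terms$pvalue, method = "BH")
print(terms)
```

Subtracting 1 from the overlap matters: phyper with lower.tail = FALSE returns P(X > q), so q = overlap - 1 gives the probability of seeing at least the observed overlap.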
    ## gene set enrichment analysis (GSEA) using KEGG pathways
    tol <- apply(exprs(TCGA_mutations), 1, sum) # calculate the total mutations for each gene
    data <- data.frame(tol=tol)
    eTerm <- dGSEA(data, identity="symbol", genome="Hs", ontology="MsigdbC2KEGG")
    Start at 2015-07-21 17:49:45
    First, load the ontology MsigdbC2KEGG and its gene associations in the genome Hs (2015-07-21 17:49:45) ...
    'org.Hs.eg' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.eg.RData) has been loaded into the working environment
    'org.Hs.egMsigdbC2KEGG' (from http://supfam.org/dnet/RData/1.0.7/org.Hs.egMsigdbC2KEGG.RData) has been loaded into the working environment
    Then, do mapping based on symbol (2015-07-21 17:49:45) ...
    Among 19420 symbols of input data, there are 19038 mappable via official gene symbols but 382 left unmappable
    Third, perform GSEA analysis (2015-07-21 17:49:50) ...
    Sample 1 is being processed at (2015-07-21 17:49:50) ...
    100 of 186 gene sets have been processed
    186 of 186 gene sets have been processed
    End at 2015-07-21 17:50:45
    Runtime in total is: 60 secs
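For readers unfamiliar with GSEA, the sketch below shows the classic score-weighted running-sum statistic on a made-up ranking of six genes. Whether dGSEA uses exactly this weighting is an assumption; the gene names, scores and gene set are invented for illustration only.

```r
# Toy per-gene scores (made up); higher means more mutated.
scores <- c(g1 = 9, g2 = 7, g3 = 6, g4 = 4, g5 = 3, g6 = 1)
gene_set <- c("g1", "g2", "g5")

# Rank genes from highest to lowest score.
ranked <- sort(scores, decreasing = TRUE)
in_set <- names(ranked) %in% gene_set

# Running sum: step up (weighted by score) at set members,
# step down by a constant amount at non-members.
hit  <- cumsum(ifelse(in_set, abs(ranked), 0)) / sum(abs(ranked[in_set]))
miss <- cumsum(!in_set) / sum(!in_set)
running <- hit - miss

# Enrichment score: the running-sum value with the largest absolute deviation.
es <- running[which.max(abs(running))]
print(round(running, 3))
print(es)
```

In this toy ranking the running sum peaks right after the second gene, because the two top-ranked genes both belong to the set. In practice, significance is usually assessed by comparing the score against permutations, a step omitted here.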

    Source faq

    FAQ2.r
