Big Data, Pathway Analysis & Graph Databases

                                                                 by Sandeep Shantharam

What is Big Data? How Big is Big?


“Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” [1]
According to Michael Stonebraker (MIT, works on VoltDB), Big Data means:
  1. Big Volume
  2. Big Velocity
  3. Big Variety

According to Google BigQuery, Big Data Means 500 Million Rows.

[1] Undefined By Data: A Survey of Big Data Definitions - http://arxiv.org/pdf/1309.5821v1.pdf

Big Data and Bioinformatics


Big Data in bioinformatics is not the same as in other fields.

[2] Cost Per Raw MegaBase of DNA Sequence Graph -  http://www.genome.gov/sequencingcosts/

Is Bioinformatics really Big Data?

Yes

Big Data + Bio-Informatics

Big Data will not be about sequencing genomes and proteins alone. It will also be about:
  • Software tools adapted to biological needs.
  • Databases that suit biological data.
  • Collaboration between various groups.

Advantages at IUPUI -
  1. Computation for Big Data processing.
  2. File Storage systems scaling to Petabytes.
  3. Biological Patient Data.*
  4. Semantic Approach to Data.
[3] Gene-analysis firms reach for the cloud -  http://www.nature.com/news/gene-analysis-firms-reach-for-the-cloud-1.12634

Pathway Analysis

[4] Introduction to Pathway Analysis - http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf

Tools in Pathway Analysis

Pathway tools with Functional discovery[6] - 
  1. First Generation: Over-Representation Analysis (ORA) tools
  2. Second Generation: Functional Class Scoring (FCS) tools
  3. Third Generation: Pathway Topology (PT) tools
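
The first-generation approach can be sketched in a few lines. This is only an illustration of the usual ORA idea (a hypergeometric upper-tail test on the overlap between a pathway and a list of selected genes), not the method of any particular tool; the function name and the gene counts are made up for the example.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, selected_size, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    universe_size: number of genes measured (background)
    pathway_size:  number of background genes annotated to the pathway
    selected_size: number of genes in the selected (e.g. differentially
                   expressed) list
    overlap:       number of selected genes that are in the pathway
    """
    max_overlap = min(pathway_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes measured, 5 in the pathway,
# 4 selected, 3 of the selected genes fall in the pathway.
p = ora_p_value(universe_size=20, pathway_size=5, selected_size=4, overlap=3)
print(round(p, 4))
```

A small p-value suggests the pathway is over-represented in the selected list. Real ORA tools also correct for testing many pathways at once, which this sketch leaves out.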

Pathway tools with Data Integration -
  1. WikiPathways (www.wikipathways.org)
  2. MetaCore (www.genego.com)
  3. Pathview (http://pathview.r-forge.r-project.org/)
[6] Khatri P, Sirota M, Butte AJ (2012) Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol 8(2): e1002375. doi:10.1371/journal.pcbi.1002375

Biological Pathway Building Process

[7] Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC (2008) Getting started in biological pathway construction and analysis. PLoS Comput Biol 4(2): e16. doi:10.1371/journal.pcbi.0040016

Challenges in Pathway Analysis

Annotation Challenges -

  • Low-resolution knowledge bases
  • Incomplete and inaccurate annotations
  • Missing condition- and cell-specific information

Methodological Challenges -

  • Lack of benchmark data sets for comparing different methods
  • Inability to model and analyze dynamic response
  • Inability to model effects of external stimuli

Computing Issues in Bioinformatics

  • Computation
    • algorithms, analysis, simulation, etc.
  • Visualization
    • seeing all of the data
  • Data management
    • storing and manipulating all of the data

Database Tools in Bioinformatics

Database Publications in Bioinformatics [8]
Database Categories in Bioinformatics:
  1. SQL databases like MySQL, Oracle
  2. NoSQL databases like MongoDB, RDF datastores and Neo4j
[8] MetaBase—the wiki-database of biological databases - http://nar.oxfordjournals.org/content/40/D1/D1250

Graph Databases

NoSQL databases that use graph structures with nodes, edges, and properties to represent and store data.
No Index Lookups, Scale Free, No Schema
[9] Graph Databases Book -  http://graphdatabases.com/

Neo4j

A true graph database that, unlike a triplestore, does not restrict data to triples.
Neo4j stores data in nodes connected by directed, typed relationships with properties on both, also known as a Property Graph.
Neo4j is open source, and its community provides many tool interfaces and language bindings to program against.
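
To make the Property Graph idea concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j stores data internally; the class names, the ACTIVATES relationship type and the sample properties are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int      # id of the node the relationship leaves
    end: int        # id of the node the relationship points to
    rel_type: str   # relationships are typed, e.g. "ACTIVATES"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Nodes and directed, typed relationships; both carry properties."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of outgoing Relationship

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing[start].append(rel)
        return rel

    def neighbours(self, node_id, rel_type=None):
        """Follow outgoing relationships, optionally filtered by type."""
        return [
            self.nodes[rel.end]
            for rel in self.outgoing[node_id]
            if rel_type is None or rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, name="Protein A", kind="protein")
graph.add_node(2, name="Gene X", kind="gene")
graph.relate(1, 2, "ACTIVATES", evidence="hypothetical")

activated = [n.properties["name"] for n in graph.neighbours(1, "ACTIVATES")]
print(activated)
```

Note that the relationship itself holds a property (`evidence`), which a plain subject-predicate-object triple cannot do without extra triples.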

Neo4j - Hype or Real?

Dump Now, Connect Later.
Multiple Node Clusters.
Billions of Nodes (90% of Cases).
Nodes and Relationships have Properties.
REST API interface with client tools in many programming languages.
Natural-language-style query language (Cypher).
  • Topple the stacks of records in a Relational Database while keeping all the relationships, and you’ll see a graph. Where an RDBMS is optimized for aggregated data, Neo4j is optimized for highly connected data.
  • A Key-Value model is great for lookups of simple or even complex values. When the values are themselves interconnected, you’ve got a graph. Neo4j lets you traverse quickly among all the connected values
  • The container hierarchy of a Document Database accommodates nice, schema-free data that can easily be represented as a tree. Which is of course a graph. Refer to other documents (or document elements) within that tree and you have a more expressive representation of the same data that you can easily navigate with Neo4j.[10]
[10] Neo4J Data Models -  http://www.neo4j.org/learn/nosql
**Markov Chain in Neo4j - http://nicolemargaretwhite.blogspot.com/2013/10/markov-chains-in-neo4j.html

Cypher vs. SQL

Joins in SQL and Cypher

One-to-many join in SQL:
  SELECT bar.*
  FROM foo
  JOIN bar ON foo.id = bar.foo_id
  WHERE foo.id = 101

The same query in Cypher:
  START foo=node(101)
  MATCH foo-->bar
  RETURN bar

Many-to-many join through a join table in SQL:
  SELECT bar.*, foo_bar.*
  FROM foo
    JOIN foo_bar ON foo.id = foo_bar.foo_id
    JOIN bar ON foo_bar.bar_id = bar.id
  WHERE foo.id = 1

The same query in Cypher:
  START foo=node(1)
  MATCH foo-[foo_bar]->bar
  RETURN bar, foo_bar
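
The difference between the two styles can be tried out without a graph database. The sketch below runs the first SQL join in an in-memory SQLite database and then gets the same answer by following stored links from node 101, which is roughly what the Cypher pattern foo-->bar expresses. The table contents and the adjacency dictionary are invented sample data, not anything from the cited posts.

```python
import sqlite3

# Relational version: foo and bar rows linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (
        id INTEGER PRIMARY KEY,
        foo_id INTEGER REFERENCES foo(id)
    );
    INSERT INTO foo (id) VALUES (101), (102);
    INSERT INTO bar (id, foo_id) VALUES (1, 101), (2, 101), (3, 102);
    """
)
joined = [
    row[0]
    for row in conn.execute(
        "SELECT bar.id FROM foo "
        "JOIN bar ON foo.id = bar.foo_id "
        "WHERE foo.id = ? ORDER BY bar.id",
        (101,),
    )
]
conn.close()

# Graph version: each foo node keeps direct references to its bar nodes,
# so answering the query is a walk from the start node, not a table match.
outgoing = {101: [1, 2], 102: [3]}
traversed = sorted(outgoing[101])

print(joined, traversed)
```

Both approaches return the same bar ids. The graph version starts from one node and only touches what is connected to it, which is the property the slides credit to Neo4j for highly connected data.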
[11] http://systay.github.io/blog/2011/11/06/cypher---a-view-from-a-recovering-sql-dba/
[12] http://stackoverflow.com/questions/6873772/sql-postgres-shortest-path-in-graph-recursion/6900257#6900257

Cypher vs. SPARQL

SPARQL is designed for data models with namespace identifiers, in which each property is treated as a node.
SPARQL queries are built for triplestores, where a triple is a data entity composed of subject-predicate-object, such as "Protein A activates Gene X" or "Compound L inhibits Gene Z".
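
As a rough picture of what a triple pattern query does, the sketch below stores the two example statements as Python tuples and matches them against a pattern with wildcards. It is a toy stand-in for a triplestore, not SPARQL itself; the `match` helper and the extra "type" triple are assumptions added for illustration.

```python
# Each fact is one (subject, predicate, object) triple.
triples = [
    ("Protein A", "activates", "Gene X"),
    ("Compound L", "inhibits", "Gene Z"),
    # Even a simple attribute becomes its own triple.
    ("Protein A", "type", "Protein"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# "What does Protein A activate?"
activated_by_a = [
    o for _, _, o in match(triples, subject="Protein A", predicate="activates")
]
print(activated_by_a)
```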

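Cypher is different: it matches patterns of nodes and typed relationships in a property graph, where both can carry properties.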


SPARQL example in Chem2Bio2RDF -  http://cheminfov.informatics.indiana.edu:8080/bindingdb/snorql/

[13] Learn More about Cypher -  http://www.slideshare.net/maxdemarzi/cypher-12154713

Current Research in Graph


Bio4j is a graph-based bioinformatics database that includes most of the data available in UniProt KB (SwissProt + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI Taxonomy, and the ExPASy Enzyme DB - http://www.slideshare.net/pablo_pareja/graph-db-bioinformatics-bio4j-recent-applications-and-future-directions

CS team at the University of California, Santa Barbara - http://www.cs.ucsb.edu/~dbl/index.php

Stanford Statistics team worked on Graph theory for Immunology -  http://www-stat.stanford.edu/~susan/papers/path3.pdf /  http://www2.research.att.com/~volinsky/Graphs/slides/holmes.pdf

Disadvantages of Neo4j


  • Fewer standard tools are available than for SQL-based databases.
  • Nodes and relationships are each limited to about 34 billion; relationship types have a much lower cap.
  • Data Cluster replication cannot be done.
  • Data reliability needs to be tested, as it is a new database.

Big Data + Pathways + Graphs = ?



Integration into Lab


Substituting Neo4J for RDF Triplestore in Integrative Biology -  http://bib.oxfordjournals.org/content/14/1/109.full

Import data from the lab's databases, such as PAGED and HAPPI, into Neo4j.
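
One possible shape for such an import is sketched below: read rows from a CSV export and turn each into a parameterized Cypher MERGE statement that a Neo4j client could run. The column names (source, target, score), the Protein label and the INTERACTS_WITH relationship type are hypothetical and do not reflect the real PAGED or HAPPI schemas; the `$name` parameter style is the one used by current Neo4j versions.

```python
import csv
import io

# Made-up sample export; real PAGED/HAPPI columns will differ.
SAMPLE_EXPORT = """source,target,score
Protein A,Protein B,0.9
Protein B,Protein C,0.8
"""

# MERGE creates a node or relationship only if it does not exist yet,
# so re-running the import does not duplicate data.
MERGE_INTERACTION = (
    "MERGE (a:Protein {name: $source}) "
    "MERGE (b:Protein {name: $target}) "
    "MERGE (a)-[r:INTERACTS_WITH]->(b) "
    "SET r.score = $score"
)


def rows_to_statements(csv_text):
    """Pair the Cypher statement with one parameter dict per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (
            MERGE_INTERACTION,
            {
                "source": row["source"],
                "target": row["target"],
                "score": float(row["score"]),
            },
        )
        for row in reader
    ]


statements = rows_to_statements(SAMPLE_EXPORT)
for query, params in statements:
    print(params)
```

Keeping values in parameters rather than pasting them into the query text avoids quoting problems with gene and protein names.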

Build visualization tools for data in Neo4j and make meaningful discoveries through pathway analysis.
 

Questions?

Thank you...and Links