Big Data, Pathway Analysis & Graph Databases

                                                                 by Sandeep Shantharam

What is Big Data? How Big is Big?

“Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” [1]
According to Michael Stonebraker (MIT, works on VoltDB), Big Data means:
  1. Big Volume
  2. Big Velocity
  3. Big Variety

According to Google BigQuery, Big Data means 500 million rows.

[1] Undefined By Data: A Survey of Big Data Definitions -

Big Data and Bioinformatics

Big Data in bioinformatics is not the same as in other fields.

[2] Cost Per Raw MegaBase of DNA Sequence Graph -

Is Bioinformatics really Big Data?


Big Data + Bioinformatics

Big Data will not be about sequencing genomes and proteins. It will be about:
  • Software tools adapted to biological needs.
  • Databases that suit biological data.
  • Collaboration between various groups.

Advantages at IUPUI - 
  1. Big Data processing and computation.
  2. File storage systems scaling to petabytes.
  3. Biological patient data.*
  4. Semantic approach to data.
[3] Gene-analysis firms reach for the cloud -

Pathway Analysis

[4] Introduction to Pathway Analysis

Tools in Pathway Analysis

Pathway tools with functional discovery [6] -
  1. First Generation: Over-Representation Analysis (ORA) tools
  2. Second Generation: Functional Class Scoring (FCS) tools
  3. Third Generation: Pathway Topology (PT) tools
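
As a rough illustration of the first generation, ORA tools commonly ask whether a pathway contains more genes of interest than chance would predict, often with a hypergeometric test. The sketch below is a minimal version under that assumption; the function name `ora_p_value` and all counts are made up for illustration and are not taken from any specific tool.

```python
from math import comb


def ora_p_value(population: int, pathway_size: int, selected: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    population   -- number of genes measured (the background)
    pathway_size -- background genes annotated to the pathway
    selected     -- genes in the list of interest (e.g. differentially expressed)
    overlap      -- genes of interest that are also in the pathway
    """
    total = comb(population, selected)
    upper = min(pathway_size, selected)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts, for illustration only.
p = ora_p_value(population=20, pathway_size=5, selected=4, overlap=2)
print(f"p = {p:.4f}")
```

A small p-value suggests the pathway is over-represented in the gene list. Real tools also correct for testing many pathways at once, which this sketch leaves out.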

Pathway tools with Data Integration -
  1. WikiPathways
  2. MetaCore
  3. Pathview
[6] Khatri P, Sirota M, Butte AJ (2012) Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol 8(2): e1002375. doi:10.1371/journal.pcbi.1002375

Biological Pathway Building Process

[7] Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC (2008) Getting started in biological pathway construction and analysis. PLoS Comput Biol 4(2): e16. doi:10.1371/journal.pcbi.0040016

Challenges in Pathway Analysis

Annotation Challenges -

  • Low-resolution knowledge bases
  • Incomplete and inaccurate annotations
  • Missing condition- and cell-specific information

Methodological Challenges -

  • Lack of benchmark data sets for comparing different methods
  • Inability to model and analyze dynamic response
  • Inability to model the effects of an external stimulus

Computing Issues in Bioinformatics

  • Computation
    • algorithms, analysis, simulation, etc.
  • Visualization
    • seeing all of the data
  • Data management
    • storing and manipulating all of the data

Database Tools in Bioinformatics

Database Publications in Bioinformatics [8]
Database Categories in Bioinformatics:
  1. SQL databases such as MySQL and Oracle
  2. NoSQL databases such as MongoDB, RDF datastores, and Neo4j
[8] MetaBase—the wiki-database of biological databases -

Graph Databases

NoSQL databases that use graph structures with nodes, edges, and properties to represent and store data.
No index lookups, scale free, no schema.
[9] Graph Databases Book -


Neo4j is a true graph database: unlike a triplestore, it does not restrict data to triples.
Neo4j stores data in nodes connected by directed, typed relationships, with properties on both. This model is also known as a Property Graph.
Neo4j is open source, and its community provides many tool interfaces and language bindings to program against.
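
The property graph model can be sketched in plain Python to show what "directed, typed relationships with properties on both" means in practice. This is only an in-memory illustration, not Neo4j's storage engine or API; the class names `Node`, `Relationship` and `PropertyGraph` and the sample properties are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int       # id of the node the relationship leaves
    end: int         # id of the node the relationship enters
    rel_type: str    # relationship type, e.g. "ACTIVATES"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node id -> relationships leaving that node

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing[start].append(rel)
        return rel

    def neighbours(self, node_id, rel_type=None):
        # Follow the node's own relationship list; no global lookup is involved.
        return [
            self.nodes[rel.end]
            for rel in self.outgoing[node_id]
            if rel_type is None or rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, name="Protein A", kind="protein")
graph.add_node(2, name="Gene X", kind="gene")
graph.relate(1, 2, "ACTIVATES", evidence="illustrative only")
print([n.properties["name"] for n in graph.neighbours(1, "ACTIVATES")])
```

Because each node keeps its own list of outgoing relationships, finding neighbours is a walk over that list rather than a search through a table.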

Neo4j - Hype or Real?

Dump Now, Connect Later.
Multiple Node Clusters.
Billions of Nodes (90% of Cases).
Nodes and Relationships have Properties.
REST API interface, with client tools for many programming languages.
Natural-language-style query language (Cypher).
  • Topple the stacks of records in a Relational Database while keeping all the relationships, and you’ll see a graph. Where an RDBMS is optimized for aggregated data, Neo4j is optimized for highly connected data.
  • A Key-Value model is great for lookups of simple or even complex values. When the values are themselves interconnected, you’ve got a graph. Neo4j lets you traverse quickly among all the connected values
  • The container hierarchy of a Document Database accommodates nice, schema-free data that can easily be represented as a tree. Which is of course a graph. Refer to other documents (or document elements) within that tree and you have a more expressive representation of the same data that you can easily navigate with Neo4j.[10]
[10] Neo4J Data Models -
**Markov Chain in Neo4j -

Cypher v/s SQL

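Joins in SQL and Cypher

One-to-many join, SQL:
SELECT bar.*
FROM foo
  JOIN bar ON foo.id = bar.foo_id
WHERE foo.id = 101

The same lookup in Cypher:
START foo=node(101)
MATCH foo-->bar
RETURN bar

Many-to-many join through foo_bar, SQL:
SELECT bar.*, foo_bar.*
FROM foo
  JOIN foo_bar ON foo.id = foo_bar.foo_id
  JOIN bar ON foo_bar.bar_id = bar.id
WHERE foo.id = 1

The same lookup in Cypher:
START foo=node(1)
MATCH foo-[foo_bar]->bar
RETURN bar, foo_bar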
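
To make the comparison concrete, the sketch below runs the many-to-many join in an in-memory SQLite database and then answers the same question by walking an adjacency list, which is roughly what a graph traversal does. The table names follow the foo / bar / foo_bar example; the rows, the `name` and `weight` columns, and the dictionaries are made up for illustration, and no Neo4j is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE foo_bar (foo_id INTEGER, bar_id INTEGER, weight REAL);
    INSERT INTO foo VALUES (1), (2);
    INSERT INTO bar VALUES (10, 'a'), (11, 'b'), (12, 'c');
    INSERT INTO foo_bar VALUES (1, 10, 0.5), (1, 12, 0.9), (2, 11, 0.1);
""")

# Relational view: two joins through the link table.
sql_rows = conn.execute("""
    SELECT bar.id, bar.name, foo_bar.weight
    FROM foo
      JOIN foo_bar ON foo.id = foo_bar.foo_id
      JOIN bar ON foo_bar.bar_id = bar.id
    WHERE foo.id = 1
    ORDER BY bar.id
""").fetchall()

# Graph view: each foo node holds its own relationships, with their properties.
bars = {10: "a", 11: "b", 12: "c"}
adjacency = {
    1: [(10, {"weight": 0.5}), (12, {"weight": 0.9})],
    2: [(11, {"weight": 0.1})],
}
graph_rows = [(bar_id, bars[bar_id], props["weight"]) for bar_id, props in adjacency[1]]

print(sql_rows)
print(graph_rows)
```

Both views return the same rows. The difference is that the link table disappears in the graph view, because the relationship itself carries the properties.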

Cypher v/s SPARQL

SPARQL is designed for data models with namespace identifiers, in which each property is treated as a node.
SPARQL queries are built for triplestores, where each triple is a data entity composed of subject-predicate-object, such as "Protein A activates Gene X" or "Compound L inhibits Gene Z".
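
The triple model can be sketched in a few lines of Python using the two example statements above. The `match` helper is hypothetical and is not SPARQL; it only mimics how a query leaves some positions of a triple open as variables.

```python
# Each fact is one (subject, predicate, object) triple.
triples = [
    ("Protein A", "activates", "Gene X"),
    ("Compound L", "inhibits", "Gene Z"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# "What inhibits something?" leaves subject and object open.
print(match(triples, predicate="inhibits"))
```

Any extra detail about a statement, such as a confidence score, would need further triples, since a triple has no slot for properties of its own.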

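Cypher is different: it matches patterns of nodes and relationships in a property graph, where properties live on the nodes and relationships themselves.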

SPARQL example in Chem2Bio2RDF -

[13] Learn More about Cypher -

Current Research in Graph

Bio4j is a graph-based bioinformatics database that includes most of the data available in UniProt KB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI Taxonomy, and the ExPASy Enzyme DB -

CS team at the University of California, Santa Barbara -

Stanford Statistics team worked on graph theory for immunology -

Disadvantages of Neo4j:

  • Fewer standard tools are available than for SQL-based databases.
  • Limit of about 34 billion nodes and 34 billion relationships; the number of relationship types is capped far lower (about 32,000).
  • Data cluster replication cannot be done.
  • Data reliability still needs to be tested, as it is a new database.

Big Data + Pathways + Graphs = ?

Integration into Lab

Substituting Neo4j for the RDF triplestore in Integrative Biology -

Import data from the lab's databases, such as PAGED and HAPPI, into Neo4j.
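
A first step of such an import could look like the sketch below, which turns tabular rows into a de-duplicated node set and a relationship list ready to be loaded into a graph. Everything here is an assumption: the column names `protein_a`, `protein_b` and `confidence`, the `INTERACTS_WITH` type, and the sample rows are hypothetical and do not reflect the real PAGED or HAPPI schemas.

```python
import csv
import io

# Hypothetical export; real column names in PAGED or HAPPI may differ.
raw = io.StringIO(
    "protein_a,protein_b,confidence\n"
    "P1,P2,0.9\n"
    "P2,P3,0.4\n"
    "P1,P3,0.7\n"
)

nodes = {}          # protein name -> node properties
relationships = []  # (start, type, end, relationship properties)

for row in csv.DictReader(raw):
    # Create each protein node once, however many rows mention it.
    for name in (row["protein_a"], row["protein_b"]):
        nodes.setdefault(name, {"name": name})
    relationships.append(
        (
            row["protein_a"],
            "INTERACTS_WITH",
            row["protein_b"],
            {"confidence": float(row["confidence"])},
        )
    )

print(len(nodes), "nodes,", len(relationships), "relationships")
```

Keeping the per-row score as a relationship property is one way to carry source-database detail into the graph without adding extra nodes.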

Build visualization tools for the data in Neo4j and make meaningful discoveries through pathway analysis.

Questions?

Thank you...and Links