Big Data, Pathway analysis & Graph Databases
by Sandeep Shantharam
What is Big Data? How Big is Big?
“Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
According to Michael Stonebraker (MIT, works on VoltDB):
- Big Volume
- Big Velocity
- Big Variety
According to Google BigQuery, Big Data Means 500 Million Rows.
Big Data and Bioinformatics:
Big Data in bioinformatics is not the same as in other fields.
Is bioinformatics really Big Data?
Big Data + Bio-Informatics
Big Data will not be about Sequencing Genomes and Proteins.
Software tools adapted to biological needs.
- Databases that suit biological data.
- Collaboration between various groups.
Advantages at IUPUI -
- Big Data processing and computation.
- File Storage systems scaling to Petabytes.
- Biological Patient Data.*
- Semantic Approach to Data.
Tools in Pathway Analysis
Pathway tools with Functional discovery -
- First Generation: Over-Representation Analysis (ORA) tools
- Second Generation: Functional Class Scoring (FCS) tools
- Third Generation: Pathway Topology (PT) tools
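To make the first generation concrete: ORA tools typically ask whether a pathway's genes show up in a list of selected (for example, differentially expressed) genes more often than chance would predict, commonly using a hypergeometric test. Below is a minimal Python sketch of that calculation. The function name `ora_p_value` and the gene counts in the usage lines are illustrative assumptions, not taken from any specific tool.

```python
from math import comb


def ora_p_value(population: int, pathway_size: int, selected: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    population   -- number of genes measured in total
    pathway_size -- number of those genes annotated to the pathway
    selected     -- number of genes in the list of interest
    overlap      -- number of selected genes that are in the pathway
    """
    total = comb(population, selected)
    upper = min(pathway_size, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes measured, 100 in the pathway,
# 500 selected, 10 of them in the pathway.
p = ora_p_value(20000, 100, 500, 10)
print(f"ORA p-value: {p:.3g}")
```

A small p-value suggests the pathway is over-represented in the selected list. Real tools also correct for testing many pathways at once, which this sketch leaves out.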
Pathway tools with Data Integration -
- WikiPathways (www.wikipathways.org)
- MetaCore (www.genego.com)
- Pathview (http://pathview.r-forge.r-project.org/)
Khatri P, Sirota M, Butte AJ (2012) Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol 8(2): e1002375. doi:10.1371/journal.pcbi.1002375
Biological Pathway Building Process
Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC (2008) Getting started in biological pathway construction and analysis. PLoS Comput Biol 4(2): e16. doi:10.1371/journal.pcbi.0040016
Challenges in Pathway Analysis
Annotation Challenges -
Low resolution knowledge bases
Incomplete and inaccurate annotations
Missing condition- and cell-specific information.
Methodological Challenges -
Benchmark data sets for comparing different methods
Inability to model and analyze dynamic response
Inability to model effects of external stimuli
Computing Issues in Bioinformatics
algorithms, analysis, simulation, etc.
storing and manipulating all of the data
Database Tools in Bioinformatics
Database Publications in Bioinformatics
Database Categories in Bioinformatics:
- SQL databases such as MySQL and Oracle
- NoSQL databases such as MongoDB, RDF datastores and Neo4j
Graph databases are NoSQL databases that use graph structures with nodes, edges, and properties to represent and store data.
No index lookups (index-free adjacency), scale free, no schema.
A true graph database, not limited to triples the way a triplestore is.
Neo4j stores data in nodes connected by directed, typed relationships with properties on both, also known as a Property Graph.
Neo4j is open source, and its community maintains many tool interfaces and language bindings for programming against it.
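The property graph model described above can be sketched in a few lines of Python. This is only an in-memory illustration of the data structure, not how Neo4j is implemented. The class names, the `ACTIVATES` relationship type and the `evidence` property are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int          # id of the source node (relationships are directed)
    end: int            # id of the target node
    rel_type: str       # every relationship has exactly one type
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of outgoing Relationship objects

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        self.outgoing.setdefault(node_id, [])

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing[start].append(rel)
        return rel

    def neighbours(self, node_id, rel_type=None):
        # Follow outgoing relationships directly from the node,
        # optionally filtering by relationship type.
        return [
            self.nodes[rel.end]
            for rel in self.outgoing[node_id]
            if rel_type is None or rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, ["Protein"], name="Protein A")
graph.add_node(2, ["Gene"], name="Gene X")
graph.relate(1, 2, "ACTIVATES", evidence="hypothetical")
```

Both nodes and relationships carry their own key-value properties, which is the difference from a plain edge list.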
Neo4j: Hype or Real?
Dump Now, Connect Later.
Multiple Node Clusters.
Billions of Nodes (90% of Cases).
Nodes and Relationships have Properties.
REST API interface, with client tools in many programming languages.
Query language (Cypher) that reads like natural language.
Topple the stacks of records in a Relational Database while keeping all the relationships, and you’ll see a graph. Where an RDBMS is optimized for aggregated data, Neo4j is optimized for highly connected data.
A Key-Value model is great for lookups of simple or even complex values. When the values are themselves interconnected, you’ve got a graph. Neo4j lets you traverse quickly among all the connected values.
The container hierarchy of a Document Database accommodates nice, schema-free data that can easily be represented as a tree. Which is of course a graph. Refer to other documents (or document elements) within that tree and you have a more expressive representation of the same data that you can easily navigate with Neo4j.
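Cypher vs. SQL
Joins in SQL and Cypher
SQL, one-to-many join:
SELECT bar.*
FROM foo
JOIN bar ON foo.id = bar.foo_id
WHERE foo.id = 101
SQL, many-to-many join through foo_bar:
SELECT bar.*, foo_bar.*
FROM foo
JOIN foo_bar ON foo.id = foo_bar.foo_id
JOIN bar ON foo_bar.bar_id = bar.id
WHERE foo.id = 1
Cypher, the same many-to-many lookup as a relationship pattern:
MATCH (foo)-[foo_bar]->(bar)
WHERE foo.id = 1
RETURN bar, foo_bar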
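To see what the many-to-many join above costs compared with following relationships directly, here is a small self-contained Python sketch. It runs the join against an in-memory SQLite database and then answers the same question from an adjacency list. The `name` and `weight` columns and all the data are invented for the illustration, and the adjacency list only mimics the idea of graph traversal, not Neo4j itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE foo_bar (foo_id INTEGER, bar_id INTEGER, weight REAL);
    INSERT INTO foo VALUES (1), (2);
    INSERT INTO bar VALUES (10, 'a'), (11, 'b'), (12, 'c');
    INSERT INTO foo_bar VALUES (1, 10, 0.5), (1, 11, 0.9), (2, 12, 0.1);
    """
)

# Relational version: two joins through the link table.
rows = conn.execute(
    """
    SELECT bar.name, foo_bar.weight
    FROM foo
    JOIN foo_bar ON foo.id = foo_bar.foo_id
    JOIN bar ON foo_bar.bar_id = bar.id
    WHERE foo.id = 1
    ORDER BY bar.id
    """
).fetchall()

# Graph-style version: each foo node keeps its own list of
# (neighbour name, relationship weight) pairs, so no join is needed.
adjacency = {
    1: [("a", 0.5), ("b", 0.9)],
    2: [("c", 0.1)],
}
traversed = adjacency[1]

print(rows)
print(traversed)
```

Both approaches return the same pairs. The difference is that the relational query rebuilds the connections at query time, while the adjacency list stores them with the node.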
Cypher vs. SPARQL
SPARQL is designed for data models with namespace identifiers, in which each property is treated as a node.
SPARQL queries are built for triplestores, where each triple is a data entity composed of subject-predicate-object, such as "Protein A activates Gene X" or "Compound L inhibits Gene Z".
Cypher instead matches patterns over a property graph, where nodes and relationships carry their own properties.
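The subject-predicate-object idea can be sketched in Python as a list of tuples plus a wildcard matcher, which is roughly what a basic triple pattern does. This is a toy illustration, not a SPARQL engine. The `match` helper and the third triple are assumptions added for the example.

```python
# Each fact is one (subject, predicate, object) triple.
triples = [
    ("Protein A", "activates", "Gene X"),
    ("Compound L", "inhibits", "Gene Z"),
    # Even a simple attribute becomes its own triple in this model.
    ("Protein A", "type", "Protein"),
]


def match(pattern, store):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "Everything known about Protein A"
about_protein_a = match(("Protein A", None, None), triples)

# "Which statements are inhibitions?"
inhibitions = match((None, "inhibits", None), triples)

print(about_protein_a)
print(inhibitions)
```

In a property graph the `type` fact would live on the node itself as a label or property, and not as a separate statement.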
Current Research in Graph
Disadvantages of Neo4j:
Few standard tools are available compared to SQL-based databases.
Nodes and relationships are each limited to about 34 billion, and the number of relationship types is also capped.
Data cluster replication cannot be done.
Data reliability needs to be tested, as it is a new database.
Big Data + Pathways + Graphs = ?
Integration into Lab
Import data from the lab's databases, such as PAGED and HAPPI, into Neo4j.
Build Visualization tools for data in Neo4j and form meaningful discoveries by Pathway analysis.
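One possible shape for the import step is sketched below in Python: read a tabular export and turn each row into a parameterized Cypher `MERGE` statement that a Neo4j client could then execute. Everything specific here is an assumption for illustration: the column names `source`, `target` and `score`, the `Protein` label, the `INTERACTS_WITH` relationship type and the sample rows. The real PAGED and HAPPI schemas will differ.

```python
import csv
import io

# Hypothetical export; real PAGED/HAPPI columns will differ.
SAMPLE_EXPORT = """source,target,score
P1,P2,0.9
P2,P3,0.4
"""

# MERGE creates a node or relationship only if it does not already exist,
# so the import can be re-run without duplicating data.
MERGE_QUERY = (
    "MERGE (a:Protein {id: $source}) "
    "MERGE (b:Protein {id: $target}) "
    "MERGE (a)-[r:INTERACTS_WITH]->(b) "
    "SET r.score = $score"
)


def rows_to_statements(text):
    """Turn CSV text into (query, parameters) pairs, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (
            MERGE_QUERY,
            {
                "source": row["source"],
                "target": row["target"],
                "score": float(row["score"]),
            },
        )
        for row in reader
    ]


statements = rows_to_statements(SAMPLE_EXPORT)
for query, params in statements:
    print(query, params)
```

Keeping the values in parameters, and out of the query string itself, avoids quoting problems with identifiers that contain special characters.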