With all of the discussion about Big Data these days, there is frequent reference to the 3 V's that represent the top big data challenges: Volume, Velocity, and Variety. These refer to the size of the dataset (Volume), the rate at which data flows into (or out of) your systems (Velocity), and the complexity, or dimensionality, of the data (Variety). Most practitioners agree that big data volume is indeed huge, but it is not necessarily big data's biggest challenge, at least not in terms of storage, since storage capacities are also growing rapidly and keeping pace with data volume. Velocity is also a serious challenge, though primarily for applications and use cases that specifically demand near-real-time analysis of and response to dynamic data streams. Unlike volume and velocity, however, most will agree that variety (complexity) is truly big data's biggest mega-challenge, at all scales and in most applications. Any dataset, whether large or small, that has hundreds or thousands (or more) dimensions per data item is difficult to explore, mine, and interpret. So, when you find a data tool that helps in the analysis of high-dimensional data, you stop and take a look. I did that recently with the AutoDiscovery tool from Butler Scientifics.
Exploratory Data Analysis (EDA) for Small Data
First, note that this tool is not explicitly for big data, though it is certainly useful for small subsets of big data: that is, small data! The focus is therefore on scientific discovery from small data. This is the style of data science that nearly every scientist needs to carry out on a routine basis, since data from daily experiments rarely reach the rarefied realm of big data, even though modern scientific instruments often generate large numbers of measured parameters per data object. AutoDiscovery enables the discovery, exploration, and visualization of correlations in high-dimensional data from such experiments, i.e., Exploratory Data Analysis (EDA).
The Top 10 Features of an EDA Tool
One of AutoDiscovery's most sensible design choices is that it does not try to be the "one tool" for all possible statistical analyses. There are other statistical software packages that already do that, and there is no need to compete with giants like R, SAS, or SPSS. Consequently, AutoDiscovery aims to satisfy a very particular scientific discovery requirement: correlation discovery in the high-dimensional parameter spaces of complex (high-variety) data. It is a complement to those other (more comprehensive) statistical packages, not a competitor.
Correlation discovery alone may seem relatively simple, and a specialized tool for it may therefore seem unnecessary. However, several proprietary features within AutoDiscovery more than justify its use. The top 10 features of AutoDiscovery for exploring complex relationships in data for scientific discovery are:

(1) simplified, visual integration of data from multiple sources (including "primary key" discovery across multiple data tables);
(2) a streamlined, easy-to-use visual EDA environment for data selection, filtering, and exploration;
(3) rapid discovery of interesting findings that can confirm (or refute) initial hypotheses, inform further experimentation and experimental design, and generate multiple additional testable hypotheses;
(4) automatic search for significant correlations across the full set of pairwise parameter combinations in your dataset (a minimal sketch of this idea appears after this list);
(5) automatic search for significant correlations between virtual parameters (i.e., ratios of the original input parameters);
(6) quantitative assessment and evaluation of the value of each finding;
(7) automatic sorting of results, including deprecation of weak and insignificant correlations, which are placed lower in the output listings but remain searchable if wanted;
(8) optional correlation analyses within multiple sub-segments of each parameter's range of values, thereby enabling discovery of changes in parameter correlations across limited ranges of the data, a reality often observed in complex scientific experiments;
(9) visual tools that present the linked network of the most significant pairwise correlations among scientific parameters; and
(10) correlation analysis outputs (tables, visualizations, and exportable correlation tables) that enable efficient and effective browsing, exploration, and navigation of causal connections (and causal direction) in correlated data items.
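AutoDiscovery's engine is proprietary, so the following is only a minimal sketch of what the automatic pairwise search of features (4) and (5) could look like in Python (using pandas and SciPy). The function names, the simple Pearson test, and the significance threshold are my illustrative assumptions, not the product's actual method:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_correlations(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Search every pairwise parameter combination for significant correlations."""
    rows = []
    numeric = df.select_dtypes(include=[np.number]).columns
    for x, y in combinations(numeric, 2):
        valid = df[[x, y]].dropna()
        # Pearson tests linear association; stats.spearmanr would instead
        # cover any monotonic (but possibly non-linear) relationship.
        r, p = stats.pearsonr(valid[x], valid[y])
        rows.append({"param_x": x, "param_y": y, "r": r, "p": p,
                     "significant": p < alpha})
    result = pd.DataFrame(rows)
    result["abs_r"] = result["r"].abs()
    # Strong, significant correlations first; weak ones sink to the bottom
    # of the listing but remain in the table (cf. feature 7).
    return (result.sort_values(["significant", "abs_r"], ascending=False)
                  .drop(columns="abs_r"))


def add_virtual_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Append 'virtual parameters' (ratios of the original inputs, cf.
    feature 5) so the same pairwise search also covers them."""
    out = df.copy()
    for x, y in combinations(df.select_dtypes(include=[np.number]).columns, 2):
        out[f"{x}/{y}"] = df[x] / df[y].replace(0, np.nan)
    return out
```

Note that a real screen of this kind would also need to correct its p-values for the many comparisons being made, which is exactly the statistical concern raised in the next section.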
Exploratory and Confirmatory Analyses
For scientists, EDA is crucial in the early stages of an experiment: exploratory and confirmatory analyses together enable discovery, hypothesis testing, and refinement of scientific hypotheses. More detailed analysis then follows from initial discoveries of interesting and significant parameter correlations within complex high-dimensional data. A recent Nature article, "Statistical Errors: P Values, the Gold Standard of Statistical Validity, Are Not as Reliable as Many Scientists Assume" (Regina Nuzzo, Nature, 506, 150-152, 2014), quotes Columbia University statistician Andrew Gelman on this point: instead of doing multiple separate small studies, "researchers would first do small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings." In other words, a disciplined scientific methodology that includes both exploratory and confirmatory analyses can be documented within an open science framework (e.g., https://osf.io) to demonstrate repeatability and reproducibility in scientific experiments. This would break down the walls of "black box" software that hide the complex analyses being applied to complex data. The ability of scientists and their peers to reproduce an experiment's rationale as well as its results will yield greater transparency in scientific research. AutoDiscovery is a tool that can further the Open Science cause.
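Gelman's point about false alarms can be made concrete: when an exploratory screen tests thousands of parameter pairs at once, some raw p-values will fall below 0.05 by chance alone, so findings must survive a multiple-comparison correction before being treated as confirmed. The snippet below is a standard textbook illustration using the Benjamini-Hochberg false-discovery-rate procedure from statsmodels, not a description of AutoDiscovery's internals:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# Simulate an exploratory screen of 1,000 parameter pairs that are all
# actually uncorrelated, so every raw "significant" result is a false alarm.
raw_p = rng.uniform(size=1000)

print("naive 'discoveries' at p < 0.05:", int((raw_p < 0.05).sum()))  # ~50

# Benjamini-Hochberg controls the expected fraction of false discoveries.
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print("discoveries after FDR correction:", int(reject.sum()))  # ~0
```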
Four Benefits of Early Findings from EDA
AutoDiscovery objectively discovers interesting findings in the early stages of research. This provides four additional benefits to the scientist in the EDA stage of research: (a) it informs improvements in the experimental design; (b) it validates and substantiates a priori hypotheses; (c) it generates multiple new testable hypotheses; and (d) it reveals promising "hot spots" in the data that require deeper statistical analysis. The latter capability is quite exciting: "interestingness" discovery, i.e., finding the unexpected, unusual, "interesting" regions and features within your data's multi-dimensional parameter space! Especially with complex data, the combined sum of these capabilities empowers the data scientist to tell the "data story" in the full dimensionality of the dataset, not just in a few limited 2-D or 3-D projections. Consequently, AutoDiscovery is an objective, quantifiable feature-discovery tool that presents the most interesting correlations to end users for efficient and effective EDA: efficient in the sense that automatic discovery of the most interesting data correlations for deeper analysis avoids countless fruitless searches and manual manipulations of the data collection; and effective in the sense that novel discoveries (beyond known correlations and expected relationships) become possible.
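One way to picture what "hot spot" discovery could mean in practice is to compare a correlation computed over a parameter's full range against correlations within sub-segments of that range (cf. feature 8 above); segments whose local correlation departs sharply from the global value are candidates for deeper analysis. This is a hypothetical sketch of the idea, not Butler Scientifics' algorithm:

```python
import pandas as pd
from scipy import stats


def segment_correlations(df: pd.DataFrame, x: str, y: str, segments: int = 4) -> pd.DataFrame:
    """Compare the global x-y correlation against correlations inside
    equal-count sub-segments of x's range; segments whose local
    correlation deviates strongly from the global value are flagged
    as candidate 'hot spots' for deeper statistical analysis."""
    global_r, _ = stats.pearsonr(df[x], df[y])
    bins = pd.qcut(df[x], q=segments, duplicates="drop")
    rows = []
    for interval, group in df.groupby(bins, observed=True):
        if len(group) < 3:  # too few points for a meaningful correlation
            continue
        local_r, local_p = stats.pearsonr(group[x], group[y])
        rows.append({"segment": str(interval), "n": len(group),
                     "local_r": local_r, "p": local_p,
                     "deviation": abs(local_r - global_r)})
    # Largest deviations from the global correlation float to the top.
    return pd.DataFrame(rows).sort_values("deviation", ascending=False)
```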
Three Types of Data Relationships
The discovery of more complex relationships (e.g., multi-valued or non-monotonic data patterns) in multi-dimensional data requires specialized tools and transformations that are currently beyond the scope of AutoDiscovery (or of any other readily accessible tool), though discovery of these types of patterns may be enabled in future releases of EDA tools. A classic example of a multi-valued data relationship is an S-shaped 2-D surface embedded in a 3-D space: discovery of such hypersurfaces requires special algorithms (such as locally linear embedding or other manifold learning methods) that are not available in off-the-shelf EDA packages. An example of a non-monotonic data relationship is revealed in the solution to the "island of games" puzzle. Monotonic relationships typically underlie cause-effect studies in science, and that is why EDA software (such as AutoDiscovery) currently targets the discovery of those types of data relationships.
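Neither a linear nor a monotonic correlation test will recover a multi-valued relationship like that S-shaped surface, which is where manifold learning comes in. Here is a minimal illustration (my example, not an AutoDiscovery capability) using scikit-learn's built-in S-curve dataset and locally linear embedding:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# A 2-D S-shaped surface embedded in 3-D: no single-valued function
# y = f(x) describes it, so a pairwise correlation search cannot find it.
X, color = make_s_curve(n_samples=1500, noise=0.05, random_state=0)

# Locally linear embedding "unrolls" the surface into its intrinsic 2-D
# coordinates, where simpler (monotonic) relationships become visible.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)
print(X_unrolled.shape)  # (1500, 2)
```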
AutoDiscovery Case Study and Getting Started
The Butler Scientifics website reports a case study in which neuroscientists in the Laboratory of Adult Neurogenesis at Cajal Institute (CSIC, Madrid) used AutoDiscovery to discover correlations between neuron properties and behavior patterns, and the effects of stress and anxiety on learning and memory capacity. They describe the results this way: “AutoDiscovery took less than 2 hours to find out not only all the correlations that the group had identified during their 8-weeks intensive work but also several key correlations that, with a further confirmatory phase, confirmed their original hypothesis.” That is precisely the type of efficiency amplifier that I can use in my research, and I believe that other scientists will experience similar accelerations of their discovery science.
Read more about AutoDiscovery, download a free trial, request a demo, and begin discovering the most interesting features in your ocean of complex data today at http://www.butlerscientifics.com/. A new release (version 2.0) of AutoDiscovery is now available for all scientists and data explorers who want to explore the complex relationships within their data for scientific discovery. The development team at Butler Scientifics is ready to support users of the AutoDiscovery tool and to provide licensing terms that fit any budget, from individuals to small research teams to entire research institutions.
Kirk is a data scientist, top big data influencer, and professor of astrophysics and computational science at George Mason University. He spent nearly 20 years supporting NASA projects, including NASA's Hubble Space Telescope as Data Archive Project Scientist, NASA's Astronomy Data Center, and NASA's Space Science Data Operations Office. He has extensive experience in large scientific databases and information systems, including expertise in scientific data mining. He is currently working on the design and development of the proposed Large Synoptic Survey Telescope (LSST), for which he is contributing in the areas of science data management, informatics and statistical science research, galaxies research, and education and public outreach.