Environmental Scenario Search Engine

Mining environmental data archives

Environmental Scenario Search Engine (ESSE)

M. Zhizhin, Geophysical Center, Russian Acad. Sci.
E. Kihn, National Geophysical Data Center, NOAA

Problem statement

Environmental informatics (Hilty 1995) is a rapidly expanding area of computer and natural science. Growing data volumes from today's collection systems, together with the scientific community's need to include an integrated and authoritative representation of the natural environment in its analyses, call for a new approach to data management and access. The natural environment includes elements from multiple domains such as space, terrestrial weather, oceans and terrain. Systems such as the Global Change Master Directory (GCMD) from NASA or the Master Environmental Library (MEL) from the DOD provide the ability to search by "keywords" for archived environmental data sets distributed across the network, but the ability to search for specific "scenarios" (sets of conditions within the environmental data) does not yet exist.

At the same time, the environmental modeling community has begun to develop several archives of continuous environmental representations. These archives take observational data and, through modeling, create a regular, parameterized view of the Earth system. The models use all available observation data as initial conditions, so the resulting data sets may jointly be considered an authoritative high-resolution representation of terrestrial weather and near-Earth space during the last 50 years.

When interacting with these enormous resources, imagine, for example, that the end user does not need all the weather data covering Florida for the last 50 years, but rather an example of a typical Florida spring storm. Further imagine that the user needs to know how often such storms occur, or whether they have become more frequent in the last 10 years. The Environmental Scenario Search Engine (ESSE) will address such problems. The prime requirement of the ESSE system design will be to allow the user to query environmental data archives in human linguistic terms. Natural language is not easily translated into the absolute terms of 0 and 1 that make up the digital world, so the mapping between human language and computer systems will involve fuzzy logic. Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth, that is, truth values between "completely true" and "completely false". It was introduced by Dr. Lotfi Zadeh (Zadeh 1965) of UC Berkeley in the 1960s as a means to model the uncertainty of natural language. The ESSE will act as a bridge between the questions the user needs to ask of the environment and the data that describe it.

Architecture

The ESSE architecture will rely heavily on an object-oriented fuzzy logic engine to search for events and to analyze the statistical distribution of the identified events for the user. It will allow parallel mining of several distributed data sources, possibly from different subject areas and not limited to space physics or terrestrial weather. Both the fuzzy logic engine and the data sources will be implemented as web services (Figure 1), so that third-party applications written in different languages (Java, C++, Perl, C#) can select from different data sources and search for events with the fuzzy logic engine, using interfaces and data structures derived from the web service definitions (WSDL). We will use web services as mediators to the environmental archives: they help to abstract the archives into a generic ESSE data model and to bypass the security limitations that firewalls impose on most other connection protocols.


Figure 1 ESSE web services

To illustrate possible use, the ESSE system will include a prototype user interface implemented as a web application. In the web application it will be possible:

  • to discover data sources by keyword-based metadata search;
  • to use predefined weather events (e.g. "ice storm", "flood") as well as to define the searching event as a combination of fuzzy conditions on a set of environmental parameters (e.g. "high temperature and low relative humidity") for data mining;
  • to review the statistics of detected events;
  • to visualize the selected event as time series plots or contour maps;
  • to download the event data in self-describing format (NetCDF or HDF) to the user's workstation.

    Web services for the ESSE authoritative data sources and the fuzzy search engine, as well as the prototype user interface, will be installed on two mirror servers, one in the U.S. and one in Russia.

    Authoritative data sources

    The real connection between the ESSE system and a given user community is the set of data sources that support it. It will be relatively easy to add a new data source to the ESSE through the web services interface, so the list below should not be taken as limiting but rather as a starting point that demonstrates the ESSE functionality.

    Data Source | Type   | Sample Parameters                    | Temporal Coverage | Spatial Coverage  | Size   | URL
    NCEP/NCAR   | Meteo. | Temperature, Wind Speed, Cloud Cover | 1949 - present    | Global @ 2.5 deg. | 250 Gb | http://dss.ucar.edu/pub/reanalysis/
    SPIDR       | Space  | Kp index, Sunspot Number             | 1933 - present    | Global            | 30 Gb  | http://spidr.ngdc.noaa.gov

    The first thing to notice is the relatively large size of the archives. Using the distributed database concept allows us to perform interactive mining on these substantial data sources. The second thing to notice is the long temporal ranges. The ESSE will be most useful when the size of the archive prohibits or makes impractical searching by hand.

    To describe each resource briefly, the NCEP/NCAR reanalysis data archive was derived from numerical weather prediction model runs. It represents gridded output on a regular time step (6 hours) and a fixed grid step (2.5 deg). The model uses data ingest procedures to assimilate observational data into model results and so produce a consistent picture of the terrestrial weather during the last 50 years. The Space Physics Interactive Data Resource (SPIDR) is an observational data source which includes the output of numerical models. The SPIDR system currently handles the following: Defense Meteorological Satellite Program (DMSP) visible, infrared and microwave browse imagery, ionospheric parameters, geomagnetic variations, geophysical and solar indices, GOES satellite x-ray, plasma, and magnetometer data, cosmic rays, and solar radio telescope data sets.

    We also plan to add gridded space weather, ocean and terrain data in the near future, making the ESSE mining technology available across a wide representation of the "Digital Earth" environment. Adding a new data source to the ESSE will require three steps: a) write a web service implementing the standard interface with two methods, getMetadata and getData, which subsets the data source into a simple data model; b) create a metadata document, following a predefined XML schema, that describes the parameters and the time and spatial coverage of the data source; c) add the web service address and credentials to the ESSE configuration.
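
    Step a) can be sketched as a small adapter class. This is only an illustration: the method names getMetadata and getData come from the text above, while the argument lists, the returned structures and the in-memory toy source are assumptions made for the example, not the actual ESSE interface.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta


class DataSource(ABC):
    """Hypothetical two-method interface; only the method names come from the text."""

    @abstractmethod
    def getMetadata(self) -> dict:
        """Describe the parameters, time coverage and spatial coverage."""

    @abstractmethod
    def getData(self, parameter, start, end, lat=None, lon=None) -> list:
        """Return (time, value) pairs for one parameter (assumed signature)."""


class InMemorySource(DataSource):
    """Toy source with a 6-hour time step, standing in for a real archive."""

    def __init__(self, parameter, start, values, step_hours=6):
        self.parameter = parameter
        self.start = start
        self.values = list(values)
        self.step = timedelta(hours=step_hours)

    def getMetadata(self):
        end = self.start + self.step * (len(self.values) - 1)
        return {
            "parameters": [self.parameter],
            "time_coverage": (self.start, end),
            "spatial_coverage": "single point (illustrative)",
        }

    def getData(self, parameter, start, end, lat=None, lon=None):
        if parameter != self.parameter:
            raise KeyError(parameter)
        series = []
        for i, value in enumerate(self.values):
            t = self.start + self.step * i
            if start <= t <= end:  # keep only samples inside the requested interval
                series.append((t, value))
        return series


source = InMemorySource("surface temperature", datetime(2000, 1, 1),
                        [8.0, 9.5, 11.0, 10.2])
subset = source.getData("surface temperature",
                        datetime(2000, 1, 1, 6), datetime(2000, 1, 1, 12))
```

    A real adapter would read from the archive and be exposed through a web service container; the sketch only shows how two methods are enough to describe a source and to subset it.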

    Fuzzy search algorithm

    People often use qualitative notions to describe such variables as temperature, pressure, pulse rate. In reality, it is difficult to put a single threshold between what is called "warm" and "hot". Fuzzy set theory serves as a translator from vague linguistic terms into strict mathematical objects. This is exactly what is needed to bridge the gap between current environmental archives and the policy makers, users and scientists who need to access them.
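
    As a rough illustration of how a linguistic term becomes a mathematical object, the sketch below defines two membership functions for temperature. The function shapes (triangle and ramp) and all breakpoints in degrees Celsius are assumptions chosen for the example; the text does not prescribe them.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def triangle(x, left, peak, right):
    """Membership that is 1 at `peak` and falls to 0 at `left` and `right`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def warm(temp_c):
    # Illustrative breakpoints only: fully "warm" at 22 C.
    return triangle(temp_c, 15.0, 22.0, 30.0)


def hot(temp_c):
    # Illustrative breakpoints only: fully "hot" from 35 C upward.
    return ramp_up(temp_c, 25.0, 35.0)


# 27 C is partly "warm" and partly "hot": there is no single threshold.
degree_warm = warm(27.0)
degree_hot = hot(27.0)
```

    With these breakpoints, a temperature of 27 C belongs to both sets with partial degrees, which is exactly the overlap between "warm" and "hot" that a single threshold cannot express.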

    Intelligent environmental scenario searching across the distributed resources will be performed within the ESSE fuzzy search engine. The scenario editor from the ESSE user interface will be used to formulate a set of conditions to be satisfied by the candidate events. The search conditions may be specified in a number of ways depending on the user's familiarity with the region/data of interest. An expert user can specify exact thresholds and/or limitations that must be maintained on certain parameters. Conditions can also be specified via abstract natural language definitions for each parameter. For instance, temperature limitations can be specified as "hot", "cold", or "typical". The query can also be specified in terms of predefined rules which collect conditions into a named set. Thus, a user can specify the following weather search request:

    (VERY LARGE "precipitation rate") AND ("surface temperature" ABOUT 10 C) AND (LOW "vertical wind speed at pressure level 1000 mbar")

    or the following search request for severe magnetic storms in the space environment:

    (VERY LARGE "Kp index") AND (VERY LOW "Dst index").

    The result of such a request, as reported by the fuzzy search engine, will be a list of the "most likely" dates for the event, ranked by the sorted values of the aggregated multidimensional fuzzy membership function (MF). The aggregation will be done using a fuzzy analog of the logical AND operator.
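
    The ranking step can be sketched as follows. The text does not say which fuzzy AND is used, so this example assumes the minimum operator, one common choice; the membership values and dates are made up for illustration.

```python
def fuzzy_and(*memberships):
    """Fuzzy AND taken as the minimum (an assumption, not the confirmed ESSE operator)."""
    return min(memberships)


def rank_events(times, condition_series, top=3):
    """Aggregate per-condition membership series and return the best-scoring times."""
    # zip(*...) groups the membership values of all conditions at each time step.
    scores = [fuzzy_and(*values) for values in zip(*condition_series)]
    ranked = sorted(zip(times, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]


# Made-up membership values for (VERY LARGE "Kp index") AND (VERY LOW "Dst index").
times = ["2000-03-01", "2000-03-02", "2000-03-03", "2000-03-04"]
very_large_kp = [0.2, 0.9, 0.7, 0.1]
very_low_dst = [0.6, 0.8, 0.3, 0.0]

best = rank_events(times, [very_large_kp, very_low_dst], top=2)
```

    With the minimum operator, a date scores highly only when every condition is well satisfied, so a single weak condition pulls the whole candidate down the list.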

    The ESSE client application will search for events in an environment where the input variables, the one-dimensional MFs and therefore the fuzzy AND aggregation of the desired conditions all depend on time. We treat the values of the resulting time series as the "likeliness" that the environmental event occurred at time t, and take the highest values of the aggregated MF as the most likely candidates for the environmental event.

    To be able to search for events like "the hottest day" or "the hottest week" we introduce the concept of event duration. For example, the time step of the parameters in the NCEP/NCAR reanalysis database is 6 hours, so the minimum event duration is also 6 hours, but the event duration could also be 1 day, 1 week, etc. We will apply a moving average to the input parameters, with a time window equal to the event duration, before calculating the one-dimensional MFs and the fuzzy AND aggregation.
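
    A minimal sketch of the smoothing step, assuming a 6-hour time step and a window that covers the current sample and the ones that follow it. The window alignment is an assumption; the text only states that the window length equals the event duration.

```python
STEP_HOURS = 6  # time step of the reanalysis parameters, as stated in the text


def window_for(duration_hours, step_hours=STEP_HOURS):
    """Number of samples covering one event duration."""
    if duration_hours < step_hours or duration_hours % step_hours:
        raise ValueError("duration must be a positive multiple of the time step")
    return duration_hours // step_hours


def moving_average(values, window):
    """Average of each run of `window` consecutive samples."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new sample, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Two days of made-up 6-hourly temperatures, smoothed for a 1-day event duration.
temps = [10.0, 12.0, 14.0, 12.0, 20.0, 22.0, 24.0, 22.0]
daily = moving_average(temps, window_for(24))
```

    The smoothed series would then be passed to the membership functions, so that "the hottest day" is judged on daily means rather than on individual 6-hour samples.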

    Possible use

    The applications of the ESSE system are broad. As more and more data archives become available through projects like CLASS (NOAA), EOSDIS (NASA), DODS (Univ. RI) and other network-accessible data systems, the tools to extract information from them become more valuable. As Nature declared in a 1999 article (Reichhardt 1999), "It's sink or swim as a tidal wave of data approaches". ESSE can help users prepare by providing tools which sift through the vast quantities of data available online and point at the interesting bits. This means that even with the volume of data increasing so rapidly and the number of researchers remaining relatively level, we can hope to extract the most valuable information from the observations and carry it back to the relevant scientific communities.

    The application of fuzzy logic based data tools goes far beyond simple event selection. For example, an ever-present issue when dealing with these large data sets is quality control: the volume is simply too large to screen by hand. Searching capabilities can also be used, for example, to analyze climate trends. Using techniques such as peer-matching and expert systems, we can extend the ESSE to monitor data and alert data managers to changes and anomalies. As the available computational power expands, we can extend the system into areas such as data classification, whereby we can identify modes of the environment and perhaps identify new, unknown relations in specific regions.

    Finally, the emergence of a network infrastructure for data access is providing new opportunities for the scientific researcher. It is now fairly trivial to reach out across discipline boundaries and access data in an immediately usable format. The terrestrial weather community, for example, can use the space data made available through the SPIDR to study the influence of space weather on the Earth's climate. With these opportunities come challenges. As researchers expand into domains in which they may not be expert, they will come to rely on intelligent tools to support them.

    The mission of the ESSE is fundamentally to help a user distill the vast amount of available data down to a manageable amount of information. Beyond this, however, the ESSE has applications in data quality control, data classification and even forecasting. The growing data volumes of the future demand different techniques to handle them, and the ESSE framework offers users one such technique.

    The found event can be used as a source of a real scenario for computer games and simulators.

    Project deliverables

    1. Web-service for fuzzy logic search engine
    2. Web-services for several authoritative environmental data sources
    3. Prototype web-application
    4. Documentation including user guide, programmer guide, and the system whitepaper
    5. Web links for free download of the software sources and the environmental databases

    In order to enable true platform independence, the team is working in both Open Source and Microsoft ASP.NET web services infrastructures, developing scientific software with the same external and internal interfaces. The same portal user interface can consume either implementation.

    The Microsoft .NET Framework includes a comprehensive set of classes that supersedes many of the commonly used open source libraries. This is especially true for XML processing libraries, and it enabled easy creation of wrapper classes that ensure portability of the source code created in this project.

    References

  • DMSO (2003) Homepage of the Master Environmental Library. Available from: http://mel.dmso.mil
  • Fayyad, U. M., Piatetsky-Shapiro, G. & Smyth, P. (1996). From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining, AAAI Press and the MIT Press: Chapter 1, 1-34.
  • Hibbard, W. (1998). VisAD: Connecting people to computations and people to people, Computer Graphics 32, No. 3, 10-12.
  • Hilty, L. M., Page, B., Radermacher, F.J. & Riekert, W.-F., Ed. (1995). Environmental Informatics as a New Discipline of Applied Computer Science. Environmental Informatics, Kluwer Academic Publishers.
  • Jang, J.-S. R., Sun, C.-T., Mizutani, E. (1997). Neuro-Fuzzy and Soft Computing, Prentice Hall.
  • Kalnay, E. et al. (1996). "The NCEP/NCAR 40-year reanalysis project." Bull. Am. Meteorol. Soc. 77: 437-471.
  • Kihn, E. A., Zhizhin, M. N., Lowe, S. J., Troussov, A. V. & Englebretson, R. E. (1999). The Weather Scenario Generator, International Conference on Web-Based Modeling and Simulation, January 17-20, 1999.
    Available from: http://ideas.ngdc.noaa.gov/ideas/papers/websim99/WebSim.htm
  • Kihn, E. A., Zhizhin, M., Siquig, R. & Redmon, R. (2004). The Environmental Scenario Generator (ESG): a distributed environmental data archive analysis tool, CODATA Data Science Journal, Volume 3, 4 February 2004, pp. 10-28.
    Available from: http://ideas.ngdc.noaa.gov/ideas/papers/DS259.pdf
  • National Oceanic and Atmospheric Administration (NOAA), Homepage of the Comprehensive Large Array-data Stewardship System (CLASS).
    Available from: http://www.saa.noaa.gov/cocoon/nsaa/products/welcome
  • National Aeronautics and Space Administration, Homepage of the EOSDIS Core System (ECS) project.
    Available from: http://edhs1.gsfc.nasa.gov/
  • National Center for Supercomputing Applications, Homepage of the Hierarchical Data Format (HDF).
    Available from: http://hdf.ncsa.uiuc.edu/
  • Reichhardt, T. (1999). Nature: 517-520.
  • Sun Microsystems, Inc. (SUN2), Homepage of the Java API for XML-Based RPC (JAX-RPC).
    Available from: http://java.sun.com/xml/jaxrpc/
  • Unidata Program Center c/o University Corporation for Atmospheric Research, network Common Data Form (netCDF).
    Available from: http://www.unidata.ucar.edu/packages/netcdf/
  • University of Rhode Island, Homepage of the DODS---Data Access Protocol,
    Available from: http://www.unidata.ucar.edu/packages/dods/
  • World Wide Web Consortium (W3C), Web Services.
    Available from: http://www.w3.org/2002/ws/
  • Zadeh, L. (1965). Fuzzy Sets, Information and Control, Vol. 8: 338-353.
  • Zhizhin, M., Mishin, D., Kokovin, D., Polyakov, A., Kihn, E. & Redmon, R. (2004). Open Modular Interactive Mapping Technology for Visualization of Geophysical Data on the Internet, Computer Graphics and Geometry, Volume 6, Spring 2004, pp. 25-49.
    Available from: http://clust1.wdcb.ru/papers/openMap/index.html

    ESSE Enabling Technologies

    XQuery: an XML query language http://www.w3.org/TR/xquery/
    http://exist.sourceforge.net/
    Microsoft .NET http://www.microsoft.com/net/
    Microsoft SQL Server 2005 http://www.microsoft.com/sql/
    Web services http://www.w3.org/2002/ws/
    http://msdn.microsoft.com/webservices/
    http://www-130.ibm.com/developerworks/webservices/
    OGSA DAI http://www.ogsadai.org.uk/
    Axis web services container http://ws.apache.org/axis/index.html
    RDBMS MySQL http://www.mysql.com
    Metadata standards http://www.fgdc.gov/metadata/metadata.html
    Semantic Web for Earth and Environmental Terminology http://sweet.jpl.nasa.gov/ontology/
    OpenGIS Web Map Service http://www.opengeospatial.org/
    http://mapserver.gis.umn.edu/