The State of Environmental Data Management

By Philip Edwards and Paul Mills

Environmental Testing & Analysis
September/October 2000

Why do we concern ourselves with electronic data? It is rapidly becoming the “lingua franca” of the environmental industry, the common medium used for collection, processing, transmission, translation, reporting, and storage of information about the environment. Its use and misuse can determine the success of site investigations, decision-making, and cleanups. It offers the desired goals of rapid transmission and ease of use, but the underlying data quality may be completely unknown. The computer’s big advantage over manual processing is speed, but it makes no distinction about quality—“garbage in, garbage out” still applies.

The long-standing punch line in the lab business has been “Turnaround, quality, price…pick two.” As the industry has matured, data users have become more sophisticated and are now starting to demand all three. Whether at a traditional, fixed-site lab or in an on-site or mobile facility, turnaround time is more likely to be measured in days rather than weeks. Sometimes a job requires that results be available within hours of collection. Methods may require confirmation of results at an off-site facility, again necessitating fast turnaround time. Today’s data users want to see more data, faster, cheaper, and with no mistakes, in the specified format the first time, every time. What does this portend for the future of data collection, validation, processing, reporting, and use in the environmental industry?

What Data Are We Really Using?
One of the common problems facing data managers is that two distinct data sets result from the hardcopy laboratory report and the electronic data deliverable (EDD). While both the hardcopy and electronic data originate from the same laboratory data systems, it is not unusual for them to take very different journeys to the data manager’s desk. Generally, the hardcopy report is generated directly from the laboratory information management system (LIMS), reviewed and approved by the appropriate supervisors, and released by the laboratory manager. The electronic data, on the other hand, is exported from the LIMS and is typically imported into an external utility in which the data is formatted, mapped, rounded and translated to meet the requirements of the EDD specification. To make matters worse, the final EDD very rarely undergoes the same level of review as the hardcopy.

Thus, we have two data deliverables generated using different procedures, and receiving very different levels of review. And almost without exception, the level of quality of the hardcopy can be expected to exceed that of the EDD. Unfortunately, the lower quality electronic results are what end up in the hands of end-users and decision-makers while the higher quality hardcopy ends up in a warehouse gathering dust.

Data to Information to Knowledge…
In many industries, “knowledge management” has become a hot topic. Knowledge management is the third step in the evolutionary process: data management to information management to knowledge management. Unfortunately, this is an area in which the environmental industry lags significantly behind other industries. Until we are able to consistently collect data of at least minimally acceptable quality in a standardized data structure, we will not be ready to move on to the information stage, let alone focus our attention on knowledge management.

Lack of Use of Standardized Electronic Data Format
In order to use our electronic data, we first have to get our electronic data, a task with which we have been collectively struggling for over a decade. There are many factors influencing the transmission of data from the generator to the user. These include the requirements of the agency for whom work is being performed and data reported, the structure of the intermediate database used by contractors performing work, and the ability of the laboratory generator to produce reliable data files that are consistent with other deliverables. Obtaining high-quality electronic data has been a topic of interest at many conferences, committees and workgroups during the last few years, and standardizing the format of transmission generally finds its way to the center of the discussion.

This issue is of particular interest to the laboratory community, primarily because they are the largest generators of data in almost any environmental program, and they are regularly under contract to multiple firms who request widely varying electronic and hardcopy deliverables. The efforts required to produce thousands of records reliably in every imaginable data format are significant, and the difficulty in reviewing multiple deliverables for accuracy in content and structure often results in problems for the various data users that are not detected prior to release of data by the generators.

The benefits of using only one data transmission format for all environmental projects are clear; the reasons that the industry has not standardized are more complicated. Historical databases exist nationwide, most using different data structures and data dictionaries. Many of the large environmental contractors have developed proprietary tools and reports for use with their own systems. Nevertheless, the costs of obtaining data in a custom format, coupled with the high requirements for complete and correct electronic data, are forcing the issue of standardization.

Some groups, including the American Council of Independent Laboratories (ACIL), have formed committees to generate a new ‘standard’ to introduce to the environmental industry, but to date, these efforts have not been met with overwhelming success. The complexities of environmental data require a significant effort in the understanding of the final use of the data by the many groups of professionals who rely upon it. Further, the task of documenting and maintaining dictionaries to support a new format should not be underestimated. Therefore, it is to our advantage to consider the use of an existing data format. If a new product is found to be a necessity, then, as in the development of any product, it is reasonable to start with the best foundation currently available and build up, as opposed to starting ‘from scratch’. While there are hundreds of varieties of proprietary and customized formats, a few standards form the basis of the genres of environmental formats:

Agency Standard: This format has time on its side – it has been used throughout the USEPA’s Contract Laboratory Program (CLP) for more than ten years. Tools such as Computer Assisted Data Review (CADRE) have been developed by government agencies to provide automated evaluation of data quality and produce reports for project deliverables. There are drawbacks to this format, however, such as its complexity and the fact that it centers around the CLP, thus limiting its use in the wide range of projects being executed under other programs. Finally, this is a transmission format only; the Agency Standard is not associated with a database format for the purpose of warehousing and using the data it contains.

Department of Energy (DoE) Environmental Electronic Master Specification (DEEMS): DEEMS has been developed to be everything the Agency Standard is not, and then some. Where the CLP program is strictly defined, with a complex record definition and format, the DEEMS deliverable is described as being just about anything the user wants it to be. It is a format that is defined by data tags, and is based on implementations developed for each analytical method. From there, any information desired by the user can be captured, simply by defining the information and providing it. The drawbacks of this format are similar to those of the Agency Standard only in that it is also a transmission format, not associated with a database model. The problems that may be encountered in the use of the DEEMS deliverable include extremely large file sizes, difficulty in obtaining readers or translators for extracting the data, and complications in incorporating the wide range of captured data into a useful database.

Environmental Resources Program Information Management System (ERPIMS) - formerly the Installation Restoration Program Information Management System (IRPIMS): The Air Force’s deliverable was designed and is currently maintained primarily with the needs of its own end users in mind. However, its end users have collected a significant amount of data through the use of this deliverable. With millions of records transmitted to an intensely screened central database located at Brooks Air Force Base, ERPIMS can probably safely claim to have the most mileage of the formats described here. As a result, many of the more subtle problems inherent to environmental data have already been encountered and dealt with by the users of the ERPIMS system. It should be noted that the changes incorporated into the latest version of this specification have added flexibility, which may allow other programs and agencies to adopt this format and capture information not typically required under the Air Force Center for Environmental Excellence (AFCEE) program.

The information above points toward the acceptance of the ERPIMS format, at the very least, as the basis for a standardized data structure. Other factors should also be considered; a key point is that ERPIMS is the de facto electronic data format in the environmental industry. It is notable that the majority of laboratories and A/E firms doing work in the DoD arena have adopted much of the nomenclature, basic structure, and data dictionaries provided under the ERPIMS specification. The data transmission format has been adapted to support information not required under the current Air Force program, such as calibration data. Finally, the fact that the specification provides the structure of a complete relational environmental database that can be set up on nearly any database platform is a benefit that should not be overlooked.

Variable Electronic Data Quality
The level of electronic data quality directly affects the final outcome of an environmental project; unfortunately, in many cases, poor data quality can cost a project more than would have been expended without any automation efforts at all. In order to direct a project towards success, it is helpful to understand some of the problems inherent to electronic data collection and management.

As is to be expected, some data problems are the result of systematic problems in the laboratory, for example, misunderstandings of criteria and their application, or scheduling and resource difficulties. The impacts of these problems can be severe, requiring re-sampling and re-analysis, and causing program delays and overruns. Others, however, are more difficult to detect, and have an equal if not greater impact on system automation. These include discrepancies between the electronic data system and the systems generating hardcopy reports, minor deviations from the data dictionaries defining the values to be entered into the data file, and invalid combinations of otherwise valid entries in the data system. These types of problems can derail an automated system with an efficiency rarely seen in any industry.

Data collection difficulties are not limited to the laboratories, however. It is not uncommon to find that field data is as problematic as anything generated by the laboratory. Often, although field data may be accurately recorded on chains of custody, the information is not entered into a project database in a structured and timely fashion.

Typically, the best solution to these types of problems is to run incoming data through a screening utility that enforces the requirements of the chosen standard. In many cases, checking valid values and performing basic logical verifications such as date sequencing is not enough; it is also necessary to verify that data submitted for a project complies with all requirements of the governing guidance document. This should include meeting requirements for combinations of test, preparation and leach methods, along with target analyte and spiking lists, and project reporting limits.
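The kinds of checks described above can be illustrated with a minimal Python sketch. It is not the behavior of any particular screening product: the field names, method codes, valid-value lists and reporting limits are all hypothetical, and a real utility would load them from the governing specification and project guidance document.

```python
from datetime import date

# Hypothetical data dictionary and project rules (illustrative values only);
# a real project would load these from the governing specification.
VALID_MATRICES = {"WATER", "SOIL"}
VALID_METHOD_PAIRS = {("SW8260", "SW5030"), ("SW6010", "SW3050")}  # (test, prep)
REPORTING_LIMITS = {("SW8260", "BENZENE"): 1.0, ("SW6010", "LEAD"): 5.0}


def screen_record(rec):
    """Return a list of problems found in one result record (empty if none)."""
    problems = []
    # Valid-value check against the data dictionary.
    if rec["matrix"] not in VALID_MATRICES:
        problems.append(f"invalid matrix {rec['matrix']!r}")
    # Date sequencing: collection, then preparation, then analysis.
    if not rec["collected"] <= rec["prepared"] <= rec["analyzed"]:
        problems.append("dates out of sequence")
    # Combination check: each entry may be valid alone yet invalid together.
    if (rec["test_method"], rec["prep_method"]) not in VALID_METHOD_PAIRS:
        problems.append("test/prep method combination not allowed")
    # Target analyte list and project reporting limit.
    limit = REPORTING_LIMITS.get((rec["test_method"], rec["analyte"]))
    if limit is None:
        problems.append("analyte not on target list for this method")
    elif rec["reporting_limit"] > limit:
        problems.append("reporting limit exceeds project requirement")
    return problems


good = {
    "matrix": "WATER", "test_method": "SW8260", "prep_method": "SW5030",
    "analyte": "BENZENE", "reporting_limit": 1.0,
    "collected": date(2000, 6, 1), "prepared": date(2000, 6, 3),
    "analyzed": date(2000, 6, 4),
}
# Same record with four deliberate errors introduced.
bad = dict(good, matrix="SLUDGE", prep_method="SW3050",
           analyzed=date(2000, 6, 2), reporting_limit=2.0)

for name, rec in (("good", good), ("bad", bad)):
    print(name, screen_record(rec) or "passes screening")
```

Returning every problem found, rather than stopping at the first, lets the generator correct and resubmit a file in one pass.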

It should be clear that of all project participants, the actual data generator is not only the most qualified to ensure the accuracy and completeness of his data, but is also the only one that may legitimately do so. It is not uncommon for data submitted by a sampler or a laboratory to be edited by a data manager in order to meet final reporting requirements, but this practice introduces as much error and liability as it relieves, with the final result being a net loss of legal defensibility and data integrity.

The result of enforcing data quality through comprehensive screening can also be measured financially. Validation for many large environmental projects is often budgeted at the same level as the analytical costs, meaning it is just as expensive to look at data as to produce the results. Using electronic loading and review processes, however, a contractor can produce high quality electronic data at a fraction of the cost of manually producing a package with an equivalent confidence level.

Data Access Limitations
As data management consultants, we frequently find ourselves in meetings discussing the needs of data users, and the topic of data access usually comes up. We have found that there are two camps, those who think that access to electronic data is a major problem, and those who think that there is no accessibility problem at all. Invariably, the people who think there is no problem have direct access to the database - the people with the problem would be everyone else.

In general, there are two levels of user access to data management systems. The first is indirect access in which the end-user submits a written or verbal request to a data manager, who queries the database for the appropriate information and returns it in the form of either an ASCII export, or some type of report. The second method is controlled, direct access to the database through a graphical user interface (GUI). The most common problem that arises from restricting end-users to indirect access is that it does not always result in timely access to data – requests can sometimes take weeks to be filled, depending upon workload and the complexity of the request. Another telltale sign of not allowing direct access is the creation of multiple, out-of-sync databases.

The Internet has become the ultimate data delivery platform. Commercially available environmental data management tools allow data generators (samplers, laboratories, surveyors, and so on) to electronically transmit data through the Internet to remote servers that apply validation decision rules to qualify it, and make it available immediately to authorized users. In addition to delivering data, such systems allow generators to fully participate as project team members. Laboratory project managers and Quality Assurance Officers (QAOs), for example, can query their laboratory’s data in real time to view the results of data validation, or to check completeness levels.

Uncontrolled Secondary Data Usage
Of increasing concern are data originally collected for a specific purpose or decision, but then made available to others and used for other reasons—secondary data usage. Original data qualified as “estimated” or other data that were “rejected”, may be suitable for other purposes if there are sufficient details about collection, preparation, and analysis, and flagging criteria. Much of the current data in environmental databases is not meta-tagged to provide decision-makers or secondary users enough information to use it properly. Details of sampling, estimates of analytical uncertainty, and descriptions of data transformations or reduction are not tied to the data currently residing in many environmental databases. Third-party validators may not know the sampling details or the analytical uncertainty or how the data may be used beyond its immediate collection purpose. Those performing a final data quality assessment are focused on their particular project, and less on how that data may be of use to others.
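As a rough illustration of what meta-tagging might look like, the Python sketch below attaches sampling and uncertainty details to a result so that a secondary user can judge it from the supporting information rather than from the flag alone. The field names, the `usable_for` rule and its threshold are assumptions made for this example, not an established industry scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One analytical result with hypothetical metadata fields."""
    analyte: str
    value: float
    units: str
    qualifier: str = ""                      # e.g. "J" estimated, "R" rejected
    qualifier_reason: Optional[str] = None   # why the flag was applied
    sampling_method: Optional[str] = None    # how the sample was collected
    uncertainty: Optional[float] = None      # same units as value


def usable_for(result, max_uncertainty):
    """Hypothetical secondary-use rule based on the metadata, not the flag."""
    # Without sampling and uncertainty details the result cannot be judged.
    if result.sampling_method is None or result.uncertainty is None:
        return False
    # A flag with no recorded reason cannot be interpreted by a new user.
    if result.qualifier and result.qualifier_reason is None:
        return False
    return result.uncertainty <= max_uncertainty


tagged = Result("LEAD", 12.0, "mg/kg", "J",
                qualifier_reason="matrix spike recovery low",
                sampling_method="grab", uncertainty=3.0)
untagged = Result("LEAD", 12.0, "mg/kg", "J")

print(usable_for(tagged, max_uncertainty=5.0))
print(usable_for(untagged, max_uncertainty=5.0))
```

The same “estimated” value is accepted when its supporting details travel with it and refused when they do not.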

Legal Defensibility Safeguards
With our increased reliance on electronic data, the issues of legal defensibility, chain of custody, security and access will need to be resolved. Attempting to address these issues are at least four major statutes, each of which requires federal agencies to develop electronic records: Paperwork Reduction Act, Government Paperwork Elimination Act, Electronic Freedom of Information Act and the Digital Signature Act/Digital Copyright Act.

The two main questions that arise with any electronic data file are: who generated it, and has anyone altered it? (See Sidebar, Electronic Records Requirements). A number of companies have developed security products that have successfully addressed these issues in many different industries. For example, Verisign (www.verisign.com) offers secure server certificates that enable secure transmission between remote clients and centralized database repositories, and digital signatures that authenticate the originator of electronic files.
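Both questions can be illustrated with Python’s standard library. The sketch below is a simplified stand-in, not a description of any commercial product: it uses a keyed hash (HMAC) with a shared secret, whereas true digital signatures rely on public-key cryptography and certificates. The file contents and key are invented for the example.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a file's contents; changes if any byte changes."""
    return hashlib.sha256(data).hexdigest()


def sign(data: bytes, key: bytes) -> str:
    """Keyed digest that only a holder of the shared key can produce."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """True only if the data is unaltered and was tagged with this key."""
    return hmac.compare_digest(sign(data, key), tag)


# Hypothetical deliverable contents and laboratory key.
edd = b"SAMPLE-001,BENZENE,1.2,ug/L\n"
lab_key = b"shared-secret-held-by-the-lab"

tag = sign(edd, lab_key)
print(verify(edd, lab_key, tag))                       # unaltered file
print(verify(edd.replace(b"1.2", b"0.2"), lab_key, tag))  # edited after signing
```

The tag travels with the file; a recipient holding the key can confirm both the originator and that the contents have not been edited since release.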

NELAC and ISO 17025 Implementation
Key considerations will evolve from the adoption by the National Environmental Laboratory Accreditation Conference (NELAC) and other accrediting bodies of the new International Organization for Standardization’s (ISO) 17025 requirements.

The laboratory must establish and maintain procedures for identification, collection, indexing, access, storage, maintenance and disposal of quality and technical records, including original observations, derived data, sufficient information to establish an audit trail, calibration records, staff records and a copy of each report issued for a defined period.

In addition to the normally required reporting elements, test reports containing the results of sampling shall include, where necessary for the interpretation of test results, the following information: date of sampling; unambiguous identification of substance, matrix, material or product sampled; location of sampling, including any diagrams, sketches or photographs; reference to the sampling plan used; details of any environmental condition during sampling that may affect the interpretation of the test results; identification of the sampling method or procedure used; any standard or other specification for the sampling method or procedure, and deviations, additions to or exclusions from the specification.

Specific requirements that will affect data management within the lab, and by its customers, include uncertainty estimation, which means all data reported will have some sort of descriptors that can then be converted into “meta-tags” for database inclusion, and use by others (secondary data use). The implication is that the industry will have to agree on what descriptors and metrics to apply and report.

These requirements will make it easier for labs to implement Performance Based Measurement Systems (PBMS), and develop project-specific methods as needed for individual clients and sample types. Consequently, data management systems must be able to accommodate a wider range of method characteristics and descriptions than just an EPA method identifier.

Quality Improvement
ISO 17025 focuses on internal improvements within the testing laboratory, with corrective and preventive action programs that require client communications to satisfactorily resolve problems. There must be some mechanism for feedback and improvement to enhance the overall data management process so that mistakes are eliminated and better systems and processes are continually added. A review of hardcopy production systems isn’t sufficient. Examine the flow of data through the lab’s electronic processes and out to the clients. Verify that checks are established at key process points and continually monitored, with feedback to allow changes.

Future Challenges
What do we want data management to look like in the future, and how will we get there? Paper will be reduced or eliminated. All information about samples and sampling can be reduced to electronic formats in the field. The sampling team can e-mail electronic chain-of-custody forms and associated information in advance of sample receipt at the lab. The lab can upload the information into the LIMS to start the sample tracking process. The lab will supplement the field data with the analytical results from processing and testing the samples. The clients’ (or subcontractors’) data processing systems will evaluate electronic data supplied by labs. This will be in several forms, including the current EDDs. Automated systems for review of analytical results against standardized criteria will be required. These systems will have the decision rules built into the programs that process electronic data. Where chemists’ judgment in the past has determined the “goodness” of the data, these decision rules will be made specific enough to substitute for that judgment. For those criteria that required a chemist to examine lab hardcopy output, such as chromatograms and spectra, for various peaks and patterns, the computer will receive electronic versions of the output that in turn will be compared to decision rules. Pattern recognition and other tools will be applied.

For the labs, this means a concomitant reduction or elimination of paperwork, too. Hand-held computers for direct input of sample processing information will replace paper logsheets and workbooks. Output from all lab instruments will be tied to LIMS collection and reporting systems to avoid hand entry. Electronic capture of internal reviews will be documented. Data files for Sample Delivery Groups (SDGs) will be stored in the lab, after zipping and shipping by e-mail to the client. The client may then forward them to the data validation team, who will process the data on a server set up with programs containing the decision rules. The server will apply the rules and qualify data accordingly. Any errors, random or systematic, will be flagged for attention and follow-up. Validation results will be applied to the specific case/SDG reviewed, but QC sample data will also be copied into other databases—sample tracking, individual lab and method performance statistics, and so on. Reports from the server will include comments on the data’s acceptability, based on decision rules derived from the data quality objectives (DQOs) process; what work is accepted or rejected, and the payments to be made; what work is to be repeated; and whether other actions are needed.
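A server-side rule set of the kind described here might look something like the Python sketch below. The acceptance limits, field names and the use of a “J” (estimated) qualifier are assumptions for illustration; in practice the rules would be derived from the project’s DQOs.

```python
# Hypothetical acceptance criteria; real values come from the project's DQOs.
SURROGATE_LIMITS = (70.0, 130.0)   # percent recovery, inclusive
HOLDING_TIME_DAYS = 14             # maximum days from collection to analysis


def qualify(sample):
    """Apply the decision rules to one sample; return (qualifier, reasons)."""
    reasons = []
    low, high = SURROGATE_LIMITS
    if not low <= sample["surrogate_recovery"] <= high:
        reasons.append("surrogate recovery outside limits")
    if sample["days_to_analysis"] > HOLDING_TIME_DAYS:
        reasons.append("holding time exceeded")
    # Any failed rule marks the result as estimated; none leaves it unflagged.
    return ("J" if reasons else ""), reasons


sdg = [
    {"id": "S1", "surrogate_recovery": 95.0, "days_to_analysis": 5},
    {"id": "S2", "surrogate_recovery": 55.0, "days_to_analysis": 20},
]

for sample in sdg:
    flag, reasons = qualify(sample)
    print(sample["id"], flag or "no flag", "; ".join(reasons))
```

Recording the reason alongside each qualifier keeps the automated decision traceable for later follow-up.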

Conclusion
Data management practices and procedures can’t make the original data better, but can help to clarify, define, categorize, and display it for better-informed decision making. Current challenges point the way to changes that must be made to improve data management and best use the technologies available to the environmental testing industry. For the future, the goals must include general agreement on a standardized electronic data format with defined data elements, establishment of data storage and user access options, concurrence on indicators or flags for reporting the estimated uncertainty of measurements, and guidance for the continuous improvement of data management systems.

Philip Edwards is principal scientist with Synectics, an environmental chemistry and data management consulting firm. Edwards has 15 years’ experience in the environmental industry, specializing in electronic data management, organizational/systems analysis and automation, and database application development. He participated on the team that designed, developed and implemented the industry’s first Internet-enabled data management system, and helped pioneer what is now called the Application Service Provider (ASP) model for delivering technical data services over public networks. Contact him via e-mail at philip_edwards@synectics.net.

Paul Mills is President of Mentorprises Corporation, Ashburn, VA, a consulting firm that offers services in analytical chemistry, QA, data management and environmental investigations. Mills has more than 20 years of experience in managing environmental research and QA programs. He received an EPA Bronze Medal for assistance in establishing the USEPA’s QA program. Mills assisted in environmental investigations at Love Canal and Times Beach, and helped develop and provide QA and technical support to USEPA’s Superfund Contract Laboratory Program. Mills’ professional affiliations include NELAC’s Membership & Outreach Committee and the chairmanship of the Analytical Laboratories Technical Committee of the ASQ Energy & Environmental Division. Contact him via e-mail at paul.mills@mentorprises.com.