Primary data, the actual empirical results from environmental measurements, are used to make the decision called for in a project’s DQOs. Secondary data are derivative: previously collected data used for decision-making purposes other than their original intent. In general, any use of previously collected data in the decision process falls into this category.
EPA Order 5360.1 CHG 1 (EPA, July 1998) describes "secondary data" as:
"environmental data collected for other purposes or from other sources, including literature, industry surveys, compilations from computerized data bases and information systems, results from computerized or mathematical models of environmental processes and conditions."
What must you know about the various primary data sets--censored data, outlier treatment, estimated results, screening data, etc.? Standards are needed that allow users to decide whether the data are suitable for their (secondary) use. What should those standards be? How can they be developed and applied to environmental data? Is data quality adequately described by DQO levels and PARCCS indicators? What about other data without such descriptors--caveat emptor? How can you determine that the data collected by someone else for a different purpose can meet the data quality needs of your project? How will projects, regulations, and programs involving secondary data usage be prioritized, compared to those engaged in primary (empirical) data gathering? What gets done first, and who decides? Congress and the EPA Administrator? Individual program office heads?
Send your comments about this month's Forum topic to paul.mills@mentorprises.com.
The following excerpts from recent news items indicate the types of problems associated with data management for secondary uses:
From GOLOB's Environmental Business Week, May 7, 1999, Vol. XX, No. 19, "EPA issues preliminary RCRA hazardous waste data." "EPA's Office of Solid Waste and Emergency Response (OSWER) posted the preliminary RCRA hazardous waste [data] as a series of data files on its Internet home page on April 28. OSWER emphasized that the states have the responsibility for collecting and verifying the data. Environmental industry analysts contacted by EBW warned that the data is not in its final form and is likely to contain numerous errors. 'Use it cautiously,' advised Cary Perket, president of Environmental Information Ltd. in Minneapolis, Minnesota. 'Once they go in and do their proofing, they find data-entry errors,' he told EBW. 'It's nice to get the information out there, but it has to be qualified.'"
From GOLOB's Environmental Business Week, April 23, 1999, Vol. XX, No. 17, "EDF: Americans face substantial air-toxics risks." "More than 220-million Americans are exposed to toxic air pollutants, or 'air toxics,' at levels that are at least 100 times the permissible levels set by Congress as a goal in the Clean Air Act (CAA) Amendments of 1990, according to the Environmental Defense Fund (EDF). Basing its findings on EPA data, EDF also concluded that as many as 11-million Americans face cancer risks from air toxics that are 1,000 times the goals specified in the 1990 CAA Amendments." ... "In January, EPA suspended the release of a report summarizing the Agency's Cumulative Exposure Project (CEP) data--a decision criticized by EDF, some state air-quality officials, and even the report's authors (EBW, Jan. 22, 1999, p. 1). EPA justified its decision on the basis that the report's data was old--dating back to 1990--that the data were primarily accumulated to test a new computer model, and that the data do not reflect the progress achieved in reducing air-toxics emissions since 1990. Although a print report summarizing the data is not available, EPA has posted the CEP data on the Internet at http://www.epa.gov/oppecumm." According to EDF and some state air-quality experts, more recent data on air-toxics emissions from eight states--with some measurements as recent as 1997--indicate that the CEP data has a high degree of accuracy. Where the CEP is inaccurate, it more often underestimates the air-toxics emissions levels in US census tracts than overestimates the levels, EDF said.
Toxic Release Inventories and information from EPA reports have been used by independent environmental activist organizations to provide Web-accessible maps of "Hazardous Waste Sites in your Zip Code." The data may not be qualified regarding timeliness, currency, quality, or associated risk. It may be dangerous to release unqualified information for general use. The quality of the data in the database that would be included in such reports must be known and documented. The quality of the primary data in any database is the paramount consideration for any secondary usage. Identify and segregate data by quality, before it is stored in any database! The following are additional recommended guiding principles developed in response to questions about secondary data usage.
Recommended General Principles for Secondary Data Use:
Data Format: Data may be in summary form, electronic format (diskettes, tapes, computer hard drives), or raw form. It may be complete in one set or spread over time across many data sets (multiple SDGs in a case) by a particular analytical method or several methods. The user must specify the data desired.
Data Quality Responsibility: Primary responsibility is held by the originators of the data. Each responsible party in this process has the responsibility not to misuse the data, whether by twisting it from its original purpose or by stretching it to cover more than it supports. How the data and its qualifiers are used later by others must be addressed by providing as many indicators of data quality as possible with the original data set.
Primary Data Quality Description: Data quality can be described by DQO levels, PARCCS indicators, and other data acquisition elements. In EPA’s QA/R-5, October 1998, Elements B1 through B10 list the manner in which data acquisition elements (primary data use) are to be described:
Element B1, Sampling Process Design, says to classify all measurements as either "critical," that is, required to achieve project objectives, or "non-critical," that is, for informational purposes only.
The method citations should specify exactly which options were selected for the sampling (B2) and analytical (B4) methods.
B5, Quality Control, requires specification of the QC procedures needed for each sampling, analysis, or measurement technique, and how the QC statistics were calculated.
B9 discusses data acquisition requirements for decision-making that uses data from non-direct measurement sources, such as computer databases, programs, literature files, and historical databases. "Define the acceptance criteria for the use of such data in the project, and discuss any limitations on the use of the data resulting from uncertainty in its quality. Document the rationale for the original collection of data and indicate its relevance to this project."
B10, Data Management, is a key in this discussion. "Describe the project data management scheme, tracing the path of the data from their generation in the field or laboratory to their final use or storage…Describe or reference the standard record-keeping procedures, document control system, and the approach used for data storage and retrieval on electronic media. Discuss the control mechanism for detecting and correcting errors and for preventing loss of data during data reduction, data reporting, and data entry to forms, reports, and databases."
Data Qualification: Flag the data with qualifiers, and add the "reason codes" to show why it was qualified. Also indicate whether the data are resubmitted/corrected. From QA/R-5:
D1, Data Review, Validation, and Verification Requirements: "State the criteria used to review and validate—that is, accept, reject, or qualify—data in an objective and consistent manner."
D2, Validation and Verification Methods: "Describe the process used for validating and verifying data." If a percentage check was applied, what data were assessed, and which were not? What conclusions and qualifications were made and applied across the entire data set? Was there a semi-automated screen, or full manual validation, or both? How do these agree? The extent of validation may vary depending on individual projects’ needs.
D3, Reconciliation with User Requirements: "Describe how the results obtained from the project or task are reconciled with the requirements defined by the data user or decision maker…Describe how issues will be resolved and limitations on the use of the data will be reported."
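The flagging described above can be sketched as a small record type. This is only an illustration, not a format prescribed by QA/R-5: the class and function names are hypothetical, the qualifier letters ("J" estimated, "U" not detected, "R" rejected) are used here as examples, and the reason codes such as "HT" are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Result:
    """One reported measurement plus the qualification metadata that travels with it."""
    sample_id: str
    analyte: str
    value: Optional[float]
    units: str
    qualifier: str = ""                      # illustrative: "J", "U", "R", or "" for unqualified
    reason_codes: List[str] = field(default_factory=list)
    resubmitted: bool = False                # True when this record replaces an earlier submission


def qualify(result: Result, qualifier: str, reason_code: str) -> Result:
    """Attach a qualifier and record why, leaving the reported value unchanged."""
    result.qualifier = qualifier
    if reason_code not in result.reason_codes:
        result.reason_codes.append(reason_code)
    return result


def usable(results: List[Result], rejected=("R",)) -> List[Result]:
    """Keep every record whose qualifier is not in the rejected set."""
    return [r for r in results if r.qualifier not in rejected]


# Hypothetical usage: "HT" stands in for a holding-time exceedance reason code.
lead = qualify(Result("S-01", "lead", 12.0, "mg/kg"), "J", "HT")
zinc = qualify(Result("S-01", "zinc", 40.0, "mg/kg"), "R", "CAL")
kept = usable([lead, zinc])
```

Keeping the qualifier and its reason codes on the same record as the value means a secondary user receives them together, whatever subset of the data is extracted.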
Outlier Treatment: The originators should state if there were any data "censored," or not used for some reason. The metadata tags must indicate data that failed acceptance criteria.
Data Without Descriptors: Data that do not clearly state the data acquisition elements and acceptance criteria are less certain, and should be used with greater care in drawing any conclusions.
Errors Identified in the Measurements or the Measurement System: These must be reported with the original data and maintained with any secondary use, to include adjustments made to reporting limits. The data and reporting limits may be adjusted for % moisture content of samples, dilutions, sub-optimal sample volumes, or other reasons. Quality control information is reported so the user can readily determine the effects of sample matrices upon recovery and reproducibility, and whether any contamination occurred during sample shipment, storage, and processing.
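As an illustration of the reporting-limit adjustments mentioned above, the sketch below applies one common convention: scale the limit up by the dilution factor and by the ratio of nominal to actual sample amount, then divide by the solids fraction to express it on a dry-weight basis. The function and parameter names are hypothetical, and a given laboratory's method may define the adjustment differently.

```python
def adjusted_reporting_limit(base_rl, dilution_factor=1.0, percent_moisture=0.0,
                             nominal_amount=1.0, actual_amount=1.0):
    """Return a sample-specific reporting limit under the assumed convention.

    base_rl           reporting limit for an undiluted, full-size, dry sample
    dilution_factor   e.g. 5.0 for a 1:5 dilution
    percent_moisture  moisture content of the sample, 0 to <100
    nominal_amount    sample mass or volume the method calls for
    actual_amount     sample mass or volume actually prepared
    """
    if not 0.0 <= percent_moisture < 100.0:
        raise ValueError("percent_moisture must be in [0, 100)")
    if dilution_factor <= 0 or nominal_amount <= 0 or actual_amount <= 0:
        raise ValueError("dilution factor and amounts must be positive")
    solids_fraction = (100.0 - percent_moisture) / 100.0
    return base_rl * dilution_factor * (nominal_amount / actual_amount) / solids_fraction


# A 10 ug/kg limit, 1:5 dilution, 20 % moisture: roughly 62.5 ug/kg dry weight.
rl = adjusted_reporting_limit(10.0, dilution_factor=5.0, percent_moisture=20.0)
```

Whatever convention is used, the inputs to the adjustment should be stored alongside the adjusted limit so a later user can see how it was derived.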
Data Maintenance Responsibility: The producer of the original primary data has the responsibility for maintaining its security and integrity. If the data are released in any form, the qualifiers associated with the data acquisition elements must be taken as an integral part of the data. The database should have a tracking system that indicates who accessed it, when, and what changes were made to the data. Database security procedures are essential in maintaining data that can be trusted. Any data reduction of the primary data set must be described, stating the algorithms used and the extent of transformations, their purposes and effect. It must be possible to reconstruct the data back to its original source. For a continuing stream of data into a database over time, version control for methods must document the specific sampling and analytical methods used to produce primary data, and any changes and their effects.
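A tracking system of the kind described above can be sketched as an append-only log kept beside the data. The class below is a hypothetical, in-memory illustration only; a production database would rely on its own access controls and audit facilities.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Key-value store that records who read or changed each value, and when."""

    def __init__(self):
        self._values = {}
        self._log = []      # append-only; entries are never edited or removed

    def _record(self, action, key, user, old=None, new=None, reason=""):
        self._log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": user, "action": action, "key": key,
            "old": old, "new": new, "reason": reason,
        })

    def put(self, key, value, user, reason):
        """Store a value, logging the previous value and the stated reason."""
        self._record("write", key, user, self._values.get(key), value, reason)
        self._values[key] = value

    def get(self, key, user):
        """Return the current value, logging the access."""
        self._record("read", key, user)
        return self._values[key]

    def history(self, key):
        """All log entries for one key, oldest first."""
        return [e for e in self._log if e["key"] == key]

    def original(self, key):
        """The value as first entered, recovered from the log."""
        for entry in self._log:
            if entry["key"] == key and entry["action"] == "write":
                return entry["new"]
        raise KeyError(key)


store = AuditedStore()
store.put("S-01/lead", 12.0, "analyst_a", "initial load")
store.put("S-01/lead", 15.0, "analyst_b", "dry-weight correction")
current = store.get("S-01/lead", "reviewer")
```

Because every change keeps the prior value and a reason, the current figure can be traced back to what was originally reported.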
Modeling Considerations: Is the quality of data acceptable for the model? Are the scales of measurement used by the model consistent with those of the original data? If pollutant transport models, for example, were used, their version must be noted, along with any information about the inputs that applied. Are project-specific data appropriate for use in a programmatic model? Assumptions about the data, and the decisions made, must be clearly stated. For example, "non-detect" data may be treated as "zero," or as ½ the MDL, or the CRDL, or the IDL.
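The effect of the non-detect assumption is easy to demonstrate. In the sketch below, the function name, the rule labels, and the sample values are all hypothetical; `limit` stands for whichever limit (MDL, CRDL, or IDL) the user has chosen.

```python
def substitute_non_detects(results, rule, limit):
    """Replace non-detects according to a stated rule.

    results  list of (value, detected) pairs; value is ignored when detected is False
    rule     "zero", "half_limit", or "limit"
    limit    the detection or reporting limit chosen by the user
    """
    substitutes = {"zero": 0.0, "half_limit": limit / 2.0, "limit": limit}
    if rule not in substitutes:
        raise ValueError("unknown rule: " + rule)
    return [value if detected else substitutes[rule] for value, detected in results]


def mean(values):
    return sum(values) / len(values)


# Two detections and two non-detects, with an assumed limit of 2.0.
data = [(4.0, True), (None, False), (6.0, True), (None, False)]
means = {rule: mean(substitute_non_detects(data, rule, 2.0))
         for rule in ("zero", "half_limit", "limit")}
# means == {"zero": 2.5, "half_limit": 3.0, "limit": 3.5}
```

The same four results yield three different means, which is why the substitution rule must be stated with any model output or summary statistic.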
QAPP: If environmental measurement data are needed for a particular decision, the proper EPA-approved approach is to prepare a QAPP and establish DQOs. This has been traditional for primary data collection/gathering. If other data, not produced by an experimental design tailored for this decision, are to be used, then those data should be evaluated for this secondary use. If there are statistical metrics associated with the data set that describe its "goodness," then the data set can be examined, post-hoc, to see if it meets the established DQOs for the current project. If a DQA has been done, this should provide the metrics needed to determine if the data set meets the necessary DQOs. If no DQA has been done, then an independent DQA would have to be performed, applying the DQOs for the project awaiting a decision.
Interpretation through statistical analysis or assessment: This is the DQA data-evaluation step described above. It could also apply to the retrospective DQO step. However, if the statistical metrics are lacking, some may have to be imputed, or the data may have to be qualified as "estimated" for further use. A variety of commercial, state, and federal program data may be submitted for EPA consideration for regulations. How trustworthy and complete will it be, having been produced in some cases with less rigorous criteria than in others? Is there a minimum set of criteria to characterize the data for secondary use, and an acceptance range or control limits for those criteria? With less-than-perfect data, you must determine if the set is sufficiently complete and representative, with enough accuracy, precision, and sensitivity, to be useful.
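A retrospective screen of this kind might look like the sketch below. It uses commonly applied definitions of completeness, relative percent difference (precision), and spike recovery (accuracy); the function names and the acceptance limits shown are hypothetical and would in practice come from the project's own DQOs.

```python
def completeness(valid_results, planned_results):
    """Percent of planned results that were obtained and judged valid."""
    return 100.0 * valid_results / planned_results


def relative_percent_difference(a, b):
    """Precision of a duplicate pair, as a percent of the pair's mean."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


def percent_recovery(spiked_result, unspiked_result, amount_added):
    """Accuracy of a spiked sample, as a percent of the amount added."""
    return 100.0 * (spiked_result - unspiked_result) / amount_added


def screen(metrics, limits):
    """Compare each required metric with its (low, high) acceptance range.

    A metric missing from the data set is reported as "not documented",
    so the gap stays visible instead of being silently passed or failed.
    """
    status = {}
    for name, (low, high) in limits.items():
        value = metrics.get(name)
        if value is None:
            status[name] = "not documented"
        else:
            status[name] = "pass" if low <= value <= high else "fail"
    return status


# Hypothetical acceptance limits for the project awaiting a decision.
limits = {"completeness": (90.0, 100.0), "rpd": (0.0, 20.0), "recovery": (75.0, 125.0)}
metrics = {
    "completeness": completeness(18, 20),
    "rpd": relative_percent_difference(10.0, 12.0),
}
report = screen(metrics, limits)
```

Here the historical data set documents completeness and precision but no spike recoveries, so the accuracy criterion is reported as undocumented and the data would need to be qualified accordingly.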
Additional potential burden on a program's QA system: People who would otherwise be evaluating QAPPs and DQOs for primary data gathering will be looking instead at the use of historical data, from various sources, and applying similar criteria. The resources needed will depend upon how much more data are needed for decision-making purposes, which should be determined in the DQOs. If the data are being collected to look at trends rather than to make a specific decision, however, there may be no endpoint. If data from commercial, state, other federal, or even international agencies are considered candidates for inclusion in EPA databases, then the resources needed will be proportional to the mass of data to be evaluated.