Intelligence collection and processing requirements evolve with the geopolitical landscape.

Significant advances in computing power and storage capacity enable us to collect and analyze more information than ever. The speed at which we perform the initial evaluation of raw intelligence data to determine content, value, and standardization requirements, however, has not kept pace. It is a common misconception that today’s extract, transform, and load (ETL) programs automate the data processing steps between collection and availability for analysis.

This is not the case. Even with the most sophisticated ETL programs, an engineer must first evaluate the raw data to determine content, value, and standardization requirements, and then code the results of that analysis into an ETL tool. This is essentially the same process engineers followed 20 years ago. The initial analysis of the raw data, not the writing of the load programs, is the most difficult and time-consuming portion of the ingest process, and it has significantly lengthened the interval between intelligence collection and its availability to end users. Currently, the only way for organizations to make data available more quickly is to hire more engineers, a model that is costly and unsustainable. We need to automate the analytic functions engineers perform when preparing data for ingest into data stores.
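To make the manual step concrete, the following sketch shows the kind of hand-coded transform an engineer might write after profiling a sample file by hand; the field positions, country codes, and sample records are hypothetical and stand in for the results of that manual analysis.

```python
# Hypothetical mapping an engineer hand-codes after manually profiling a sample file:
# column 2 proved to be a collection date, column 5 a country name that must be
# standardized before ingest. In practice these rows would come from the raw feed.
COUNTRY_CODES = {"UNITED STATES": "USA", "UNITED KINGDOM": "GBR", "FRANCE": "FRA"}

sample_rows = [
    ["R-0001", "20240105", "HUMINT", "field office", "United States"],
    ["R-0002", "20240111", "SIGINT", "field office", "France"],
]

def transform_row(row):
    """Apply the manually derived content and standardization rules to one record."""
    return {
        "report_id": row[0].strip(),
        "collection_date": row[1].strip(),  # format confirmed by hand during analysis
        "country": COUNTRY_CODES.get(row[4].strip().upper(), "UNK"),
    }

for row in sample_rows:
    print(transform_row(row))
```

Every rule in a transform like this must be re-derived and re-coded for each new data source, which is precisely the analytic work that needs to be automated.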

CDS is currently developing Content Examiner (CE) for a research and development organization within DoD. CE is a highly automated data profiling and analysis engine that codifies the analytic processes used to determine field content, value, and standardization requirements. CE incorporates data definitions and processing rules based on authoritative reference data and industry standards, which allows automated, on-site (to include remote) data processing and ingestion without a large technical support staff. CE analyzes structured and semi-structured documents (or structured elements within unstructured documents) and uses the document structures to assist in identifying individual data elements. It performs a wide variety of statistical calculations and high-speed data matching routines against customized dictionaries to replicate the processes engineers use to isolate the patterns necessary to identify content.
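As an illustration of the general class of technique described above, and not of CE's actual implementation, the sketch below profiles a column of values against pattern rules and a small customized dictionary, then scores candidate content types by match rate; the pattern names, dictionary entries, and threshold are all assumptions.

```python
import re
from collections import Counter

# Hypothetical pattern rules and reference dictionary; a production profiler would
# draw these from authoritative reference data and industry standards.
PATTERNS = {
    "date_yyyymmdd": re.compile(r"^\d{8}$"),
    "mgrs_coordinate": re.compile(r"^\d{1,2}[A-Z]{3}\d{4,10}$"),
}
COUNTRY_DICTIONARY = {"UNITED STATES", "FRANCE", "GERMANY", "JAPAN"}

def profile_column(values, threshold=0.8):
    """Score each candidate content type by the fraction of values it explains."""
    scores = Counter()
    for value in values:
        v = value.strip().upper()
        for label, pattern in PATTERNS.items():
            if pattern.match(v):
                scores[label] += 1
        if v in COUNTRY_DICTIONARY:
            scores["country_name"] += 1
    best, hits = (scores.most_common(1) or [("unknown", 0)])[0]
    return best if values and hits / len(values) >= threshold else "unknown"

# Usage: infer the likely content of unlabeled columns sampled from a raw file.
print(profile_column(["20240105", "20231229", "20240214"]))  # -> date_yyyymmdd
print(profile_column(["France", "Germany", "Japan"]))        # -> country_name
```

A profiler built on this principle can propose field content and standardization rules automatically, leaving the engineer to confirm or adjust the results rather than derive them from scratch.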