Just as fixing a software bug gets exponentially more expensive to correct over time, bad data requires more expense to process and correct after ingestion. One of the best ways to improve accuracy on large data sets is to improve the data during ingestion. When ingesting massive amounts of data; identification, validation, and standardization are often only minimally performed or skipped completely due to the resources needed to perform these steps when the incoming data format is not clearly defined or the fields are not strongly typed.
To validate data, it is essential to be able to recognize known prefixes, suffixes, or substrings in order to validate and standardize the input fields. Typical ways of doing this involve trying subsets of the target string to see if it matches a known value using hash maps, tree-based maps, regular expressions, or remote databases. While functional, these all force users to trade off run time performance and initialization performance.
To improve efficiency, we have developed a program call Fast Data Recognizer (FDR). FDR makes validating massive amounts of data possible with the performance required to scale to very large data sets. FDR combines the best properties of all of the above methods. For a given set of known values and their corresponding metadata (validated/standardized format) , there is a one-time initialization where all of the byte sequences (of which there can be millions) and their corresponding metadata are ‘compiled’ into a memory-mappable state machine file. This allows:
- Nearly zero-cost initialization, simply map the file to memory and it is ready to use, with blocks being paged into memory on-demand and all processes on the machine sharing the same memory pages.
- Look up speed similar to a regular expression which also uses a state machine.
- Simultaneously matching on known values of all lengths from a given starting point in the target string – no need for sequential hash or map lookups.
When a match is found, a callback tells the client code where and what was found and passes the metadata for the known sequence.