Our Approach: A Four-Step Process

  • Files Collected: 100%
  • Files Typically Remaining after Collection and Initial Processing: 45%
  • Files Typically Remaining after Complete Processing: 25%
  • Files Typically Remaining after Evaluation: 14%

Red File Solutions’ four-step approach to information classification empowers clients to initiate, manage and maintain access and accountability for all types and formats of documents in support of information governance initiatives.

By combining advanced visual classification technology with information governance consulting expertise, Red File supports data management, analysis and governance tasks by automating the collection, classification and governance of large volumes of documents in any file structure, format or type.

Throughout the process, the volume of documents is also substantially reduced through the application of technology and talent.
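As a rough illustration of the typical reduction figures listed above, the sketch below applies them to a hypothetical collection of 1,000,000 files. The starting count and all names are assumptions made for illustration; actual reductions vary from collection to collection.

```python
# Typical share of collected files remaining after each stage,
# taken from the figures listed above.
TYPICAL_REMAINING = {
    "collected": 1.00,
    "after collection and initial processing": 0.45,
    "after complete processing": 0.25,
    "after evaluation": 0.14,
}


def remaining_files(total_collected: int) -> dict[str, int]:
    """Estimate how many files remain at each stage for a given starting count."""
    return {
        stage: round(total_collected * share)
        for stage, share in TYPICAL_REMAINING.items()
    }


if __name__ == "__main__":
    # 1,000,000 is a hypothetical collection size, not a figure from Red File.
    for stage, count in remaining_files(1_000_000).items():
        print(f"{stage}: {count:,}")
```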

Step 1: Collect

The collection phase begins with a forensically sound collection of client-specified data and provides initial data processing, including hashing, logging and copying of collected data.

  • Files Collected: 100%
  • Files Typically Remaining after Collection and Initial Processing: 45%

What is a “Data Collector”?

A Data Collector is a USB hardware device that connects to a server on the network to collect potentially relevant data from client workstations, servers and other file shares. Data Collectors reduce the number of files preserved and speed the collection process.

Red File provides clients with a USB device (collector) that connects to the client’s network to collect data. Prior to use, a file is placed on each device that tells the collector which paths or devices to consider. As files are collected they are encrypted and compressed using algorithms that meet the highest security standards. Initial data processing in this phase includes:

SHA Hashing. The collector calculates the SHA hash value for each file it encounters and uses these values as the basis for excluding known system or software-related files, and for deduping the files it collects.

Logging. The collector logs all files it encounters whether or not each file is copied onto the Collector. The log files enable Red File to provide accountability for all files examined.

Excluding Known System Files. Files whose SHA hash values match entries on Red File’s known system file list are logged to record the name, size, location and hash value of the file, but the files are not copied onto the collector.

Onboard Deduping. The collector contains a list of hash values of all files collected by Red File prior to deployment of the collector, and this list of already collected files is supplemented with the hash values of additional files it evaluates. The collector’s log is updated with the file name, file size, location, and hash value of every file that is evaluated, and the collector copies only a single instance of each non-system file. Note that some collection systems may collect only single instances but fail to track information about where all the copies were located. This is not the case with Red File’s collection technology.

Container Files. The collector does not unzip or decompress container files like *.zip or *.rar files to attempt to dedupe files contained within those container files. The onboard deduping is limited to deduping the overall container files.
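The initial processing steps above can be sketched in a few lines. This is a minimal illustration, not Red File’s implementation: the text says only “SHA”, so SHA-256 is an assumption, and the Collector class, its field names, and the action labels are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_of(data: bytes) -> str:
    """Hash file content (SHA-256 is assumed; the source only says 'SHA')."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Collector:
    """Hypothetical model of a collector's onboard processing."""
    known_system_hashes: set[str]
    collected_hashes: set[str] = field(default_factory=set)
    log: list[dict] = field(default_factory=list)
    copied: dict[str, bytes] = field(default_factory=dict)

    def examine(self, path: str, data: bytes) -> str:
        digest = sha256_of(data)
        if digest in self.known_system_hashes:
            action = "excluded_system_file"   # logged, never copied
        elif digest in self.collected_hashes:
            action = "duplicate_not_copied"   # logged so its location is kept
        else:
            action = "copied"                 # first instance of this content
            self.collected_hashes.add(digest)
            self.copied[digest] = data
        # Every file examined is logged, whether or not it is copied.
        self.log.append(
            {"path": path, "size": len(data), "sha256": digest, "action": action}
        )
        return action
```

Container files such as *.zip archives would pass through examine() as single opaque files, which matches the container-level deduping described above.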

Once the collection phase is complete, the next step in the Red File approach to classification is the processing phase.

Step 2: Process

The processing phase takes initially processed data from the collection phase and completes processing on that data through expansion, hashing and clustering.

  • Files Collected: 100%
  • Files Typically Remaining after Collection and Initial Processing: 45%
  • Files Typically Remaining after Complete Processing: 25%

Upon receipt of a collector, the collector’s contents are copied into Red File’s visual classification platform. At this point container files are unzipped or decompressed and another SHA deduping process takes place. Duplicates are then logged and unique files are then clustered.
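A minimal sketch of the expansion and second deduping pass, assuming zip containers held in memory and SHA-256 hashing. The function name, the "container!member" location format, and the return shape are hypothetical and chosen only for illustration.

```python
import hashlib
import io
import zipfile


def expand_and_dedupe(
    containers: dict[str, bytes], seen: set[str]
) -> tuple[dict[str, tuple[str, bytes]], list[tuple[str, str]]]:
    """Unzip each container, hash its members, and separate unique files
    from duplicates. `seen` holds hashes already known from collection."""
    unique: dict[str, tuple[str, bytes]] = {}
    duplicates: list[tuple[str, str]] = []
    for container_name, raw in containers.items():
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                data = archive.read(info)
                digest = hashlib.sha256(data).hexdigest()
                location = f"{container_name}!{info.filename}"
                if digest in seen:
                    duplicates.append((location, digest))  # logged, not kept
                else:
                    seen.add(digest)
                    unique[digest] = (location, data)      # passed to clustering
    return unique, duplicates
```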

The Red File visual technology platform clusters the single-instance, unique files based on visual similarity. The documents in each cluster look substantially alike and evaluations of all the documents in each cluster can be made by examining one or two documents per cluster.
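The idea of self-forming clusters can be illustrated with a toy example. The sketch below is not Red File’s algorithm, which is not described here; it assumes each document has already been rendered to a small grayscale thumbnail (a flat list of pixel values) and uses a simple greedy threshold on mean pixel difference. The function names and the threshold are hypothetical.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel difference between two equally sized thumbnails."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster_by_appearance(
    thumbnails: dict[str, list[int]], threshold: float = 10.0
) -> list[dict]:
    """Greedy clustering: a document joins the first cluster whose exemplar
    it resembles closely enough, otherwise it starts a new cluster."""
    clusters: list[dict] = []
    for doc_id, pixels in thumbnails.items():
        for cluster in clusters:
            if mean_abs_diff(pixels, cluster["exemplar"]) <= threshold:
                cluster["members"].append(doc_id)
                break
        else:
            # No existing cluster is similar enough; this document seeds one.
            clusters.append({"exemplar": pixels, "members": [doc_id]})
    return clusters
```

Because clusters form as documents arrive, a reviewer could inspect the exemplar of each cluster without waiting for the whole collection to be processed.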

Key features of this phase include:

Comprehensiveness. While all documents can be represented visually, not all document files have associated text. In some collections, 20-30% or more of the documents may have no text, e.g., image-only PDF or TIF files. Visual classification is the only technology that bases its clustering on the visual representations of documents. The technology normalizes the documents so they can be compared regardless of the type of file in which they were located. Other technologies are text-restricted, meaning they examine or evaluate only the text associated with the document files and hence will have limited or no ability to cluster or classify documents that have no text or have poor quality text.

Scalability. Processing throughput can be increased by adding more computing resources.

Data-driven Automation. The clustering is automatic. The groups are self-forming without direction from operators or consultants. Clients can begin working with the clusters within just hours after the beginning of the clustering process, dramatically shortening project launch timelines.

Exportability. The cluster IDs for each cluster can be exported to downstream platforms to enable them to make use of the visual similarities of documents in the same clusters. For example, in eDiscovery review, reviewers can be assigned documents from the same clusters making the review faster and far more consistent than would otherwise be the case.

Once the processing phase is complete, the next step in the Red File approach to classification is the evaluation phase.

Step 3: Evaluate

The evaluation phase takes processed and clustered data and enables knowledge experts to evaluate data based on information requirements, document types, and document attributes.

  • Files Collected: 100%
  • Files Typically Remaining after Collection and Initial Processing: 45%
  • Files Typically Remaining after Complete Processing: 25%
  • Files Typically Remaining after Evaluation: 14%

Information Requirement (In and Out) Evaluation

As clusters are formed by the first documents being processed, client knowledge workers or subject matter experts begin evaluating them, starting with the clusters that contain the most documents. The nature of the evaluation will vary with the purpose of the process. For information governance reviews, the evaluation is whether the documents have ongoing business, regulatory, or legal value and hence ought to be retained, or whether they can be disposed of. In eDiscovery reviews, the evaluation will be whether the documents are relevant to a document request or otherwise relevant to the litigation.

Two key features of in and out evaluation are:

Persistence. Evaluation decisions are persistent, meaning they will be applied to documents that are subsequently added to the clusters. The persistence characteristic permits clients to begin evaluation as soon as the clusters start forming, i.e., they do not need to wait until all processing has concluded.

Rolling Intelligence. Decisions made in one information governance initiative can be rolled forward into subsequent decisions. For example, if a cluster is evaluated as a record during a file share remediation initiative, that decision does not have to be made again in a content migration initiative or a paper archive digitization initiative. Projects keep getting easier to do because more and more intelligence is accumulated and rolled forward.
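Both features can be modeled by storing decisions against clusters rather than against individual documents. The sketch below is a hypothetical illustration: the class and method names are invented, and rolling decisions forward assumes that cluster identifiers stay stable from one initiative to the next.

```python
from typing import Optional


class ClusterDecisions:
    """Hypothetical store of in/out decisions keyed by cluster."""

    def __init__(self, prior_decisions: Optional[dict[str, str]] = None):
        # Rolling intelligence: start from decisions made in an earlier initiative.
        self.decision_by_cluster: dict[str, str] = dict(prior_decisions or {})
        self.cluster_by_doc: dict[str, str] = {}

    def evaluate(self, cluster_id: str, decision: str) -> None:
        self.decision_by_cluster[cluster_id] = decision

    def add_document(self, doc_id: str, cluster_id: str) -> None:
        self.cluster_by_doc[doc_id] = cluster_id

    def decision_for(self, doc_id: str) -> Optional[str]:
        # Persistence: a document inherits its cluster's decision even if it
        # was added to the cluster after the decision was made.
        cluster_id = self.cluster_by_doc.get(doc_id)
        return self.decision_by_cluster.get(cluster_id)
```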

Document Type Evaluation

Document-type labels can be applied to clusters that are being retained. The Red File visual classification platform provides a user-definable three-level document-type taxonomy or tree for this purpose.

Typically clients use the first level for business unit or function, the second level for document type, and the third level for sub-type. Multiple clusters may have the same document-type label, e.g., there may be several clusters that are labeled as “Invoices.”

The labeling decisions are persistent and are applied to documents that are subsequently added to the clusters that use that label.
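A three-level taxonomy of this kind can be represented as a nested mapping. The sketch below is illustrative only: the taxonomy entries, cluster identifiers, and function names are hypothetical, and the platform’s actual data model is not described here.

```python
# Hypothetical three-level taxonomy: function -> document type -> sub-types.
TAXONOMY = {
    "Finance": {"Invoices": {"Vendor", "Customer"}, "Statements": {"Bank"}},
    "Human Resources": {"Applications": {"Employment"}},
}

# Labels are stored per cluster, so documents added later inherit them.
labels_by_cluster: dict[str, tuple[str, str, str]] = {}


def label_cluster(cluster_id: str, function: str, doc_type: str, sub_type: str) -> None:
    """Attach a label to a cluster, rejecting labels outside the taxonomy."""
    if sub_type not in TAXONOMY.get(function, {}).get(doc_type, set()):
        raise ValueError(f"Not in taxonomy: {function} / {doc_type} / {sub_type}")
    labels_by_cluster[cluster_id] = (function, doc_type, sub_type)


def clusters_with_doc_type(doc_type: str) -> list[str]:
    """Several clusters may share one document-type label."""
    return sorted(
        cid for cid, label in labels_by_cluster.items() if label[1] == doc_type
    )
```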

The “Differentiator” of Visual Classification Powered Search

Because of the unique way that visual classification technology indexes the glyphs or graphical elements on each page of each document, it can provide fixed and relative positional search operators for finding documents. Fixed positional searches look for items at the specified coordinates. Relative operators look for items within a set of coordinates identified relative to other search specifications. Searches can specify ranges for dates, numbers, and text values, and complex searches can be built using Reverse Polish Notation (“RPN”) search logic. Find is very useful when trying to differentiate among documents within a cluster or for identifying documents within the entire collection.
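To make positional operators and RPN logic concrete, the sketch below evaluates a postfix query against one page of positioned words. It is an illustration under stated assumptions, not the platform’s query syntax: the page representation and the in_box, right_of, and evaluate_rpn names are hypothetical.

```python
from typing import Callable

Page = list[dict]  # each item: {"text": str, "x": int, "y": int}
Predicate = Callable[[Page], bool]


def in_box(text: str, x0: int, y0: int, x1: int, y1: int) -> Predicate:
    """Fixed positional search: `text` must appear inside the given box."""
    def predicate(page: Page) -> bool:
        return any(
            g["text"] == text and x0 <= g["x"] <= x1 and y0 <= g["y"] <= y1
            for g in page
        )
    return predicate


def right_of(anchor: str, target: str, max_dx: int, line_tol: int = 5) -> Predicate:
    """Relative positional search: `target` sits to the right of `anchor`
    on roughly the same line, no more than `max_dx` units away."""
    def predicate(page: Page) -> bool:
        return any(
            a["text"] == anchor and t["text"] == target
            and 0 < t["x"] - a["x"] <= max_dx
            and abs(t["y"] - a["y"]) <= line_tol
            for a in page for t in page
        )
    return predicate


def evaluate_rpn(query: list, page: Page) -> bool:
    """Evaluate a postfix query: operands are predicates, operators are strings."""
    stack: list[bool] = []
    for token in query:
        if token == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a and b)
        elif token == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a or b)
        elif token == "NOT":
            stack.append(not stack.pop())
        else:
            stack.append(token(page))
    if len(stack) != 1:
        raise ValueError("Malformed RPN query")
    return stack[0]
```

In postfix form the operator follows its operands, so a query such as [in_box("Invoice", 0, 0, 200, 100), right_of("Total", "125.00", 150), "AND"] needs no parentheses.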

Document Attribute Evaluation

Attribution involves extracting specific data elements from specific document types, e.g., pulling the API well number from a well log or a borrower’s social security number from a loan application. Because of the visual similarity of documents in the same clusters, extracting attributes can be as simple as clicking and dragging to designate what data elements to extract.

Red File’s visual classification platform provides numerous delimiters that help specify what data elements to use to help identify the data elements of interest and filters to provide a variety of ways to format the extracted information.
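A minimal sketch of zone-based attribute extraction, assuming each page is available as a list of words with coordinates and that a zone has been designated once for the whole cluster. The function names, the zone format, and the sample well number are hypothetical; the platform’s actual delimiters and filters are not specified here.

```python
import re


def extract_zone_text(words: list[dict], zone: tuple[int, int, int, int]) -> str:
    """Return the text of all words inside `zone` (x0, y0, x1, y1), left to right."""
    x0, y0, x1, y1 = zone
    inside = [w for w in words if x0 <= w["x"] <= x1 and y0 <= w["y"] <= y1]
    return " ".join(w["text"] for w in sorted(inside, key=lambda w: w["x"]))


def digits_only(value: str) -> str:
    """Example formatting filter: strip everything except digits."""
    return re.sub(r"\D", "", value)


if __name__ == "__main__":
    # Hypothetical page from a well-log cluster; the zone marks the API number.
    page_words = [
        {"text": "API", "x": 40, "y": 100},
        {"text": "42-501-20130", "x": 120, "y": 100},
        {"text": "Operator", "x": 40, "y": 140},
    ]
    api_zone = (100, 90, 300, 110)
    print(digits_only(extract_zone_text(page_words, api_zone)))
```

Because documents in a cluster look alike, the same zone can be applied to every page in that cluster.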

Where documents are of such poor quality that automated attribution does not yield satisfactory results, visual classification technology can be used to have data entry specialists key those portions of the documents that contain the desired attributes. In doing this, the portions of the document image being entered are split off from or disassociated from the rest of the document image so the person doing the entry does not see or know the other information on the page or document. For example, the person keying the loan applicant’s name would never see the part of the document that has the address or social security number.

Once the evaluation phase is complete, the next step in the Red File approach to classification is the management phase.

Step 4: Manage

The management phase takes collected, processed, and evaluated data and prepares it for use in a client-specified content repository or for use in Red File’s hosted data repository.

  • Files Collected: 100%
  • Files Typically Exported after Complete Approach: 14%

Red File can export documents, document-type labels, and attributed data in any format desired by the client for their content management system, including PDF, PDF with text, TIF, TIF with text, CSV, PDF with internal metadata, etc. Red File can also build load files to virtually any specification.

As part of the final export process, Red File can “redupe” or repopulate duplicates to meet unique requirements of the client. For example, in eDiscovery cases, an agreement with opposing counsel may require that documents not be deduped. In that case, Red File can recreate the original file sets of all the duplicates of the documents being produced.
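Reduping is possible because the collection log records the location of every copy, not just the single retained instance. The sketch below assumes log entries shaped as dictionaries with "path" and "sha256" keys; that shape and the function name are hypothetical.

```python
def redupe(produced: dict[str, bytes], log: list[dict]) -> dict[str, bytes]:
    """Recreate every original copy of the produced documents.

    `produced` maps a content hash to the single retained instance;
    `log` lists every file examined during collection and processing.
    """
    restored: dict[str, bytes] = {}
    for entry in log:
        digest = entry["sha256"]
        if digest in produced:
            # Each logged location gets its own copy of the retained content.
            restored[entry["path"]] = produced[digest]
    return restored
```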

Data Export – Visual Deduping

Many times there are several versions of a document that are visually indistinguishable from one another. For example, if a Word document is also saved as a PDF, and then one of those copies is printed and ultimately scanned to a TIF format, someone looking at the three versions could not identify any differences among the Word, PDF, and TIF versions. These are “visual duplicates.” Normal duplicate detection using MD5 or SHA hash values will not identify these as duplicates because they are in different file formats and will not be bit-for-bit duplicates.

Red File can add another layer of duplicate content detection using visual deduping to identify unique content. Clients can elect which visual duplicate or duplicates to retain in their ECM system.
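The sketch below shows why hash-based deduping misses visual duplicates and how grouping on a visual signature would catch them. The "visual_signature" field is a hypothetical stand-in for whatever the visual classification step produces; how such a signature is computed is not described here, and the sample bytes are placeholders rather than real file content.

```python
import hashlib
from collections import defaultdict


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def group_visual_duplicates(files: list[dict]) -> dict[str, list[str]]:
    """Group file paths that share a (hypothetical) visual signature."""
    groups: dict[str, list[str]] = defaultdict(list)
    for f in files:
        groups[f["visual_signature"]].append(f["path"])
    # Only signatures shared by more than one file are visual duplicates.
    return {sig: paths for sig, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # Placeholder bytes: the same memo stored in two different file formats.
    as_pdf = b"%PDF-1.4 memo"
    as_tif = b"II*\x00 memo"
    print(sha256_of(as_pdf) == sha256_of(as_tif))  # different bytes, different hash
    files = [
        {"path": "memo.pdf", "visual_signature": "sig-memo"},
        {"path": "memo.tif", "visual_signature": "sig-memo"},
        {"path": "invoice.pdf", "visual_signature": "sig-invoice"},
    ]
    print(group_visual_duplicates(files))
```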

Data Export – Redaction

Red File can redact PII or other sensitive data using two complementary approaches. The first is based on identifying text strings that represent PII such as social security numbers or email addresses. Because of the way Red File processes documents, it knows the page coordinates of each glyph or graphical element, and can place redactions that are very precise in obliterating the redacted data without compromising adjacent non-redacted information.
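The first approach can be sketched as pattern matching over words whose page coordinates are known. The patterns, the word representation, and the function name below are assumptions for illustration; production PII detection would need broader patterns than these two.

```python
import re

# Simplified patterns, assumed for illustration only.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1) in page coordinates


def find_redaction_boxes(words: list[dict]) -> list[Box]:
    """Return the bounding box of every word that matches a PII pattern.

    Each word is {"text": str, "box": Box}; only matching boxes are
    returned, so adjacent non-sensitive text is left untouched.
    """
    return [
        w["box"]
        for w in words
        if SSN_PATTERN.fullmatch(w["text"]) or EMAIL_PATTERN.fullmatch(w["text"])
    ]
```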

The second approach is to redact zone coordinates on pages within clusters. For example, in the cluster for IRS Form 1099, the whole block in which social security numbers are supposed to be written could be redacted even if the entry on some forms was handwritten. Redactions performed with visual classification technology from Red File can scale to over 700,000 redactions per hour per server.
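Zone redaction can be sketched on a page modeled as a grid of grayscale pixels. The page representation, the zone format, and the function names are hypothetical; the point is that one zone defined for a cluster is applied to every page in it, regardless of what is written inside the zone.

```python
Zone = tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def redact_zone(page: list[list[int]], zone: Zone) -> list[list[int]]:
    """Return a copy of `page` with every pixel inside `zone` set to black (0)."""
    x0, y0, x1, y1 = zone
    return [
        [0 if (x0 <= x < x1 and y0 <= y < y1) else pixel
         for x, pixel in enumerate(row)]
        for y, row in enumerate(page)
    ]


def redact_cluster(pages: dict[str, list[list[int]]], zone: Zone) -> dict[str, list[list[int]]]:
    """Apply the same zone to every page in a cluster."""
    return {doc_id: redact_zone(page, zone) for doc_id, page in pages.items()}
```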

Example Redaction Based on Text Coordinates


Example Redaction Based on Zone Coordinates

The “Differentiator” of Visual Classification Powered Redaction

Very few document collections have documents with 100% accurate text. Red File’s ability to analyze and group documents based on visual appearance provides far greater assurance that all the terms that need to be redacted are in fact redacted.

Contact us today to learn more.

Red File Solutions is the world’s foremost provider of technology consulting and services for the implementation, management, and hosting of visual classification technology in corporate and governmental information governance pilots, projects, and programs.