I think there could be some form of standardized format, and I suggest the following:
Original = a .pdf file published in its original digital form (selectable text).
Scan = a scanned image, then compiled into a .pdf file (non-selectable text).
OCR = a scanned image, compiled into a .pdf file, then...