On 30 October 2017 at 13:07, Chris Olson <chris_e_olson@yahoo.com> wrote:
We have been fortunate to hang onto one of our summer interns for part-time work on weekends during the current school year. One of the intern's jobs is to load documents and data which are then processed. The documents are .txt, .docx, and .pdf files. The data files are raw sensor outputs, usually captured using ADCs, mostly with eight-bit precision. All files are loaded or moved from one machine to another with sftp.
The intern noticed right away that the documents transfer perfectly from our PPC and SPARC machines to our Intel/CentOS platforms. The raw data files, not so much. There is always an endian (thanks, Gulliver) issue, which we assume is due to the bytes of data being packed into 32-bit words somewhere on the big-endian systems. It is not totally clear why the document files do not have this issue. If there is a known principle behind these observations, we would very much appreciate any information that can be shared.
Text files that are ASCII use seven-bit characters stored one per byte, so there is no multi-byte unit whose byte order could differ, and they don't have endian problems on any byte-addressable architecture. [I expect a 4-bit architecture would have problems.] Wider Unicode encodings such as UTF-16 can have endianness problems, but those usually come from not following the standard (ignoring the byte-order mark, for example) and assuming that writing the data works the same as it did with ASCII.
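A minimal C sketch of the difference (the values are arbitrary, just for illustration): the ASCII bytes land in the file in one fixed sequence on every host, while the in-memory bytes of a 32-bit word depend on the host's endianness:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const char *text = "AB";          /* ASCII: bytes 0x41 0x42 on every host */
    uint32_t word = 0x41424344;       /* one 32-bit value */
    const unsigned char *p = (const unsigned char *)&word;

    printf("text bytes: %02x %02x\n",
           (unsigned char)text[0], (unsigned char)text[1]);
    printf("word bytes: %02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
    /* Intel prints 44 43 42 41; PPC/SPARC print 41 42 43 44 */
    return 0;
}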
.docx and .pdf are written to formats that specify a fixed byte order, so even when a file is produced on a big-endian system, the writer emits the bytes in the order the format defines (little endian in the zip container that .docx uses). Raw data files are usually endian-dependent when they are 'raw' memory dumps or close to it. Some mostly-raw data formats do work across platforms because they are written to a standard: both the little-endian and the big-endian side agree to write the data as 'big' or 'little' endian and to read it back the same way.
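To make that concrete, here is a minimal C sketch of writing to an agreed byte order; put_be32/get_be32 and samples.bin are just illustrative names, and I am assuming 32-bit words like the ones in your capture path:

#include <stdio.h>
#include <stdint.h>

/* Write one 32-bit word most-significant byte first (big endian on disk). */
static void put_be32(FILE *f, uint32_t v)
{
    fputc((v >> 24) & 0xff, f);
    fputc((v >> 16) & 0xff, f);
    fputc((v >>  8) & 0xff, f);
    fputc( v        & 0xff, f);
}

/* Read it back the same way; this is identical on any host. */
static uint32_t get_be32(FILE *f)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | (uint32_t)fgetc(f);
    return v;
}

int main(void)
{
    uint32_t samples[3] = { 0x01020304, 0x05060708, 0x090a0b0c };
    FILE *f = fopen("samples.bin", "wb");

    for (int i = 0; i < 3; i++)
        put_be32(f, samples[i]);       /* explicit byte order: portable */
    /* fwrite(samples, sizeof samples[0], 3, f); would dump host order */
    fclose(f);

    f = fopen("samples.bin", "rb");
    for (int i = 0; i < 3; i++)
        printf("%08x\n", get_be32(f)); /* same values on Intel and SPARC */
    fclose(f);
    return 0;
}

The fwrite() line in the comment is exactly the kind of 'raw' dump that produces the swapped bytes you are seeing when the file moves between PPC/SPARC and Intel.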
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos