INTERNET-DRAFT                                                 D. Endler
                                                 Dalen Knowledge Systems
                                                                May-2013


       The ARK Data Format for Archiving in an RC4 Formatted File


Contents

   1. Introduction
   2. Licensing
   3. ARK Format Description
   4. Extension Identifiers for the Archive Header
   5. Extension Identifiers for the Entry Header
   6. Contents of the RC4 File Fields
   7. The Source Folder of the Archive
   8. Structure of Folders
   9. References


1 - INTRODUCTION
----------------

This specification describes the ARK data format, which is used to
archive a group of folders and files to be encrypted into the file_data
field of an RC4 Formatted File.

Using this format, a structure of folders containing several files can
be encrypted into a single RC4 formatted file. The information about the
encrypted files (such as the original file name, file size, data hash
and creation date) is also safeguarded, because it is encrypted together
with the data.

The ARK specification also supports file compression, using the ZLIB
Compressed Data Format Specification described in [RFC 1950].

To store encrypted ARK formatted data in the file_data field of an RC4
Formatted File, the following extension identifiers must have the
corresponding values:

   FILE-CONTENT   = "ENCRYPTED ARK"
   FILE-MIME-TYPE = "application/x-ark"      (for archive-only ARK format)
   FILE-MIME-TYPE = "application/x-ark-zlib" (for compressed ARK format)


2 - LICENSING
-------------

Copyright (C) 2013 Dalen Knowledge Systems.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license can be found at .


3 - ARK FORMAT DESCRIPTION
--------------------------

The ARK format is described below using the augmented Backus-Naur Form
(BNF), as described in section 2.1 of [RFC 2616]. The ARK format
complies with the following rules:

   file_data                       = Cipher(ark_data)
   ark_data                        = ark_header 1*ark_entry ark_end
   ark_header                      = ark_header_signature
                                     *ark_extension_identifier
                                     end_of_ark_extension_identifier
   ark_header_signature            = "ARK_FILE"
   ark_entry                       = entry_header [entry_data]
                                     entry_checksum
   entry_header                    = entry_header_signature
                                     *ark_extension_identifier
                                     end_of_ark_extension_identifier
   entry_header_signature          = "ARKENTRY"
   entry_data                      = (data_file | Compress(data_file))
   data_file                       = 1*(2^64)OCTET
   entry_checksum                  = (entry_sha256_checksum |
                                     32<OCTET 0x00>)
   entry_sha256_checksum           = 32OCTET
   ark_end                         = ark_end_signature
                                     ark_sha256_checksum
   ark_end_signature               = "ENDOFARK"
   ark_sha256_checksum             = 32OCTET
   ark_extension_identifier        = id_size id_content
   id_size                         = SHORTINT
   id_content                      = id_name ["=" id_value]
                                     1*<OCTET 0x00>
   id_name                         = 1*TEXT
   id_value                        = 1*TEXT
   end_of_ark_extension_identifier = 2<OCTET 0x00>
   SHORTINT                        = <2OCTET representing the integer
                                     value (1st OCTET) + 256 *
                                     (2nd OCTET)>
   OCTET                           =
   TEXT                            =

Each field of the archive is described below:

   file_data
      This is the file data field, as described in the RC4 File Format
      Specification (see [RC4 File Format]).

   Cipher()
      This is the function that encrypts the plaintext data into the
      ciphertext.

   ark_data
      This is the ARK formatted data.
      It is composed of one archive header, which contains the general
      parameters of the archive; one or more entries, which contain the
      data of folders and files; and a trailing end field, which
      contains the checksum used to validate the integrity of the
      archive.

   ark_header
      The header of the archive contains general parameters for the
      archive. It is composed of the header signature and zero or more
      extension identifiers.

   ark_header_signature
      The sequence of 8 OCTETs 0x41 0x52 0x4B 0x5F 0x46 0x49 0x4C 0x45,
      which spells the text "ARK_FILE". When the archive is decrypted,
      if the ark_data field does not start with this sequence, it is
      very likely that the decryption key is incorrect, and the
      decryption process must be aborted.

   ark_entry
      This is an entry, which can contain either a folder definition or
      the data of a file. It is composed of a header, the entry data and
      the SHA-256 checksum calculated from the file data.

   entry_header
      The header of the entry, which contains information about the
      entry. It is composed of the entry header signature and zero or
      more extension identifiers.

   entry_header_signature
      The sequence of 8 OCTETs 0x41 0x52 0x4B 0x45 0x4E 0x54 0x52 0x59,
      which spells the text "ARKENTRY". If a software tool finds the
      ark_end_signature value instead, there are no more entries to be
      read from the archive. If this field contains neither of these
      sequences, the archive must be considered corrupted and the
      decryption process must be aborted.

   entry_data
      If the entry is a file, this field contains the data of the file.
      If the entry is a directory, this field is not present. If the
      ARK format is archive-only (MIME-TYPE="application/x-ark"), this
      field must contain the file data itself. If the ARK format is a
      compressed archive (MIME-TYPE="application/x-ark-zlib"), this
      field must contain the file data compressed using the ZLIB
      Compressed Data Format, as specified in [RFC 1950].
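
As a non-normative illustration of the signature checks and the
entry_data rule described above, the following Python sketch shows one
way a software tool might classify the 8 OCTETs read where an entry is
expected, and build entry_data for the two MIME types. The names
classify_signature and build_entry_data are assumptions made for this
example and are not defined by the ARK format. The sketch relies on
zlib.compress from the Python standard library, which emits the
[RFC 1950] format.

```python
import zlib

# Signatures as given by the OCTET sequences in the field descriptions.
ARK_HEADER_SIGNATURE = bytes([0x41, 0x52, 0x4B, 0x5F, 0x46, 0x49, 0x4C, 0x45])
ENTRY_HEADER_SIGNATURE = bytes([0x41, 0x52, 0x4B, 0x45, 0x4E, 0x54, 0x52, 0x59])
ARK_END_SIGNATURE = bytes([0x45, 0x4E, 0x44, 0x4F, 0x46, 0x41, 0x52, 0x4B])

MIME_ARK = "application/x-ark"
MIME_ARK_ZLIB = "application/x-ark-zlib"


def classify_signature(signature: bytes) -> str:
    """Classify the 8 OCTETs found where an entry header may start.

    Hypothetical helper: returns "entry" or "end", and raises ValueError
    when the archive must be considered corrupted.
    """
    if signature == ENTRY_HEADER_SIGNATURE:
        return "entry"
    if signature == ARK_END_SIGNATURE:
        return "end"
    raise ValueError("corrupted archive: unexpected signature")


def build_entry_data(data_file: bytes, mime_type: str) -> bytes:
    """Return entry_data for a file entry (hypothetical helper)."""
    if mime_type == MIME_ARK:
        # Archive-only format: entry_data is the file data itself.
        return data_file
    if mime_type == MIME_ARK_ZLIB:
        # Compressed format: entry_data is the ZLIB stream of the file data.
        return zlib.compress(data_file)
    raise ValueError("unknown ARK MIME type: " + mime_type)
```

A reader would reverse the second case with zlib.decompress before
handing the file data to the caller.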
   Compress()
      This is the function that performs the ZLIB data compression.

   data_file
      The binary content of the file. For the archive-only ARK format,
      this field is the entry_data field itself.

   entry_checksum
      For files, this field contains the SHA-256 checksum of the data
      in the data_file field. For folders, this field must be filled
      with a sequence of 32 OCTETs 0x00.

   entry_sha256_checksum
      This field contains the SHA-256 hash of the contents of the
      data_file field of the entry. The SHA-256 hash must be calculated
      using the SHA-256 algorithm of the US Secure Hash Algorithms
      described in [RFC 6234] (note that [RFC 3174] describes only
      SHA-1). A software tool must use this value to verify the
      integrity of each decrypted file, by comparing it to the SHA-256
      hash calculated from the file after decryption. If this field
      does not match the calculated SHA-256, the software tool must
      issue a warning, but it must continue with the next entry of the
      archive. Note that if the file is compressed, the SHA-256 checksum
      must be calculated from the original file data, not from the
      compressed data.

   ark_end
      This field indicates that there are no more entries in the
      archive. It is composed of the end signature and the SHA-256
      checksum of the archive.

   ark_end_signature
      The sequence of 8 OCTETs 0x45 0x4E 0x44 0x4F 0x46 0x41 0x52 0x4B,
      which spells the text "ENDOFARK". When trying to get the next
      entry, the software tool must check the signature. If it obtains
      this sequence, there are no more entries to be read.

   ark_sha256_checksum
      This field contains the SHA-256 hash of the contents of the
      ark_header field and of all the ark_entry fields. It does not
      cover the ark_end field. For the compressed ARK format, this
      checksum must be calculated using the compressed data, not the
      original data.

   ark_extension_identifier
      A "name" and "value" pair that provides information about the
      archive or the entry. This sequence can be repeated as many times
      as needed.
      It is composed of the id_size and the id_content. See the
      sections "EXTENSION IDENTIFIERS FOR THE ARCHIVE HEADER" and
      "EXTENSION IDENTIFIERS FOR THE ENTRY HEADER" for the extension
      identifiers of the archive header and of the entry header,
      respectively.

   id_size
      The number of OCTETs of the corresponding id_content field
      (including the ending OCTETs 0x00). It has 2 OCTETs representing
      an integer value, in which the least significant byte (LSB) comes
      first.

   id_content
      The extension identifier content, consisting of the id_name and
      the optional corresponding id_value (both in TEXT format). Binary
      content for the identifier value must be encoded to TEXT using
      some text data encoding.

   end_of_ark_extension_identifier
      This field indicates that there are no further extension
      identifiers. A software tool must keep reading extension
      identifiers until it finds one whose id_size equals 0, which is
      the end_of_ark_extension_identifier field.


4 - EXTENSION IDENTIFIERS FOR THE ARCHIVE HEADER
------------------------------------------------

The following extension identifiers apply to the archive header:

   ARCHIVE-SIZE
      Contains the sum of the sizes of all entries within the archive,
      that is, the sum of the data_file field sizes of every entry of
      the archive. It is smaller than the total size of the archive
      because it includes neither the size of the headers nor the
      sha256_checksum fields. It must be in the format:

         1NOZERODIGIT *DIGIT
         NOZERODIGIT = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
         DIGIT       = '0' | NOZERODIGIT

      (e.g. "ARCHIVE-SIZE=10234").

Other implementation-specific extension identifiers are allowed.
Software tools must ignore unknown extension identifiers in the archive
header.


5 - EXTENSION IDENTIFIERS FOR THE ENTRY HEADER
----------------------------------------------

The following extension identifiers apply to each archive entry header:

   ENTRY-TYPE
      Defines the type of the entry.
      The value DIRECTORY indicates that the entry is a folder. The
      value FILE indicates that the entry is a file
      (e.g. "ENTRY-TYPE=FILE").

   ENTRY-NAME
      Contains the name of the folder or file. The path separator
      character is the slash character (/). If the entry is a folder,
      the name should contain the path relative to the source folder of
      the archive and must end with the path separator character. If
      the entry is a file, the name must be preceded by the path
      starting from the source folder of the archive. The format of
      this identifier is as follows:

         (folder_name | file_name)
         folder_name = 1*(1*TEXT '/')
                       ; (e.g. "ENTRY-NAME=src/trunk/")
         file_name   = 0*(1*TEXT '/') 1*TEXT
                       ; (e.g. "ENTRY-NAME=src/trunk/file.txt")

   ENTRY-SIZE
      Contains the number of OCTETs of the data_file field of the
      entry. It represents the original size of the file. This
      identifier does not apply to entries of the folder type. It must
      be in the format:

         1NOZERODIGIT *DIGIT
         NOZERODIGIT = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
         DIGIT       = '0' | NOZERODIGIT

      (e.g. "ENTRY-SIZE=10234").

   ENTRY-COMPRESSED-SIZE
      This identifier applies only to compressed entries. It contains
      the size of the entry_data field, that is, the size of the
      resulting compressed file. It must be in the format:

         1NOZERODIGIT *DIGIT
         NOZERODIGIT = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
         DIGIT       = '0' | NOZERODIGIT

      (e.g. "ENTRY-COMPRESSED-SIZE=6891"). Note that this implies that
      the compressed file size must be known before writing the
      entry_data to the archive.

   ENTRY-MDATE
      This indicates the modification date and time of the entry. The
      time must be expressed in local time, and the value must be in
      the format:

         year '-' month '-' day 'T' hour ':' minute ':' second
         year   = NOZERODIGIT 3DIGIT
         month  = <2DIGIT, from '01' to '12'>
         day    = <2DIGIT, from '01' to '31'.
                  The values '29', '30' and '31' must be in accordance
                  with the corresponding month.>
         hour   = <2DIGIT, from '00' to '23'>
         minute = <2DIGIT, from '00' to '59'>
         second = <2DIGIT, from '00' to '60'. The value '60' is used
                  only in the case of a leap second.>

      (e.g. "ENTRY-MDATE=2012-08-20T21:15:04"). Refer to [RFC 3339] for
      date and time formats. If this identifier is not present, the
      modification date and time of the entry shall be ignored.

   ENTRY-MIME-TYPE
      This extension identifier applies only to file type entries. It
      indicates the format of the file data. The value must be in the
      Multipurpose Internet Mail Extensions (MIME) format, as described
      in [RFC 2045], [RFC 2046] and [RFC 2047]. If this identifier is
      not present, the value "application/octet-stream" must be
      assumed.

Other implementation-specific extension identifiers are allowed.
Software tools must ignore unknown extension identifiers in the entry
header.


6 - CONTENTS OF THE RC4 FILE FIELDS
-----------------------------------

When using ARK formatted data, the following rules must be followed for
the contents of the RC4 formatted file:

   extension_identifier
      The reserved extension identifier "FILE-CONTENT" of the RC4
      format must contain the value "ENCRYPTED ARK".

      The reserved extension identifier "FILE-MIME-TYPE" of the RC4
      format must contain either the value "application/x-ark" for the
      archive-only ARK format, or the value "application/x-ark-zlib"
      for the compressed ARK format.

      The reserved extension identifier "FILE-NAME" of the RC4 format
      is no longer used, and shall not be included. The extension
      identifier ENTRY-NAME of each entry is used instead.

   data_size
      The RC4 format data_size field is no longer used. It must contain
      the value zero. The extension identifiers ARCHIVE-SIZE and
      ENTRY-SIZE must be used to control the archive and file sizes.

   sha256_checksum
      The RC4 format sha256_checksum field is no longer used, and must
      be filled with a sequence of 32 OCTETs 0x00.
      The corresponding ark_sha256_checksum field of the archive and
      the entry_sha256_checksum field of each entry are used to
      guarantee the integrity of the files.

All the other RC4 file format fields follow the original rules
described in the RC4 File Format Specification.


7 - THE SOURCE FOLDER OF THE ARCHIVE
------------------------------------

An archive contains the files and folders of one specific folder. This
folder, from which the files and folders are taken, is called the
"source folder of the archive". All the folders and files stored in an
ARK archive are referenced by their path relative to this source folder
(including the source folder name).

For instance, if one creates an ARK formatted file from the absolute
path "/usr/local/share/localfiles", the relative path would be
"localfiles" and every entry of the archive shall be referenced from
this path (e.g. "localfiles/file1.txt", "localfiles/folder1/file2.txt").


8 - STRUCTURE OF FOLDERS
------------------------

If an archive contains a structure of folders, each one containing its
own set of files, then the entries of the archive must follow this
order:

   - The folders in hierarchical order;
   - The files only after the corresponding folder.
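
As a non-normative illustration of the two ordering rules above, the
following Python sketch checks a sequence of ENTRY-NAME values and
reports every entry that appears before the entry of its parent folder.
The function name ordering_errors is an assumption made for this
example. The sketch also assumes that every folder of the structure has
its own entry in the archive and that folder names end with '/', as
required for ENTRY-NAME.

```python
from typing import Iterable, List


def ordering_errors(entry_names: Iterable[str]) -> List[str]:
    """Return the entry names that appear before their parent folder.

    Hypothetical helper: names ending with '/' are treated as folders,
    all other names as files.
    """
    seen_folders = set()
    errors = []
    for name in entry_names:
        is_folder = name.endswith("/")
        # Drop the trailing '/' of a folder before looking for its parent.
        trimmed = name[:-1] if is_folder else name
        # Parent path, including its trailing '/'; empty for top-level entries.
        parent = trimmed[: trimmed.rfind("/") + 1]
        if parent and parent not in seen_folders:
            errors.append(name)
        if is_folder:
            seen_folders.add(name)
    return errors


in_order = [
    "localfiles/",
    "localfiles/File1.txt",
    "localfiles/folder1/",
    "localfiles/folder1/File3.txt",
]
out_of_order = [
    "localfiles/folder1/",
    "localfiles/folder1/File3.txt",
    "localfiles/",
    "localfiles/File1.txt",
]
print(ordering_errors(in_order))
print(ordering_errors(out_of_order))
```

The first call reports no entries, while the second reports
"localfiles/folder1/", because that folder is listed before its parent
"localfiles/".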
For example, suppose the following structure of folders and files is to
be archived:

   Folder:   localfiles
                 +
   File:         |- File1.txt
   File:         |- File2.txt
                 |
   Folder:       +- folder1
                       +
   File:         |     |- File3.txt
   File:         |     +- File4.txt
                 |
   Folder:       +- folder2
                       +
   File:               |- File5.txt
   File:               +- File6.txt

The sequence stored in the ARK archive can be the following:

   Folder: localfiles/
   File:   localfiles/File1.txt
   File:   localfiles/File2.txt
   Folder: localfiles/folder1/
   File:   localfiles/folder1/File3.txt
   Folder: localfiles/folder2/
   File:   localfiles/folder2/File5.txt
   File:   localfiles/folder2/File6.txt
   File:   localfiles/folder1/File4.txt

but NOT the following:

   Folder: localfiles/folder1/
   File:   localfiles/folder1/File3.txt
   Folder: localfiles/
   File:   localfiles/File1.txt
   File:   localfiles/File2.txt
   Folder: localfiles/folder2/
   File:   localfiles/folder2/File5.txt
   File:   localfiles/folder2/File6.txt
   File:   localfiles/folder1/File4.txt

Note that, in this case, the folder "localfiles/folder1/" is archived
before the folder "localfiles/", i.e., the folders are not in
hierarchical order.


9 - REFERENCES
--------------

[RFC 1950]  Deutsch, P., Gailly, J-L., "ZLIB Compressed Data Format
            Specification version 3.3", RFC 1950, May 1996.

[RFC 2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
            Masinter, L., Leach, P., Berners-Lee, T., "Hypertext
            Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[RFC 3174]  Eastlake 3rd, D., Jones, P., "US Secure Hash Algorithm 1
            (SHA1)", RFC 3174, September 2001.

[RFC 3339]  Klyne, G., Newman, C., "Date and Time on the Internet:
            Timestamps", RFC 3339, July 2002.

[RFC 6234]  Eastlake 3rd, D., Hansen, T., "US Secure Hash Algorithms
            (SHA and SHA-based HMAC and HKDF)", RFC 6234, May 2011.

[RC4 File Format]
            Endler, D., "RC4 File Format Specification - AL-KINDI
            Implementation Version 1", October 2012.
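
As a non-normative illustration of the extension identifier layout and
of the entry checksum described in section 3, the following Python
sketch encodes and decodes a list of extension identifiers and computes
an entry_checksum. The function names are assumptions made for this
example. The sketch also assumes that TEXT is encoded as ASCII and that
a writer emits a single terminating OCTET 0x00, which is counted in
id_size as stated in the id_size description.

```python
import hashlib
import struct
from typing import List, Optional, Tuple

# An id_size of 0 marks the end of the extension identifiers.
END_OF_EXTENSION_IDENTIFIERS = struct.pack("<H", 0)


def encode_identifier(name: str, value: str = "") -> bytes:
    """Encode one identifier as id_size followed by id_content."""
    content = name if not value else name + "=" + value
    # Assumption: ASCII text followed by one terminating OCTET 0x00.
    id_content = content.encode("ascii") + b"\x00"
    # SHORTINT: (1st OCTET) + 256 * (2nd OCTET), i.e. little-endian.
    return struct.pack("<H", len(id_content)) + id_content


def decode_identifiers(buf: bytes, offset: int = 0) -> Tuple[List[Tuple[str, str]], int]:
    """Read identifiers until id_size == 0; return the pairs and next offset."""
    pairs = []
    while True:
        (id_size,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        if id_size == 0:
            return pairs, offset
        raw = buf[offset:offset + id_size].rstrip(b"\x00")
        offset += id_size
        name, _, value = raw.decode("ascii").partition("=")
        pairs.append((name, value))


def entry_checksum(data_file: Optional[bytes]) -> bytes:
    """SHA-256 of the original file data, or 32 OCTETs 0x00 for a folder."""
    if data_file is None:
        return bytes(32)
    return hashlib.sha256(data_file).digest()


header = (
    encode_identifier("ENTRY-TYPE", "FILE")
    + encode_identifier("ENTRY-NAME", "src/trunk/file.txt")
    + END_OF_EXTENSION_IDENTIFIERS
)
pairs, next_offset = decode_identifiers(header)
```

Here pairs holds the two name and value pairs in the order they were
written, and next_offset points at the first OCTET after the
end_of_ark_extension_identifier field, where entry_data would start for
a file entry.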