Basics
Some basic topics such as number systems, character sets and encodings are covered here.
Contents
- Number systems
- Data sizes
- Endianness / Byte order
- Character sets and character encodings
- Encoding
- Date formats and time formats
- Slack
Number systems
Decimal system
The decimal system is the standard system for representing numbers. Its base is 10, and it uses the digits 0 through 9 to represent numbers.
Binary system
The binary system, or dual system, is primarily used in electronic data processing (base 2). The two possible digits in a binary value are 0 and 1. By stringing the digits 0 and 1 together, any number can be expressed:
Number | Binary value |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
12 | 1100 |
64 | 1000000 |
128 | 10000000 |
255 | 11111111 |
Octal system
The octal system uses only the digits 0 through 7 to represent numbers (base 8). The decimal number 8 thus results in the octal value 10. This system is primarily used for file permissions in Linux-based operating systems and FTP programs.
Hexadecimal system
In the hexadecimal system, the digits 0 to 9 and the letters A to F are used to represent numbers (base 16). "A" corresponds to the number 10 and "F" to the number 15. Thus, numbers from 0 to 255, which is the range of one byte, can be expressed with a two-digit value, from "00" to "FF". Since a hexadecimal number cannot always be distinguished from a decimal number, it is marked by a leading "0x" or a trailing "h", for example 0x47 or 47h, and 0x00 or 0h.
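The conversions between the four number systems can be tried out with a few lines of Python. This is only a small sketch; the helper name `to_bases` is chosen for illustration and is not part of any library.

```python
def to_bases(number: int) -> dict:
    """Return the binary, octal and hexadecimal notation of a decimal number."""
    return {
        "binary": format(number, "b"),
        "octal": format(number, "o"),
        "hexadecimal": format(number, "X"),
    }

print(to_bases(12))   # {'binary': '1100', 'octal': '14', 'hexadecimal': 'C'}
print(to_bases(255))  # {'binary': '11111111', 'octal': '377', 'hexadecimal': 'FF'}

# int() with an explicit base converts a string back into a decimal number.
print(int("1100", 2))   # 12
print(int("10", 8))     # 8
print(int("0x47", 16))  # 71
```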
Data sizes
Bit
A bit (binary digit) is the smallest possible unit in information processing. A bit can only assume two states - 0 or 1 (see also binary system).
Nibble
A nibble consists of 4 bits (half a byte). However, this size specification is rarely used. A nibble can be represented with just one hexadecimal digit.
Byte
A byte consists of 8 bits and thus contains numbers in the range 0 to 255.
Word
A word consists of two bytes or 16 bits. Thus, numbers from 0 to 65,535 can be stored in a word. For example, the DOS time is stored in this data size to an accuracy of 2 seconds.
DWord
A DWord (Double Word) comprises four bytes or 32 bits. This results in a number range from 0 to 4,294,967,295. In addition to "normal" numbers, date and time information is also stored in this format.
QWord
A QWord (Quad Word) consists of eight bytes or 64 bits. This data size is used to store hard drive capacities or particularly precise time information, for example.
Schematic representations from bit to byte and from byte to QWord:
Unit | Units per byte |
---|---|
Bit | 8 |
Nibble | 2 |
Byte | 1 |

Unit | Units per QWord |
---|---|
Byte | 8 |
Word | 4 |
DWord | 2 |
QWord | 1 |
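The number ranges given for the data sizes follow directly from the number of bits: n bits can hold the unsigned values 0 to 2^n - 1. A brief sketch (the names `SIZES_IN_BITS` and `max_unsigned` are illustrative):

```python
SIZES_IN_BITS = {"Nibble": 4, "Byte": 8, "Word": 16, "DWord": 32, "QWord": 64}

def max_unsigned(bits: int) -> int:
    """Largest unsigned number that fits into the given number of bits."""
    return (1 << bits) - 1

for name, bits in SIZES_IN_BITS.items():
    print(f"{name:6} {bits:2} bits: 0 to {max_unsigned(bits):,}")
```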
Endianness / Byte Order
When saving or reading data, the order in which the data is processed is crucial. This can occur from left to right or vice versa. This order is called endianness or byte order. Two types are distinguished: Intel and Motorola byte order.
While DWord values that express a number are usually stored in Intel format, positional information, such as file offsets, is usually expressed in Motorola format.
Intel / LSB / Little Endian
The Intel byte order is also called LSB (least significant byte first) or little endian. In the Intel byte order, the least significant byte is in the leftmost position (at the lowest address) and the most significant byte is in the rightmost position. For example, the byte sequence 01 00 results in the word value 1 and the byte sequence 00 01 in the value 256.
Motorola / MSB / Big Endian
The Motorola byte order is also called MSB (most significant byte first) or big endian. In Motorola byte order, the most significant byte is in the leftmost position (at the lowest address) and the least significant byte is in the rightmost position. Here, the byte sequence 00 01 results in the word value 1 and the byte sequence 01 00 in the value 256.
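The effect of the byte order can be checked with Python's built-in `int.from_bytes` and `int.to_bytes`. The following is a minimal sketch; `read_word` and `hex_bytes` are helper names made up for this example.

```python
def read_word(data: bytes, byte_order: str) -> int:
    """Interpret two bytes as an unsigned 16-bit word ('little' or 'big')."""
    return int.from_bytes(data[:2], byteorder=byte_order)

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hexadecimal values."""
    return " ".join(f"{byte:02X}" for byte in data)

stored = bytes([0x01, 0x00])        # the two bytes as they appear in a file
print(read_word(stored, "little"))  # 1   (Intel)
print(read_word(stored, "big"))     # 256 (Motorola)

# The same DWord value written in both byte orders:
value = 0x12345678
print(hex_bytes(value.to_bytes(4, "little")))  # 78 56 34 12
print(hex_bytes(value.to_bytes(4, "big")))     # 12 34 56 78
```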
Character sets and character encodings
A character set specifies which characters can be used. These include letters, numbers, punctuation, and special characters. In a character encoding, each character is assigned a unique number, called a character code. The character code usually starts at 0 (zero).
Types of character sets
Single-byte character sets (SBCS) store a character in one byte. Examples include ASCII, MS-DOS, and Windows code pages, as well as ISO-8859 character sets.
Multibyte character sets (MBCS) store a character in a varying number of bytes. These include UTF-8, UTF-16, and Big5.
Other character sets use a fixed number of bytes per character, such as 2 bytes for UCS-2 or 4 bytes for UTF-32.
UTF-16 does not use a fixed number of bytes: most characters occupy 2 bytes, but some require 4 bytes.
Common character sets
ASCII
The ASCII character set contains 128 characters. Seven bits are required to store them. The 8th bit is not used in ASCII. In addition to various special characters, it primarily includes uppercase and lowercase letters, numbers, and punctuation marks.
ANSI / 8-bit character sets
8-bit character sets are also known as ANSI character sets. These extend the ASCII character set by also using the 8th bit, allowing a total of 256 characters to be represented. The additional characters are usually used for national special characters and other symbols.
MS-DOS codepages
437 | English |
708 | Arabic (ASMO) |
720 | Arabic (Microsoft) |
737 | Greek |
775 | Baltic |
850 | Western European |
852 | Central European |
855 | Cyrillic |
857 | Turkish |
858 | Western European with Euro |
860 | Portuguese |
861 | Icelandic |
862 | Hebrew |
863 | Canadian French |
864 | Arabic (IBM) |
865 | Nordic |
866 | Russian |
869 | Greek |
Windows codepages
874 | Thai |
932 | Japanese |
936 | Simplified Chinese |
949 | Korean |
950 | Traditional Chinese |
1200 | Unicode UTF-16 LE |
1201 | Unicode UTF-16 BE |
1250 | Central European |
1251 | Cyrillic |
1252 | Western European |
1253 | Greek |
1254 | Turkish |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
1258 | Vietnamese |
12000 | Unicode UTF-32 LE |
12001 | Unicode UTF-32 BE |
65000 | Unicode UTF-7 |
65001 | Unicode UTF-8 |
ISO 8859
ISO 8859-1 | Latin-1, Western European |
ISO 8859-2 | Latin-2, Central European |
ISO 8859-3 | Latin-3, Southern European |
ISO 8859-4 | Latin-4, Northern European |
ISO 8859-5 | Cyrillic |
ISO 8859-6 | Arabic |
ISO 8859-7 | Greek |
ISO 8859-8 | Hebrew |
ISO 8859-9 | Latin-5, Turkish |
ISO 8859-10 | Latin-6, Nordic |
ISO 8859-11 | Thai |
ISO 8859-12 | (not assigned) |
ISO 8859-13 | Latin-7, Baltic |
ISO 8859-14 | Latin-8, Celtic |
ISO 8859-15 | Latin-9, Western European |
ISO 8859-16 | Latin-10, Southeastern European |
Unicode
Unicode is a character set whose characters occupy a variable number of bytes, depending on the encoding. The Unicode character set primarily uses the UTF-8 and UTF-16 encodings to store characters. UTF stands for Unicode Transformation Format.
UTF-8
UTF-8 uses a sequence of 8-bit numbers (one byte) to store characters. ASCII characters are stored unchanged. This means that letters, numbers, and punctuation marks only require one byte per character in UTF-8 encoding. All other characters, such as national symbols and characters from other writing systems, require two to four bytes per character.
In binary notation, all 1-byte characters begin with a zero. For multi-byte characters, the first byte begins with binary ones, followed by a binary zero. The number of binary ones corresponds to the number of bytes per character. Each following byte of the character begins with the bits 10.
Examples:
Character | UTF-8 bytes (binary) |
---|---|
Umlaut A (Ä) | 11000011 10000100 |
Euro character (€) | 11100010 10000010 10101100 |
Teddy bear (U+1F9F8) | 11110000 10011111 10100111 10111000 |
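The bit patterns of these examples can be reproduced with Python's built-in UTF-8 encoder. This is a short sketch; the function name `utf8_bits` is an assumption for this example.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of a character as groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))

print(utf8_bits("A"))           # 01000001
print(utf8_bits("\u00C4"))      # Umlaut A:   11000011 10000100
print(utf8_bits("\u20AC"))      # Euro sign:  11100010 10000010 10101100
print(utf8_bits("\U0001F9F8"))  # Teddy bear: 11110000 10011111 10100111 10111000
```

The number of leading ones in the first byte (2, 3 or 4) matches the number of bytes of the character.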
If a file saved in UTF-8 is interpreted as ISO 8859-1, the multi-byte UTF-8 characters will not be displayed correctly. For example, the German umlauts "ä ö ü" would be displayed as "Ã¤ Ã¶ Ã¼". If text in ISO 8859-1 is displayed as UTF-8, a question mark or the "�" character is displayed for all invalid characters.
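Both kinds of misinterpretation can be simulated by encoding a text with one character set and decoding it with the other. The sketch below uses only Python's standard codecs; the two function names are illustrative, and `ascii()` is used for printing so that the output does not depend on the terminal encoding.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Store text as UTF-8, then wrongly read it as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

def misread_latin1_as_utf8(text: str) -> str:
    """Store text as ISO 8859-1, then wrongly read it as UTF-8."""
    # errors="replace" inserts U+FFFD for every invalid byte sequence.
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")

umlauts = "\u00E4 \u00F6 \u00FC"                # a, o and u with umlaut
print(ascii(misread_utf8_as_latin1(umlauts)))   # every umlaut becomes two characters
print(ascii(misread_latin1_as_utf8(umlauts)))   # every umlaut becomes U+FFFD
```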
UTF-16
UTF-16 uses a sequence of 16-bit numbers (two bytes or 1 word) to store characters. All characters from U+0000 to U+FFFF occupy 2 bytes.
The characters U+10000 to U+10FFFF are each stored in 4 bytes. The number 65536 (0x10000) is subtracted from the character code. The resulting 20-bit number is divided into two 10-bit blocks. The first block, with the 10 most significant bits, is preceded by the bit sequence 110110. These 16 bits are called the high surrogate. The second block, with the 10 least significant bits, is preceded by the bit sequence 110111. These 16 bits are called the low surrogate.
20 bits (0 to 19):
Bits 19 to 10 | Bits 9 to 0 |
---|---|
1st block / higher-order bits | 2nd block / low-order bits |
110110 + bits of the 1st block = high surrogate | 110111 + bits of the 2nd block = low surrogate |
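The surrogate calculation can be written out in a few lines. The following sketch follows the steps described above; `to_surrogates` is a name invented for this example. The prefix 110110 corresponds to the value 0xD800 and the prefix 110111 to 0xDC00.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point from U+10000 to U+10FFFF into high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need surrogates")
    value = code_point - 0x10000    # 20-bit number
    high = 0xD800 | (value >> 10)   # 110110 + 10 most significant bits
    low = 0xDC00 | (value & 0x3FF)  # 110111 + 10 least significant bits
    return high, low

high, low = to_surrogates(0x1F9F8)  # teddy bear
print(f"{high:04X} {low:04X}")      # D83E DDF8
```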
Endianness / Byte Order
The byte order (endianness) determines the position of the 8 bits of an ASCII character within the 16 bits of the UTF-16 word.
With Big Endian (UTF-16BE), a 0 byte is prefixed to the ASCII character.
With Little Endian (UTF-16LE), a 0 byte is appended to the ASCII character.
The high surrogate word always precedes the low surrogate word, regardless of the byte order.
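A quick way to see both rules is to encode the same characters with Python's `utf-16-be` and `utf-16-le` codecs (a sketch; `hex_bytes` is a small helper defined here only for readable output):

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hexadecimal values."""
    return " ".join(f"{byte:02X}" for byte in data)

# ASCII character "A" (0x41): the 0 byte comes first or last.
print(hex_bytes("A".encode("utf-16-be")))  # 00 41
print(hex_bytes("A".encode("utf-16-le")))  # 41 00

# Teddy bear (U+1F9F8): the high surrogate word (D83E) stays in front,
# only the two bytes inside each word are swapped.
print(hex_bytes("\U0001F9F8".encode("utf-16-be")))  # D8 3E DD F8
print(hex_bytes("\U0001F9F8".encode("utf-16-le")))  # 3E D8 F8 DD
```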
Byte Order Mark
The Byte Order Mark (BOM for short) is located at the beginning of a file and indicates the character encoding used in the subsequent text.
Encoding | Byte Order Mark (hexadecimal) |
---|---|
UTF-8 | EF BB BF |
UTF-16 BE | FE FF |
UTF-16 LE | FF FE |
UTF-32 BE | 00 00 FE FF |
UTF-32 LE | FF FE 00 00 |
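A simple BOM check could look like the following sketch. It uses the BOM constants from Python's `codecs` module; the function name `detect_bom` and the returned labels are assumptions for this example. Note that the UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark, so the longer marks have to be tested first.

```python
import codecs

# Longer marks first, because FF FE 00 00 also starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]

def detect_bom(data: bytes):
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xEF\xBB\xBFTest"))  # UTF-8
print(detect_bom(b"\xFF\xFET\x00"))     # UTF-16 LE
print(detect_bom(b"Test"))              # None
```

A missing BOM does not mean that the text is not Unicode; many UTF-8 files are stored without one.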
Tip:
The pages "ASCII and ANSI Character Table" and "Unicode Character Tables" contain character tables for ASCII characters and the most important Unicode characters.
Encoding
In certain situations, binary data must be converted into other characters. This can be for technical reasons, such as data transmission over the Internet, or for practical reasons, such as with a hexadecimal editor.
Conversion to another number system
Encoding is usually just the conversion from one number system to another. For example, from a decimal ASCII/ANSI character code to the hexadecimal system.
The decimal character codes of the word "Test" would be represented as follows in different number systems:
Number system | Base | Representation of the word "Test" |
---|---|---|
Decimal | 10 | 84 101 115 116 |
Hexadecimal | 16 | 54 65 73 74 |
Octal | 8 | 124 145 163 164 |
Binary | 2 | 01010100 01100101 01110011 01110100 |
In hexadecimal notation, the separators are sometimes omitted, e.g. "54657374".
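The rows of the table can be generated from the character codes of the word. A minimal sketch, with `represent` as an illustrative helper name:

```python
def represent(text: str, number_format: str) -> str:
    """Format the ASCII character codes of a text in the given number format."""
    return " ".join(format(code, number_format) for code in text.encode("ascii"))

print(represent("Test", "d"))    # 84 101 115 116
print(represent("Test", "X"))    # 54 65 73 74
print(represent("Test", "o"))    # 124 145 163 164
print(represent("Test", "08b"))  # 01010100 01100101 01110011 01110100

# Hexadecimal notation without separators:
print("Test".encode("ascii").hex())  # 54657374
```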
Base64
Base64 (B64) is a very common encoding. The base of this encoding is 64, which means it uses 64 characters to represent data. These include all lowercase and uppercase letters, the digits 0 through 9, the plus sign (+), and the forward slash (/). During encoding, every 3 bytes of binary data are converted into 4 encoded characters. If the length of the data is not a multiple of 3, the output is padded with equal signs (=) to a multiple of 4 characters.
Depending on the application, the Base64 alphabet, especially the order of letters and numbers, can vary.
Example:
This is a Test
As Base64, this results in:
VGhpcyBpcyBhIFRlc3Q=
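The example can be reproduced with Python's `base64` module. The second part of this sketch shows one well-known alphabet variant, the URL-safe alphabet, which replaces "+" and "/" with "-" and "_"; the two-byte input there is chosen only because it produces exactly these characters.

```python
import base64

text = "This is a Test"
encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
print(encoded)                                     # VGhpcyBpcyBhIFRlc3Q=
print(base64.b64decode(encoded).decode("ascii"))   # This is a Test

# 14 input bytes are not a multiple of 3, hence the "=" padding:
print(len(text), len(encoded))                     # 14 20

# Standard alphabet versus URL-safe alphabet:
print(base64.b64encode(b"\xFB\xFF").decode("ascii"))          # +/8=
print(base64.urlsafe_b64encode(b"\xFB\xFF").decode("ascii"))  # -_8=
```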
Quoted-Printable
Quoted-Printable (QP) is mostly used on the internet, for example, in e-mails. It encodes some ASCII characters (0-127) and all characters from 128-255. An encoded character is written as an equal sign (=) followed by the character's two-digit hexadecimal value. An equal sign at the end of a line marks a line break that was inserted during encoding and is removed during decoding.
Example:
Ein großer Test mit äöü. 5 + 10 = 15 Leerzeichen am Ende
Encoded as Quoted-Printable (with ISO 8859-1 as the character set), this results in:
Ein gro=DFer Test= mit =E4=F6=FC. 5 + 10 =3D 15 Leerzeichen am Ende=20
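Decoding can be tried with Python's `quopri` module. The sketch below uses a shortened sample in the style of the example, with a soft line break after "Test" and an encoded trailing space. It assumes the text was originally stored as ISO 8859-1, since the codes DF, E4, F6 and FC match that character set; `qp_decode` is an illustrative name.

```python
import quopri

def qp_decode(encoded: bytes, charset: str = "iso-8859-1") -> str:
    """Decode Quoted-Printable bytes and interpret them in the given charset."""
    return quopri.decodestring(encoded).decode(charset)

# "=" at the end of the first line is a soft line break and disappears.
sample = b"gro=DFer Test=\n mit =E4=F6=FC. 5 + 10 =3D 15=20"
decoded = qp_decode(sample)
print(ascii(decoded))  # printed with escapes to stay independent of the terminal

# Encoding and decoding again returns the original bytes.
raw = "gro\u00DFer Test".encode("iso-8859-1")
print(quopri.decodestring(quopri.encodestring(raw)) == raw)  # True
```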
URL encoding
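To pass reserved or invalid characters in a URL, they are encoded using a percent sign followed by the two-digit hexadecimal value of each byte of the character. Non-ASCII characters are usually converted to UTF-8 first, so a single character can result in several encoded bytes.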
Example:
.../ein test/äöü/groß/
Resulting in URL encoding:
.../ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/
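The same result is produced by `urllib.parse.quote` from Python's standard library, which converts to UTF-8 by default and leaves the forward slash unencoded. In this sketch the elided beginning of the example path is left out.

```python
from urllib.parse import quote, unquote

path = "/ein test/\u00E4\u00F6\u00FC/gro\u00DF/"  # /ein test/ + umlauts a, o, u + /gross with sharp s/
encoded = quote(path)
print(encoded)  # /ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/

# unquote() reverses the encoding.
print(unquote(encoded) == path)  # True
```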
Date formats and time formats
DOS date / DOS time
Under DOS (FAT-16), a date value and a time value are each stored in 2 bytes (1 WORD or 16 bits). The bits are divided as follows:
DOS date:
Bits | Description |
---|---|
0 - 4 | Day of the month (1 to 31) |
5 - 8 | Month (1 = January, 2 = February, etc.) |
9 - 15 | Year offset starting from 1980 (0 = 1980, 1 = 1981, etc.) |
DOS time:
Bits | Description |
---|---|
0 - 4 | Seconds divided by 2 |
5 - 10 | Minutes (0 to 59) |
11 - 15 | Hours (0 to 23) |
Possible date range: 1980-01-01 to 2107-12-31 (the 7-bit year offset ranges from 0 to 127)
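The bit layout of the two words can be unpacked with shifts and masks. This is a minimal sketch; the function names and the sample values 0x586F and 0x6DAF were chosen for illustration.

```python
def decode_dos_date(value: int) -> tuple:
    """Split a 16-bit DOS date into (year, month, day)."""
    day = value & 0x1F           # bits 0 - 4
    month = (value >> 5) & 0x0F  # bits 5 - 8
    year = 1980 + (value >> 9)   # bits 9 - 15
    return year, month, day

def decode_dos_time(value: int) -> tuple:
    """Split a 16-bit DOS time into (hours, minutes, seconds)."""
    seconds = (value & 0x1F) * 2   # bits 0 - 4, stored as seconds / 2
    minutes = (value >> 5) & 0x3F  # bits 5 - 10
    hours = value >> 11            # bits 11 - 15
    return hours, minutes, seconds

print(decode_dos_date(0x586F))  # (2024, 3, 15)
print(decode_dos_time(0x6DAF))  # (13, 45, 30)
print(decode_dos_date(0xFF9F))  # (2107, 12, 31)
```

Because the seconds are stored divided by 2, only even seconds can be represented.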
Unix timestamp
The Unix timestamp, also known as POSIX time, Unix epoch, C time, or time_t, is traditionally stored as a signed DWORD (4 bytes or 32 bits) and is used specifically in the UNIX environment. It indicates the number of seconds that have passed since 00:00:00 on January 1, 1970, and is usually expressed in UTC time.
Possible date range: 1970-01-01 00:00:00 to 2038-01-19 03:14:07 for 32-bit systems
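The limits of the range can be verified with Python's `datetime` module. In this sketch, `from_unix` is an illustrative helper; the upper limit corresponds to the largest positive value of a signed 32-bit integer.

```python
from datetime import datetime, timezone

def from_unix(timestamp: int) -> datetime:
    """Convert a Unix timestamp (seconds since 1970-01-01) to a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(from_unix(0))          # 1970-01-01 00:00:00+00:00
print(from_unix(2**31 - 1))  # 2038-01-19 03:14:07+00:00
```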
Windows FILETIME
The Windows FILETIME is stored as a structure consisting of two DWORD values (32-bit unsigned integers), 8 bytes or 64 bits in total. It corresponds to the number of intervals of 100 nanoseconds since January 1, 1601 (UTC).
Possible date range: 1601-01-01 00:00:00 to 30828-09-14 02:48:05 (for the largest positive 64-bit signed value, 0x7FFFFFFFFFFFFFFF)
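A conversion of the two DWORD values into a readable date might look like this sketch. The function name and the parameters `low` and `high` are assumptions for this example. Python's `datetime` only supports years up to 9999 and a resolution of microseconds, so the far end of the FILETIME range cannot be shown this way and the last digit of the value is dropped.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(low: int, high: int) -> datetime:
    """Combine the low and high DWORD and convert the result to a UTC datetime."""
    intervals = (high << 32) | low  # number of 100-nanosecond intervals
    # 10 intervals of 100 ns are one microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=intervals // 10)

# FILETIME value of the Unix epoch (11,644,473,600 seconds after 1601-01-01):
unix_epoch = 116_444_736_000_000_000
print(filetime_to_datetime(unix_epoch & 0xFFFFFFFF, unix_epoch >> 32))
# 1970-01-01 00:00:00+00:00
```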
Windows SYSTEMTIME
The Windows SYSTEMTIME is stored as a structure consisting of 8 WORD values. It contains each date and time component (year, month, day, day of the week, hour, minute, second, and millisecond) as a separate 16-bit integer. The time is specified either in the local time zone or in UTC, depending on the function called.
Possible date range: 1601-01-01 00:00:00.000 to 30827-12-31 23:59:59.999
Mach Absolute Time
Mach Absolute Time, also known as Mac Absolute Time, Apple Cocoa Core Data Timestamp, or CFAbsoluteTime, represents a point in time relative to the absolute reference time, January 1, 2001, 00:00:00 GMT. The time value is often stored as a 32-bit signed integer; CFAbsoluteTime itself is defined as a double-precision floating-point number. Time is measured by the number of seconds between the reference time and the specified time. Negative values indicate a time before the reference time, positive values indicate a time after the reference time.
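Converting such a value only requires adding the seconds to the reference time. A short sketch, with `mac_absolute_to_datetime` as an illustrative name:

```python
from datetime import datetime, timedelta, timezone

MAC_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def mac_absolute_to_datetime(seconds: float) -> datetime:
    """Convert seconds relative to 2001-01-01 00:00:00 UTC to a datetime."""
    return MAC_EPOCH + timedelta(seconds=seconds)

print(mac_absolute_to_datetime(0))       # 2001-01-01 00:00:00+00:00
print(mac_absolute_to_datetime(-86400))  # 2000-12-31 00:00:00+00:00

# Offset between the Unix epoch and the reference time, in seconds:
unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(int((MAC_EPOCH - unix_epoch).total_seconds()))  # 978307200
```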
ISO 8601
The ISO 8601 time format is a text-based representation of a date and time value. The format is "YYYY-MM-DDThh:mm:ss" (4-digit year, followed by the 2-digit month and day, the letter T as separator, and the hours, minutes, and seconds). In practice, a space is often written in place of the T, as in "YYYY-MM-DD hh:mm:ss".
Possible date range: 0001-01-01 00:00:00 to 9999-12-31 23:59:59
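Python's `datetime` class reads and writes this text format directly. A brief sketch with an arbitrary sample date:

```python
from datetime import datetime

moment = datetime.fromisoformat("2024-03-15 13:45:30")
print(moment.year, moment.month, moment.day)  # 2024 3 15

print(moment.isoformat())         # 2024-03-15T13:45:30
print(moment.isoformat(sep=" "))  # 2024-03-15 13:45:30
```

Because the components are ordered from the largest unit to the smallest, sorting such strings alphabetically also sorts them chronologically.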
Slack
File slack, RAM slack and drive slack
The storage space of a data storage device (a so-called block-oriented mass storage device) is divided into sectors. Typically, hard drives use 512 bytes, CDs and DVDs 2048 bytes, and newer hard drives and SSDs (solid-state drives) 4096 bytes per sector.
Clusters are groups of one to 128 sectors. Clusters are used to store files, with a file always starting at the beginning of a cluster.
The following table assumes that a cluster consists of 8 sectors and that the file content is not a multiple of the sector size and ends within the 6th sector. Since data is written to the storage device sector by sector, the missing bytes must be filled up to the end of the sector. Previously, random RAM contents were used for this purpose, which is why this area is called "RAM slack". The remaining sectors of the cluster (here sectors 7 and 8) form the drive slack. It is usually not overwritten and usually still contains the data that was located there before the file was saved and that previously belonged to a file that has since been deleted.
Sectors 1 - 5 | Sector 6 | Sector 7 | Sector 8 |
---|---|---|---|
File content | Last part of file content, then RAM slack | Drive slack | Drive slack |
 | File slack (from the end of the file) | File slack | File slack |
File slack encompasses the area from the end of the file to the end of the last used cluster of the file (RAM slack + drive slack).
File slack is sometimes referred to as file offset or slack space.
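The sizes of the individual slack areas can be calculated from the file size. The following sketch assumes 512-byte sectors and 8 sectors per cluster as in the table; the function name, the parameter names and the sample file size of 2660 bytes are illustrative.

```python
def slack_sizes(file_size: int, sector_size: int = 512,
                sectors_per_cluster: int = 8) -> dict:
    """Return the RAM slack, drive slack and file slack of a file in bytes."""
    cluster_size = sector_size * sectors_per_cluster
    ram_slack = -file_size % sector_size    # rest of the last used sector
    file_slack = -file_size % cluster_size  # rest of the last used cluster
    return {
        "ram_slack": ram_slack,
        "drive_slack": file_slack - ram_slack,
        "file_slack": file_slack,
    }

# A file of 2660 bytes fills 5 sectors and ends 100 bytes into the 6th sector.
print(slack_sizes(2660))
# {'ram_slack': 412, 'drive_slack': 1024, 'file_slack': 1436}
```

The drive slack of 1024 bytes corresponds to the two untouched sectors 7 and 8.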
MFT slack
In NTFS, information about a file is stored in a file record. This contains data such as the file name, file times, file size, and position in the file system. At least 1 KB or the cluster size (usually 4 KB) is reserved for a file record. If the file to be stored is not larger than the unused space in the file record, the contents are stored directly in the file record. If the file is enlarged and there is no more space for it in the file record, the file is stored in the data area of the partition. The original file contents can then remain in the file record.
Partition slack
Partition slack is the unused space between two partitions and after the last partition up to the end of the disk. If data was already stored in these areas before partitioning, it may have been retained.