Basics

Some basic topics such as number systems, character sets and encodings are covered here.

Were the free content on my website helpful for you?
Support the further free publication with a donation via PayPal.

Read more about support options...

Number systems

Decimal system

The decimal system is the standard system for representing numbers. Its base is 10, and it uses the digits 0 through 9.

Binary system

The binary system, or dual system, is primarily used in electronic data processing (base 2). The two possible digits in a binary value are 0 and 1. By stringing the digits 0 and 1 together, any number can be expressed:

Number  Binary value
0       0
1       1
2       10
3       11
4       100
12      1100
64      1000000
128     10000000
255     11111111

Octal system

The octal system uses only the digits 0 through 7 to represent numbers (base 8). The decimal number 8 thus results in the octal value 10. This system is primarily used for file permissions in Linux-based operating systems and FTP programs.

Hexadecimal system

In the hexadecimal system, the digits 0 to 9 and the letters A to F are used to represent numbers (base 16). "A" corresponds to the number 10 and "F" to the number 15. Thus, numbers from 0 to 255, which corresponds to one byte, can be expressed with a two-digit value—from "00" to "FF". Since a hexadecimal number cannot always be distinguished from a decimal number, a hexadecimal number is denoted by a leading "0x" or trailing "h", for example, 0x47 or 47h, 0x00, or 0h.
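The conversions between the four number systems can be tried out with a few lines of Python. This is only a minimal sketch using built-in functions; the variable names are chosen for illustration.

```python
# Convert one decimal number into the other number systems.
number = 255

binary = format(number, "b")       # base 2
octal = format(number, "o")        # base 8
hexadecimal = format(number, "X")  # base 16, uppercase digits

print(binary, octal, hexadecimal)  # 11111111 377 FF

# int() with an explicit base converts the text back into a number.
print(int("11111111", 2), int("377", 8), int("FF", 16))  # 255 255 255

# Python also uses the leading "0x" to mark a hexadecimal value.
print(0x47)  # 71
```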

Data sizes

Bit

A bit (binary digit) is the smallest possible unit in information processing. A bit can only assume two states - 0 or 1 (see also binary system).

Nibble

A nibble consists of 4 bits (half a byte). However, this size specification is rarely used. A nibble can be represented with just one hexadecimal digit.

Byte

A byte consists of 8 bits and can therefore hold values in the range 0 to 255.

Word

A word consists of two bytes or 16 bits. Thus, numbers from 0 to 65,535 can be stored in a word. For example, the DOS time is stored in this data size to an accuracy of 2 seconds.

DWord

A DWord (Double Word) comprises four bytes or 32 bits. This results in a number range from 0 to 4,294,967,295. In addition to "normal" numbers, date and time information is also stored in this format.

QWord

A QWord (Quad Word) consists of eight bytes or 64 bits. This data size is used to store hard drive capacities or particularly precise time information, for example.

Schematic representations from bit to byte and from byte to QWord:

Bit | Bit | Bit | Bit | Bit | Bit | Bit | Bit
Nibble                | Nibble
Byte

Byte | Byte | Byte | Byte | Byte | Byte | Byte | Byte
Word        | Word        | Word        | Word
DWord                     | DWord
QWord
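The number ranges given for the individual data sizes follow from the number of bits: n bits allow values from 0 to 2^n - 1. The following sketch illustrates this for unsigned values; the function name max_unsigned is an assumption made for this example.

```python
# Largest unsigned value per data size: 2**bits - 1.
SIZES = {"Nibble": 4, "Byte": 8, "Word": 16, "DWord": 32, "QWord": 64}

def max_unsigned(bits):
    """Return the largest value that fits into the given number of bits."""
    return 2**bits - 1

for name, bits in SIZES.items():
    print(f"{name:6} {bits:2} bits  0 to {max_unsigned(bits):,}")
```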

Endianness / Byte Order

When saving or reading data, the order in which the data is processed is crucial. This can occur from left to right or vice versa. This order is called endianness or byte order. Two types are distinguished: Intel and Motorola byte order.

While DWord values that express a number are usually stored in Intel format, positional information, such as file offsets, is usually expressed in Motorola format.

Intel / LSB / Little Endian

The Intel byte order is also called LSB (least significant byte first) or little endian. In the Intel byte order, the least significant byte is in the left position and the most significant byte is in the right position. For example, a word stored as the byte sequence 01 00 results in the value 1, and a word stored as 00 01 results in the value 256.

Motorola / MSB / Big Endian

The Motorola byte order is also called MSB (most significant byte first) or big endian. In Motorola byte order, the least significant byte is in the rightmost position and the most significant byte is in the leftmost position. Here, a word stored as the byte sequence 00 01 results in the value 1, and a word stored as 01 00 results in the value 256.
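The difference between the two byte orders can be checked with Python's built-in int.from_bytes and int.to_bytes. The sketch below reads the same two stored bytes once in Intel and once in Motorola order; the variable names are illustrative.

```python
# The two bytes exactly as they are stored, from left to right.
stored = bytes([0x01, 0x00])

intel = int.from_bytes(stored, byteorder="little")  # least significant byte first
motorola = int.from_bytes(stored, byteorder="big")  # most significant byte first

print(intel, motorola)  # 1 256

# Writing the value 256 as a word in both byte orders.
print((256).to_bytes(2, byteorder="little").hex(" "))  # 00 01
print((256).to_bytes(2, byteorder="big").hex(" "))     # 01 00
```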

Character sets and character encodings

A character set specifies which characters can be used. These include letters, numbers, punctuation, and special characters. In a character encoding, each character is assigned a unique number, called a character code. The character code usually starts at 0 (zero).

Types of character sets

Single-byte character sets (SBCS) store a character in one byte. Examples include ASCII, MS-DOS, and Windows code pages, as well as ISO-8859 character sets.

Multibyte character sets (MBCS) store a character in a varying number of bytes. These include UTF-8, UTF-16, and Big5.

Other character sets use a fixed number of bytes, such as 2 bytes for UCS-2 or 4 bytes for UTF-32.

UTF-16 does not use a fixed number of bytes: most characters take 2 bytes, but some take 4 bytes.
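How many bytes a character occupies can be checked by encoding it and counting the bytes. The following is a small sketch under the assumption that Python's codec names stand in for the character sets mentioned; the "-le" variants are used so that no byte order mark is added to the result. The helper byte_lengths is hypothetical.

```python
# Number of bytes one character occupies in different encodings.
ENCODINGS = ("iso-8859-1", "utf-8", "utf-16-le", "utf-32-le")

def byte_lengths(char):
    """Return {encoding: byte count}; None if the character cannot be encoded."""
    result = {}
    for encoding in ENCODINGS:
        try:
            result[encoding] = len(char.encode(encoding))
        except UnicodeEncodeError:
            result[encoding] = None
    return result

for char in ("A", "Ä", "€", "\U0001F9F8"):
    print(repr(char), byte_lengths(char))
```

The single-byte set either needs exactly one byte or cannot represent the character at all, UTF-32 always needs 4 bytes, and UTF-8 and UTF-16 vary.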

Common character sets

ASCII

The ASCII character set contains 128 characters. Seven bits are required to store them. The 8th bit is not used in ASCII. In addition to various special characters, it primarily includes uppercase and lowercase letters, numbers, and punctuation marks.

ANSI / 8-bit character sets

8-bit character sets are also known as ANSI character sets. These extend the ASCII character set by also using the 8th bit, allowing a total of 256 characters to be represented. The additional characters are usually used for national special characters and other symbols.

MS-DOS codepages
437    English
708    Arabic (ASMO)
720    Arabic (Microsoft)
737    Greek
775    Baltic
850    Western European
852    Central European
855    Cyrillic
857    Turkish
858    Western European with Euro
860    Portuguese
861    Icelandic
862    Hebrew
863    Canadian French
864    Arabic (IBM)
865    Nordic
866    Russian
869    Greek

Windows codepages
874    Thai
932    Japanese
936    Simplified Chinese
949    Korean
950    Traditional Chinese
1200   Unicode UTF-16 LE
1201   Unicode UTF-16 BE
1250   Central European
1251   Cyrillic
1252   Western European
1253   Greek
1254   Turkish
1255   Hebrew
1256   Arabic
1257   Baltic
1258   Vietnamese
12000  Unicode UTF-32 LE
12001  Unicode UTF-32 BE
65000  Unicode UTF-7
65001  Unicode UTF-8

ISO 8859
ISO 8859-1   Latin-1, Western European
ISO 8859-2   Latin-2, Central European
ISO 8859-3   Latin-3, Southern European
ISO 8859-4   Latin-4, Northern European
ISO 8859-5   Cyrillic
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
ISO 8859-9   Latin-5, Turkish
ISO 8859-10  Latin-6, Nordic
ISO 8859-11  Thai
ISO 8859-12  (not assigned)
ISO 8859-13  Latin-7, Baltic
ISO 8859-14  Latin-8, Celtic
ISO 8859-15  Latin-9, Western European
ISO 8859-16  Latin-10, Southeastern European

Unicode

Unicode is a character set whose characters occupy a varying number of bytes, depending on the encoding used. Unicode text is primarily stored in the UTF-8 and UTF-16 encodings. UTF stands for Unicode Transformation Format.

UTF-8

UTF-8 uses a sequence of 8-bit numbers (one byte) to store characters. ASCII characters are stored unchanged. This means that letters, numbers, and punctuation marks only require one byte per character in UTF-8 encoding. All other characters, such as national symbols and characters from other writing systems, require two to four bytes per character.

In binary notation, all 1-byte characters begin with a zero. For multi-byte characters, the first byte begins with binary ones, followed by a binary zero. The number of binary ones corresponds to the number of bytes per character. Each of the following bytes begins with the bits 10.

Examples:
Umlaut A (Ä)          11000011 10000100
Euro character (€)    11100010 10000010 10101100
Teddy bear (U+1F9F8)  11110000 10011111 10100111 10111000

If a file saved in UTF-8 is interpreted as ISO 8859-1, UTF-8 characters will not be displayed correctly. For example, the German umlauts "ä ö ü" would be displayed as "Ã¤ Ã¶ Ã¼". If text in ISO 8859-1 is displayed as UTF-8, a question mark or the "�" character is displayed for all invalid characters.
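Both the bit patterns and the display errors can be reproduced in Python. The sketch below prints the UTF-8 bytes of a character in binary notation and then decodes text with the wrong character set in both directions; the helper name utf8_bits is an assumption for this example.

```python
def utf8_bits(text):
    """Return the UTF-8 bytes of the text in binary notation."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

print(utf8_bits("A"))  # 01000001
print(utf8_bits("Ä"))  # 11000011 10000100
print(utf8_bits("€"))  # 11100010 10000010 10101100

# UTF-8 bytes interpreted as ISO 8859-1: every byte becomes one character.
misread = "ä ö ü".encode("utf-8").decode("iso-8859-1")
print(misread)  # Ã¤ Ã¶ Ã¼

# ISO 8859-1 bytes interpreted as UTF-8: the invalid bytes are replaced
# by the replacement character U+FFFD.
replaced = "ä ö ü".encode("iso-8859-1").decode("utf-8", errors="replace")
print(replaced)
```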

UTF-16

UTF-16 uses a sequence of 16-bit numbers (two bytes or 1 word) to store characters. All characters from U+0000 to U+FFFF occupy 2 bytes.

The characters U+10000 to U+10FFFF are each stored in 4 bytes. First, the number 65536 (0x10000) is subtracted from the character code. The resulting 20-bit number is divided into two 10-bit blocks. The first block, with the 10 most significant bits, is preceded by the bit sequence 110110. These 16 bits are called the high surrogate. The second block, with the 10 least significant bits, is preceded by the bit sequence 110111. These 16 bits are called the low surrogate.

20 bits (0 to 19)
19 18 17 16 15 14 13 12 11 10 | 9 8 7 6 5 4 3 2 1 0
1st block / higher-order bits | 2nd block / low-order bits

110110 + bits of the 1st block = high surrogate
110111 + bits of the 2nd block = low surrogate
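The calculation of the two surrogate words can be sketched as follows. The function name to_surrogates is hypothetical; the result is compared with Python's own UTF-16 encoder as a plausibility check.

```python
def to_surrogates(code_point):
    """Split a character code above U+FFFF into high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF use surrogates")
    value = code_point - 0x10000         # 20-bit number
    high = 0xD800 | (value >> 10)        # 110110 + the 10 most significant bits
    low = 0xDC00 | (value & 0x3FF)       # 110111 + the 10 least significant bits
    return high, low

high, low = to_surrogates(0x1F9F8)       # teddy bear
print(f"{high:04X} {low:04X}")           # D83E DDF8

# Same result as the built-in encoder (big endian, without byte order mark).
print("\U0001F9F8".encode("utf-16-be").hex(" "))  # d8 3e dd f8
```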

Endianness / Byte Order

The byte order (endianness) determines the position of the 8 bits of an ASCII character within the 16 bits of the UTF-16 word.

With Big Endian (UTF-16BE), a 0 byte is prefixed to the ASCII character.
With Little Endian (UTF-16LE), a 0 byte is appended to the ASCII character.

The high surrogate word always precedes the low surrogate word, regardless of the byte order.

Byte Order Mark

The Byte Order Mark (BOM for short) is located at the beginning of a file and indicates the character encoding used in the subsequent text.

Encoding   Byte Order Mark (hexadecimal)
UTF-8      EF BB BF
UTF-16 BE  FE FF
UTF-16 LE  FF FE
UTF-32 BE  00 00 FE FF
UTF-32 LE  FF FE 00 00
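A byte order mark can be recognized by comparing the first bytes of a file with the values from the table. The following is a minimal sketch; the function name detect_bom is an assumption. The 4-byte marks have to be checked first, because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.

```python
# Byte order marks, longest first so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (b"\x00\x00\xFE\xFF", "UTF-32 BE"),
    (b"\xFF\xFE\x00\x00", "UTF-32 LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
    (b"\xFE\xFF", "UTF-16 BE"),
    (b"\xFF\xFE", "UTF-16 LE"),
]

def detect_bom(data):
    """Return the encoding named by the byte order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xEF\xBB\xBFTest"))  # UTF-8
print(detect_bom(b"\xFF\xFET\x00"))     # UTF-16 LE
print(detect_bom(b"Test"))              # None
```

A file without a byte order mark returns None here; its encoding then has to be determined in another way.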

Tip:
The pages "ASCII and ANSI Character Table" and "Unicode Character Tables" contain character tables for ASCII characters and the most important Unicode characters.

Encoding

In certain situations, binary data must be converted into other characters. This can be for technical reasons, such as data transmission over the Internet, or for practical reasons, such as with a hexadecimal editor.

Conversion to another number system

Encoding is usually just the conversion from one number system to another. For example, from a decimal ASCII/ANSI character code to the hexadecimal system.

The decimal character codes of the word "Test" would be represented as follows in different number systems:

Number system  Base  Representation of the word "Test"
Decimal        10    84 101 115 116
Hexadecimal    16    54 65 73 74
Octal          8     124 145 163 164
Binary         2     01010100 01100101 01110011 01110100

In hexadecimal notation, the separators are sometimes omitted, e.g. "54657374".
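The representations of the word "Test" can be generated with format specifiers. This short sketch assumes ASCII character codes; the variable names are illustrative.

```python
codes = "Test".encode("ascii")  # the character codes as bytes

decimal = " ".join(str(code) for code in codes)
hexadecimal = " ".join(f"{code:02X}" for code in codes)
octal = " ".join(f"{code:o}" for code in codes)
binary = " ".join(f"{code:08b}" for code in codes)

print(decimal)      # 84 101 115 116
print(hexadecimal)  # 54 65 73 74
print(octal)        # 124 145 163 164
print(binary)       # 01010100 01100101 01110011 01110100

# Hexadecimal notation without separators.
print(codes.hex().upper())  # 54657374
```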

Base64

Base64 (B64) is a very common encoding. Its base is 64, which means it uses 64 characters to represent data: all uppercase and lowercase letters, the digits 0 through 9, the plus sign (+), and the forward slash (/). During encoding, every 3 bytes of binary data are converted into 4 encoded characters. If the length of the data is not a multiple of 3, the result is padded with equal signs (=) to a multiple of 4 characters.

Depending on the application, the Base64 alphabet, especially the order of letters and numbers, can vary.

Example:
This is a Test

As Base64, this results in:

VGhpcyBpcyBhIFRlc3Q=
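The example can be reproduced with the base64 module from the Python standard library. The sketch also shows the URL-safe variant as one example of a modified alphabet, in which "+" and "/" are replaced by "-" and "_".

```python
import base64

text = "This is a Test"
encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
print(encoded)  # VGhpcyBpcyBhIFRlc3Q=

# 14 bytes are padded to 5 groups of 4 characters.
print(len(text), len(encoded))  # 14 20

decoded = base64.b64decode(encoded).decode("ascii")
print(decoded)  # This is a Test

# Standard alphabet and URL-safe alphabet for the same two bytes.
print(base64.b64encode(b"\xfb\xff"))          # b'+/8='
print(base64.urlsafe_b64encode(b"\xfb\xff"))  # b'-_8='
```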

Quoted-Printable

Quoted-Printable (QP) is mostly used on the internet, for example, in e-mails. It encodes some ASCII characters (0-127) and all characters from 128-255. An encoded character is written as an equal sign (=) followed by the character's two-digit hexadecimal value. An equal sign at the end of a line marks a line break that was inserted during encoding and is removed during decoding.

Example:
Ein großer Test mit äöü.
5 + 10 = 15
Leerzeichen am Ende

Encoded as Quoted-Printable, this results in:

Ein gro=DFer Test=
 mit =E4=F6=FC.
5 + 10 =3D 15
Leerzeichen am Ende=20
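The encoded example can be decoded again with the quopri module from the Python standard library. The sketch assumes that the text was originally stored in ISO 8859-1, which matches the hexadecimal values DF, E4, F6 and FC used above.

```python
import quopri

encoded = (
    b"Ein gro=DFer Test=\n"
    b" mit =E4=F6=FC.\n"
    b"5 + 10 =3D 15\n"
    b"Leerzeichen am Ende=20\n"
)

# Decoding removes the soft line break ("=" at the end of the first line)
# and turns "=20" back into the trailing space.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")
print(decoded)

# Encoding single pieces of the example.
print(quopri.encodestring("äöü".encode("iso-8859-1")))  # b'=E4=F6=FC'
print(quopri.encodestring(b"5 + 10 = 15"))              # b'5 + 10 =3D 15'
```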

URL encoding

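To pass reserved or invalid characters in a URL, they are encoded using a percent sign followed by the hexadecimal value of that character. Characters outside the ASCII range are usually converted to UTF-8 first, and each resulting byte is encoded separately.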

Example:
.../ein test/äöü/groß/

Resulting in URL encoding:

.../ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/
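In Python, the functions quote and unquote from urllib.parse perform this encoding. The sketch below uses the path from the example without the leading "..."; by default, quote leaves the slash untouched and encodes other characters as UTF-8.

```python
from urllib.parse import quote, unquote

path = "/ein test/äöü/groß/"

encoded = quote(path)
print(encoded)  # /ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/

print(unquote(encoded))  # /ein test/äöü/groß/

# With a single-byte character set, each umlaut yields only one encoded byte.
print(quote("äöü", encoding="iso-8859-1"))  # %E4%F6%FC
```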

Date formats and time formats

DOS date / DOS time

Under DOS (FAT-16), a date value and a time value are each stored in 2 bytes (1 WORD or 16 bits). The bits are divided as follows:

DOS date:
Bits     Description
0 - 4    Day of the month (1 to 31)
5 - 8    Month (1 = January, 2 = February, etc.)
9 - 15   Year offset starting from 1980 (0 = 1980, 1 = 1981, etc.)

DOS time:
Bits     Description
0 - 4    Seconds divided by 2
5 - 10   Minutes (0 to 59)
11 - 15  Hours (0 to 23)

Possible date range: 1980-01-01 to 2107-12-31 (the 7-bit year offset ranges from 0 to 127)
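The bit layout can be unpacked with shift and mask operations. The following is a minimal sketch; the function names and the sample values 0x586F and 0x6CB5 are chosen for illustration only.

```python
def decode_dos_date(value):
    """Split a 16-bit DOS date into (year, month, day)."""
    day = value & 0x1F             # bits 0 - 4
    month = (value >> 5) & 0x0F    # bits 5 - 8
    year = 1980 + (value >> 9)     # bits 9 - 15, offset from 1980
    return year, month, day

def decode_dos_time(value):
    """Split a 16-bit DOS time into (hours, minutes, seconds)."""
    seconds = (value & 0x1F) * 2   # bits 0 - 4, stored in 2-second steps
    minutes = (value >> 5) & 0x3F  # bits 5 - 10
    hours = value >> 11            # bits 11 - 15
    return hours, minutes, seconds

print(decode_dos_date(0x586F))  # (2024, 3, 15)
print(decode_dos_time(0x6CB5))  # (13, 37, 42)

# The 7 bits for the year allow offsets from 0 to 127, i.e. 1980 to 2107.
print(decode_dos_date(0xFF9F))  # (2107, 12, 31)
```

Because the seconds are stored divided by 2, odd seconds cannot be represented.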

Unix timestamp

The Unix timestamp, also known as POSIX time, Unix epoch, C time, or time_t, is traditionally stored as a signed DWORD (4 bytes or 32 bits) and is used specifically in the UNIX environment. It indicates the number of seconds that have passed since 00:00:00 on January 1, 1970, and is usually expressed in UTC time.

Possible date range: 1970-01-01 00:00:00 to 2038-01-19 03:14:07 for 32-bit systems
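A Unix timestamp can be converted with the datetime module. This sketch adds the seconds to the epoch in UTC; the upper limit shown assumes a signed 32-bit value, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def unix_to_datetime(seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC into a datetime."""
    return UNIX_EPOCH + timedelta(seconds=seconds)

print(unix_to_datetime(0))          # 1970-01-01 00:00:00+00:00
print(unix_to_datetime(2**31 - 1))  # 2038-01-19 03:14:07+00:00
```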

Windows FILETIME

The Windows FILETIME is stored as a structure consisting of two DWORD values (32-bit unsigned integers), 8 bytes or 64 bits in total. Together they hold the number of 100-nanosecond intervals since January 1, 1601 (UTC).

Possible date range: 1601-01-01 00:00:00 to 30828-09-14 02:48:05 (for values up to 0x7FFFFFFFFFFFFFFF)
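The two DWORD values can be combined and converted as shown in the sketch below. The function name and the parameter names low and high are assumptions for this example. Since Python's datetime only supports years up to 9999 and microsecond resolution, the sketch does not cover the full FILETIME range or precision.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(low, high):
    """Combine the low and high DWORD and convert to a datetime (UTC)."""
    ticks = (high << 32) | low  # number of 100-nanosecond intervals
    # 10 intervals make one microsecond; the remainder is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# 0x019DB1DED53E8000 is the FILETIME value of the Unix epoch.
print(filetime_to_datetime(0xD53E8000, 0x019DB1DE))  # 1970-01-01 00:00:00+00:00
```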

Windows SYSTEMTIME

The Windows SYSTEMTIME is stored as a structure consisting of 8 WORD values. It contains each date and time component (year, month, day, day of the week, hour, minute, second, and millisecond) as a separate 16-bit integer. The time is specified either in the local time zone or in UTC, depending on the function called.

Possible date range: 1601-01-01 00:00:00.000 to 30827-12-31 23:59:59.999

Mach Absolute Time

Mach Absolute Time, also known as Mac Absolute Time, Apple Cocoa Core Data Timestamp, or CFAbsoluteTime, represents a point in time relative to the absolute reference time, January 1, 2001, 00:00:00 GMT. The time value is stored as a 32-bit signed integer. Time is measured by the number of seconds between the reference time and the specified time. Negative values indicate a time before the reference time, positive values indicate a time after the reference time.
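The conversion works like that of the Unix timestamp, only with a different reference time. The following sketch assumes a value in whole seconds as described above; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAC_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def mac_absolute_to_datetime(seconds):
    """Convert seconds relative to 2001-01-01 00:00:00 into a datetime."""
    return MAC_EPOCH + timedelta(seconds=seconds)

print(mac_absolute_to_datetime(0))           # 2001-01-01 00:00:00+00:00
print(mac_absolute_to_datetime(86400))       # 2001-01-02 00:00:00+00:00

# Negative values lie before the reference time.
print(mac_absolute_to_datetime(-978307200))  # 1970-01-01 00:00:00+00:00
```

The difference between the Unix epoch and this reference time is 978,307,200 seconds, so adding this number to a Mach Absolute Time yields the corresponding Unix timestamp.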

ISO 8601

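The ISO 8601 time format is a text-based representation of a date and time value. The format is "YYYY-MM-DD hh:mm:ss" (4-digit year, followed by the 2-digit month and day, as well as the hours, minutes, and seconds). Strictly speaking, the standard separates date and time with the letter "T", for example "2024-03-15T13:37:42"; the space is a widespread, more readable variant.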

Possible date range: 0001-01-01 00:00:00 to 9999-12-31 23:59:59
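Python's datetime type can write and read this text format directly. The sketch below uses a space as separator to match the format shown above; the sample date is arbitrary.

```python
from datetime import datetime

moment = datetime(2024, 3, 15, 13, 37, 42)

text = moment.isoformat(sep=" ")
print(text)  # 2024-03-15 13:37:42

parsed = datetime.fromisoformat(text)
print(parsed == moment)  # True

# The range supported by datetime matches the 4-digit year.
print(datetime.min.year, datetime.max.year)  # 1 9999
```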

Slack

File slack, RAM slack and drive slack

The storage space of a data storage device (a so-called block-oriented mass storage device) is divided into sectors. Typically, hard drives use 512 bytes, CDs and DVDs 2048 bytes, and newer hard drives and SSDs (solid-state drives) 4096 bytes per sector.

Clusters are groups of one to 128 sectors. Clusters are used to store files, with a file always starting at the beginning of a cluster.

The following table assumes that a cluster consists of 8 sectors and that the file size is not a multiple of the sector size, so that the file content ends within the 6th sector. Since data is written to the storage device sector by sector, the remaining bytes up to the end of this sector must be filled. Previously, random RAM contents were used for this purpose, which is why this area is called "RAM slack". The remaining sectors of the cluster form the drive slack. The drive slack is usually not overwritten and therefore often still contains data that was stored at this location before the file was saved and that belonged to a file that has since been deleted.

Sectors 1 - 5  | Sector 6                              | Sector 7 | Sector 8
File content   | Last part of file content | RAM slack | Drive slack
               |                           | File slack

File slack encompasses the area from the end of the file to the end of the last used cluster of the file (RAM slack + drive slack).
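The sizes of the individual slack areas can be calculated from the file size. This is only a sketch; the function name slack_sizes, the default values, and the sample file size of 2,700 bytes are assumptions for illustration.

```python
def slack_sizes(file_size, sector_size=512, sectors_per_cluster=8):
    """Return (RAM slack, drive slack, file slack) in bytes."""
    cluster_size = sector_size * sectors_per_cluster
    # Bytes from the end of the file to the end of its last sector.
    ram_slack = -file_size % sector_size
    # Bytes from the end of that sector to the end of the last cluster.
    drive_slack = -(file_size + ram_slack) % cluster_size
    return ram_slack, drive_slack, ram_slack + drive_slack

# A file of 2,700 bytes ends within the 6th sector of a 4,096-byte cluster.
print(slack_sizes(2700))  # (372, 1024, 1396)
```

A file whose size is an exact multiple of the cluster size has no file slack at all.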

File slack is sometimes referred to as file offset or slack space.

MFT slack

In NTFS, information about a file is stored in a file record. This contains data such as the file name, file times, file size, and position in the file system. At least 1 KB or the cluster size (usually 4 KB) is reserved for a file record. If the file to be stored is not larger than the unused space in the file record, the contents are stored directly in the file record. If the file is enlarged and there is no more space for it in the file record, the file is stored in the data area of the partition. The original file contents can then remain in the file record.

Partition slack

Partition slack is the unused space between two partitions and after the last partition up to the end of the disk. If data was already stored in these areas before partitioning, it may have been retained.
