Basics

Some basic topics such as number systems, character sets and encodings are covered here.

Number systems
Data sizes
- Bit
- Nibble
- Byte
- Word
- DWord
- QWord
Endianness / Byte order
- Intel / LSB / Little Endian
- Motorola / MSB / Big Endian
Character sets and character encodings
Encoding
Encoding and Data Format Overview
Date formats and time formats
Slack
Structure of URLs

Number systems

Decimal system

The decimal system is the standard system for representing numbers. The base of the decimal system is 10 and includes the digits 0 through 9 to represent numbers.

Binary system

The binary system, or dual system, is primarily used in electronic data processing (base 2). The two possible digits in a binary value are 0 and 1. By stringing the digits 0 and 1 together, any number can be expressed:

Number	Binary value
0	0
1	1
2	10
3	11
4	100
12	1100
64	1000000
128	10000000
255	11111111

Octal system

The octal system uses only the digits 0 through 7 to represent numbers (base 8). The decimal number 8 thus results in the octal value 10. This system is primarily used for file permissions in Linux-based operating systems and FTP programs.

Hexadecimal system

In the hexadecimal system, the digits 0 to 9 and the letters A to F are used to represent numbers (base 16). "A" corresponds to the number 10 and "F" to the number 15. Thus, the numbers from 0 to 255, which is the range of one byte, can be expressed with a two-digit value from "00" to "FF". Since a hexadecimal number cannot always be distinguished from a decimal number, it is usually marked with a leading "0x" or a trailing "h", for example 0x47 or 47h, and 0x00 or 0h.
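The conversions between the number systems above can be tried out with a few lines of Python. This is only a small sketch; the variable names are chosen for illustration.

```python
# Convert a decimal number into the other number systems.
number = 255

binary = format(number, "b")       # base 2
octal = format(number, "o")        # base 8
hexadecimal = format(number, "X")  # base 16, uppercase digits

print(binary, octal, hexadecimal)  # 11111111 377 FF

# int() parses a string in the given base back into a number.
from_binary = int("1100", 2)  # 12
from_octal = int("10", 8)     # 8
from_hex = int("47", 16)      # 71, the same value as the literal 0x47
```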

Data sizes

Bit

A bit (binary digit) is the smallest possible unit in information processing. A bit can only assume two states - 0 or 1 (see also binary system).

Nibble

A nibble consists of 4 bits (half a byte). However, this size specification is rarely used. A nibble can be represented with just one hexadecimal digit.

Byte

A byte consists of 8 bits and thus contains numbers in the range 0 to 255.

Word

A word consists of two bytes or 16 bits. Thus, numbers from 0 to 65,535 can be stored in a word. For example, the DOS time is stored in this data size to an accuracy of 2 seconds.

DWord

A DWord (Double Word) comprises four bytes or 32 bits. This results in a number range from 0 to 4,294,967,295. In addition to "normal" numbers, date and time information is also stored in this format.

QWord

A QWord (Quad Word) consists of eight bytes or 64 bits. This data size is used to store hard drive capacities or particularly precise time information, for example.

Schematic representations from bit to byte and from byte to QWord:

Bit	Bit	Bit	Bit	Bit	Bit	Bit	Bit
Nibble				Nibble
Byte

Byte	Byte	Byte	Byte	Byte	Byte	Byte	Byte
Word		Word		Word		Word
DWord				DWord
QWord
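The number ranges of the data sizes follow directly from the number of bits. The following sketch computes the largest unsigned value per size and splits a byte into its two nibbles; the helper name max_unsigned is an assumption made for this example.

```python
# Largest unsigned value that fits into a given number of bits: 2**bits - 1.
SIZES = {"Nibble": 4, "Byte": 8, "Word": 16, "DWord": 32, "QWord": 64}


def max_unsigned(bits: int) -> int:
    return (1 << bits) - 1


for name, bits in SIZES.items():
    print(f"{name:6} {bits:2} bits  0 to {max_unsigned(bits):,}")

# A byte consists of two nibbles, each shown as one hexadecimal digit.
byte = 0xA7
high_nibble = byte >> 4   # 0xA
low_nibble = byte & 0x0F  # 0x7
```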

Endianness / Byte Order

When saving or reading values that span several bytes, the order in which the bytes are stored is crucial: the most significant byte can come first or last. This order is called endianness or byte order. Two types are distinguished: Intel and Motorola byte order.

While DWord values that express a number are usually stored in Intel format, positional information, such as file offsets, is usually expressed in Motorola format.

Intel / LSB / Little Endian

The Intel byte order is also called LSB (least significant byte first) or little endian. In the Intel byte order, the least significant byte is stored first (leftmost position) and the most significant byte last (rightmost position). For example, the byte sequence 01 00 represents the word value 1, and the byte sequence 00 01 represents the value 256.

Motorola / MSB / Big Endian

The Motorola byte order is also called MSB (most significant byte first) or big endian. In Motorola byte order, the most significant byte is stored first (leftmost position) and the least significant byte last (rightmost position). Here, the byte sequence 00 01 represents the word value 1, and the byte sequence 01 00 represents the value 256.
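The effect of the byte order can be checked by interpreting the same two bytes in both ways. This is a minimal sketch using Python's built-in int.from_bytes and the struct module; the sample values are arbitrary.

```python
import struct

data = bytes([0x01, 0x00])

# The same two bytes give different numbers depending on the byte order.
little = int.from_bytes(data, "little")  # Intel: first byte is least significant
big = int.from_bytes(data, "big")        # Motorola: first byte is most significant
print(little, big)  # 1 256

# struct uses "<" for little endian and ">" for big endian; "I" is a 32-bit DWord.
dword_le = struct.pack("<I", 0x12345678)
dword_be = struct.pack(">I", 0x12345678)
print(dword_le.hex(" "), "|", dword_be.hex(" "))  # 78 56 34 12 | 12 34 56 78
```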

Character sets and character encodings

A character set specifies which characters can be used. These include letters, numbers, punctuation, and special characters. In a character encoding, each character is assigned a unique number, called a character code. The character code usually starts at 0 (zero).

Types of character sets

Single-byte character sets (SBCS) store a character in one byte. Examples include ASCII, MS-DOS, and Windows code pages, as well as ISO-8859 character sets.

Multibyte character sets (MBCS) store a character in a varying number of bytes. These include UTF-8, UTF-16, and Big5.

Other character sets use a fixed number of bytes per character, such as 2 bytes for UCS-2 or 4 bytes for UTF-32.

UTF-16 does not use a fixed number of bytes: most characters occupy 2 bytes, but some require 4 bytes.
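How many bytes a character occupies in the different encodings can be measured directly. The sketch below uses Python's built-in codecs; the function name byte_lengths and the sample characters are chosen for illustration only.

```python
# Number of bytes one character occupies in several Unicode encodings.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def byte_lengths(character: str) -> dict:
    return {enc: len(character.encode(enc)) for enc in ENCODINGS}


for ch in ("A", "Ä", "€", "\U0001F9F8"):
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 always needs 4 bytes, whereas UTF-8 and UTF-16 grow with the code point.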

Common character sets

ASCII

The ASCII character set contains 128 characters. Seven bits are required to store them. The 8th bit is not used in ASCII. In addition to various special characters, it primarily includes uppercase and lowercase letters, numbers, and punctuation marks.

ANSI / 8-bit character sets

8-bit character sets are also known as ANSI character sets. These extend the ASCII character set by also using the 8th bit, allowing a total of 256 characters to be represented. The additional characters are usually used for national special characters and other symbols.

MS-DOS codepages

437	English
708	Arabic (ASMO)
720	Arabic (Microsoft)
737	Greek
775	Baltic
850	Western European
852	Central European
855	Cyrillic
857	Turkish
858	Western European with Euro
860	Portuguese
861	Icelandic
862	Hebrew
863	Canadian French
864	Arabic (IBM)
865	Nordic
866	Russian
869	Greek

Windows codepages

874	Thai
932	Japanese
936	Simplified Chinese
949	Korean
950	Traditional Chinese
1200	Unicode UTF-16 LE
1201	Unicode UTF-16 BE
1250	Central European
1251	Cyrillic
1252	Western European
1253	Greek
1254	Turkish
1255	Hebrew
1256	Arabic
1257	Baltic
1258	Vietnamese
12000	Unicode UTF-32 LE
12001	Unicode UTF-32 BE
65000	Unicode UTF-7
65001	Unicode UTF-8

ISO 8859

ISO 8859-1	Latin-1, Western European
ISO 8859-2	Latin-2, Central European
ISO 8859-3	Latin-3, Southern European
ISO 8859-4	Latin-4, Northern European
ISO 8859-5	Cyrillic
ISO 8859-6	Arabic
ISO 8859-7	Greek
ISO 8859-8	Hebrew
ISO 8859-9	Latin-5, Turkish
ISO 8859-10	Latin-6, Nordic
ISO 8859-11	Thai
ISO 8859-12	(not assigned)
ISO 8859-13	Latin-7, Baltic
ISO 8859-14	Latin-8, Celtic
ISO 8859-15	Latin-9, Western European
ISO 8859-16	Latin-10, Southeastern European
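Because all 8-bit character sets share the same 256 byte values, one and the same byte stands for different characters depending on the code page. The following sketch decodes the byte 0xE4 with a few of the code pages listed above, using the codec names that Python provides for them.

```python
# The same byte value decoded with different 8-bit character sets.
raw = b"\xe4"

for codec in ("cp1252", "iso-8859-1", "cp437", "iso-8859-5"):
    character = raw.decode(codec)
    print(f"{codec:11} U+{ord(character):04X} {character}")

# The reverse direction: "ä" has a different byte value in MS-DOS code page 437.
dos_byte = "ä".encode("cp437")
windows_byte = "ä".encode("cp1252")
```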

Unicode

Unicode is a character set that assigns a unique number (code point) to each character. How many bytes a character occupies depends on the encoding used. The Unicode character set primarily uses the UTF-8 and UTF-16 encodings to store characters. UTF stands for Unicode Transformation Format.

UTF-8

UTF-8 uses a sequence of 8-bit numbers (one byte) to store characters. ASCII characters are stored unchanged. This means that letters, numbers, and punctuation marks only require one byte per character in UTF-8 encoding. All other characters, such as national symbols and characters from other writing systems, require two to four bytes per character.

In binary notation, all 1-byte characters begin with a zero. For multi-byte characters, the first byte begins with binary ones, followed by a binary zero. The number of binary ones corresponds to the number of bytes per character. All following bytes of a multi-byte character begin with the bit sequence 10.

Examples:

Umlaut A (Ä)	11000011 10000100
Euro character (€)	11100010 10000010 10101100
Teddy bear (U+1F9F8)	11110000 10011111 10100111 10111000
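The bit patterns in the examples can be reproduced with a small hand-written encoder. This is a simplified sketch for illustration (it does not reject invalid code points such as surrogates); the function names utf8_encode and bits are assumptions, and the result is compared with Python's built-in UTF-8 codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (simplified, no validation)."""
    if code_point < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def bits(data: bytes) -> str:
    """Show each byte as 8 binary digits."""
    return " ".join(f"{b:08b}" for b in data)


for cp in (0x41, 0xC4, 0x20AC, 0x1F9F8):
    print(f"U+{cp:04X}", bits(utf8_encode(cp)))
```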

If a file saved in UTF-8 is interpreted as ISO 8859-1, UTF-8 characters will not be displayed correctly. For example, the German umlauts "ä ö ü" would be displayed as "Ã¤ Ã¶ Ã¼". If text in ISO 8859-1 is displayed as UTF-8, a question mark or the "�" character is displayed for all invalid characters.
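Both misinterpretations can be reproduced by encoding a text with one character set and decoding it with the other. A short sketch using Python's built-in codecs:

```python
text = "ä ö ü"

# UTF-8 bytes read as ISO 8859-1: every umlaut turns into two characters.
misread = text.encode("utf-8").decode("iso-8859-1")
print(misread)  # Ã¤ Ã¶ Ã¼

# ISO 8859-1 bytes read as UTF-8: the bytes are invalid, so the decoder
# substitutes the replacement character U+FFFD when errors="replace" is set.
replaced = "äöü".encode("iso-8859-1").decode("utf-8", errors="replace")
print(replaced)
```

As long as no bytes were lost, the first kind of damage can be undone by reversing the two steps: misread.encode("iso-8859-1").decode("utf-8").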

UTF-16

UTF-16 uses a sequence of 16-bit numbers (two bytes or 1 word) to store characters. All characters from U+0000 to U+FFFF occupy 2 bytes.

The characters U+10000 to U+10FFFF are each stored in 4 bytes. The number 65536 (0x10000) is subtracted from the character code. The resulting 20-bit number is divided into two 10-bit blocks. The first block, with the 10 most significant bits, is preceded by the bit sequence 110110. These 16 bits are called the high surrogate. The second block, with the 10 least significant bits, is preceded by the bit sequence 110111. These 16 bits are called the low surrogate.

20 bits (0 to 19)
19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1st block / higher-order bits										2nd block / low-order bits

110110 + bits of the 1st block = high surrogates
110111 + bits of the 2nd block = low surrogates
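The surrogate calculation can be written down in a few lines. In this sketch the function name to_surrogates is an assumption; the prefixes 110110 and 110111 correspond to the 16-bit values 0xD800 and 0xDC00.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF need surrogates")
    value = code_point - 0x10000     # 20-bit number
    high = 0xD800 | (value >> 10)    # 110110 + 10 most significant bits
    low = 0xDC00 | (value & 0x3FF)   # 110111 + 10 least significant bits
    return high, low


high, low = to_surrogates(0x1F9F8)
print(f"{high:04X} {low:04X}")  # D83E DDF8
```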

Endianness / Byte Order

The byte order (endianness) determines the position of the 8 bits of an ASCII character within the 16 bits of the UTF-16 word.

With Big Endian (UTF-16BE), a 0 byte is prefixed to the ASCII character.
With Little Endian (UTF-16LE), a 0 byte is appended to the ASCII character.

The high surrogate word always precedes the low surrogate word, regardless of the byte order.
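A quick check with Python's codecs utf-16-be and utf-16-le shows the position of the 0 byte and the order of the surrogates. The sample characters are arbitrary.

```python
# ASCII character: the 0 byte comes first (BE) or last (LE).
a_be = "A".encode("utf-16-be")  # 00 41
a_le = "A".encode("utf-16-le")  # 41 00

# Character above U+FFFF: the bytes inside each 16-bit word are swapped,
# but the high surrogate word still comes before the low surrogate word.
bear_be = "\U0001F9F8".encode("utf-16-be")  # D8 3E DD F8
bear_le = "\U0001F9F8".encode("utf-16-le")  # 3E D8 F8 DD

for data in (a_be, a_le, bear_be, bear_le):
    print(data.hex(" ").upper())
```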

Byte Order Mark

The Byte Order Mark (BOM for short) is located at the beginning of a file and indicates the character encoding used in the subsequent text.

Encoding	Byte Order Mark (hexadecimal)
UTF-8	EF BB BF
UTF-16 BE	FE FF
UTF-16 LE	FF FE
UTF-32 BE	00 00 FE FF
UTF-32 LE	FF FE 00 00
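A BOM can be detected by comparing the first bytes of a file with the values from the table. The sketch below uses the BOM constants from Python's codecs module; the function name detect_bom is an assumption. The UTF-32 marks are tested first, because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.

```python
import codecs

# Longer marks first: FF FE 00 00 (UTF-32 LE) also starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]


def detect_bom(data: bytes):
    """Return the encoding name indicated by a BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfText"))  # UTF-8
```

A UTF-16 LE text whose first character is U+0000 would be reported as UTF-32 LE by this simple check, so the result is only an indication.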

Tip:
The pages "ASCII and ANSI Character Table" and "Unicode Character Tables" contain character tables for ASCII characters and the most important Unicode characters.

Encoding

In certain situations, binary data must be converted into other characters. This can be for technical reasons, such as data transmission over the Internet, or for practical reasons, such as with a hexadecimal editor.

Conversion to another number system

Encoding is usually just the conversion from one number system to another. For example, from a decimal ASCII/ANSI character code to the hexadecimal system.

The decimal character codes of the word "Test" would be represented as follows in different number systems:

Number system	Base	Representation of the word "Test"
Decimal	10	84 101 115 116
Hexadecimal	16	54 65 73 74
Octal	8	124 145 163 164
Binary	2	01010100 01100101 01110011 01110100

In hexadecimal notation, the separators are sometimes omitted, e.g. "54657374".
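The representations of the word "Test" can be generated from its character codes. A minimal sketch with Python's string formatting; the variable names are chosen for illustration.

```python
word = "Test"
codes = word.encode("ascii")  # character codes 84, 101, 115, 116

decimal = " ".join(str(b) for b in codes)
hexadecimal = " ".join(f"{b:02X}" for b in codes)
octal = " ".join(f"{b:o}" for b in codes)
binary = " ".join(f"{b:08b}" for b in codes)

print(decimal, hexadecimal, octal, binary, sep="\n")

# Hexadecimal notation without separators, and the way back to text.
compact = codes.hex().upper()
restored = bytes.fromhex(compact).decode("ascii")
```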

Base64

Base64 (B64) is a very common encoding. The base of this encoding is 64, which means it uses 64 characters to represent data. These include all lowercase and uppercase letters, the numbers 0 through 9, the plus sign (+), and the forward slash (/). During encoding, every 3 bytes of binary data are converted into 4 encoded characters. If the length of the data is not a multiple of 3, the result is padded with equals signs (=) to a multiple of 4 characters.

Depending on the application, the Base64 alphabet, especially the order of letters and numbers, can vary.

Example:

This is a Test

As Base64, this results in:

VGhpcyBpcyBhIFRlc3Q=
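The example can be reproduced with Python's base64 module. To show the conversion of 3 bytes into 4 characters, the sketch also encodes a single group by hand; the helper encode_group and the constant ALPHABET are assumptions made for this illustration and use the standard alphabet.

```python
import base64
import string

encoded = base64.b64encode(b"This is a Test").decode("ascii")
print(encoded)  # VGhpcyBpcyBhIFRlc3Q=

# Standard Base64 alphabet: A-Z, a-z, 0-9, "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(group: bytes) -> str:
    """Encode exactly 3 bytes (24 bits) as 4 characters of 6 bits each."""
    if len(group) != 3:
        raise ValueError("a full group has exactly 3 bytes")
    number = int.from_bytes(group, "big")
    return "".join(ALPHABET[(number >> shift) & 0x3F] for shift in (18, 12, 6, 0))


print(encode_group(b"Thi"))  # VGhp
```

The text has 14 bytes, so the last group contains only 2 bytes and the result ends with one equals sign.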

Quoted-Printable

Quoted-Printable (QP) is mostly used on the internet, for example, in e-mails. It encodes some ASCII characters (0-127) and all characters from 128-255. An encoded character consists of an equals sign (=) followed by the character's two-digit hexadecimal value. An equals sign at the end of a line marks a line break that was inserted during encoding; it is removed again during decoding.

Example:

Ein großer Test mit äöü.
5 + 10 = 15
Leerzeichen am Ende

Encoded as Quoted-Printable, this results in:

Ein gro=DFer Test=
 mit =E4=F6=FC.
5 + 10 =3D 15
Leerzeichen am Ende=20
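The encoded example can be decoded with Python's quopri module. The sketch assumes that the original text was stored as ISO 8859-1, which matches the hexadecimal values DF, E4, F6 and FC used for the German characters.

```python
import quopri

encoded = (
    b"Ein gro=DFer Test=\n"
    b" mit =E4=F6=FC.\n"
    b"5 + 10 =3D 15\n"
    b"Leerzeichen am Ende=20"
)

# The "=" at the end of the first line is a soft line break and disappears.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")
print(decoded)

# Encoding works on bytes, so the text is converted to ISO 8859-1 first.
raw = "groß = gross".encode("iso-8859-1")
reencoded = quopri.encodestring(raw)
```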

URL encoding

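To pass reserved or invalid characters in a URL, they are encoded using a percent sign followed by the two-digit hexadecimal value of the character. Characters outside the ASCII range are usually converted to UTF-8 first, and each resulting byte is encoded separately.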

Example:

.../ein test/äöü/groß/

Resulting in URL encoding:

.../ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/
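The example can be reproduced with the functions quote and unquote from Python's urllib.parse module. By default, quote encodes the text as UTF-8 and leaves the slash unchanged. The leading "..." of the example is left out here.

```python
from urllib.parse import quote, unquote

path = "/ein test/äöü/groß/"

encoded = quote(path)
print(encoded)  # /ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/

decoded = unquote(encoded)
```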

Encoding and Data Format Overview

This table contains encodings, data formats, and representations, including a brief description and an example. Based on the format and characters used, the type of encoding can sometimes be determined or at least narrowed down.

Encoding or Format	Description and Usage	Example
Base32	Data transmission as 7-bit ASCII characters. No special characters (usually A-Z, 2-7) are used.	`I5QWS2TJNYXGC5A=`
Base45	Encodes byte data using a restricted set of symbols. Used in QR codes.	`319VEDZED%%5Q2`
Base58	Data transmission as 7-bit ASCII characters. Without the use of similar characters such as I, l, 0, and O.	`uhTz6mAX8qMM`
Base62	Encodes byte data using a limited character set (0-9, A-Z, a-z).	`PIqwPVLBEtRk`
Base64	Data transmission as 7-bit ASCII characters. Usually for binary data.	`R2FpamluLmF0Cg==`
Base85	Also called Ascii85. A notation for encoding arbitrary byte data. It is usually more efficient than Base64. Used in PostScript and PDF.	`7q$+HBl5P3F8`
Base92	Encodes byte data using a restricted set of symbols.	`;+1s]]cDsutW`
Binary	Representation of character codes based on 2 (0 and 1).	`01000111 01100001 01101001 01101010 01101001 01101110`
Decimal	Representation of character code based on 10 (0-9).	`71 97 105 106 105 110 46 97 116`
Hashes	A mathematically calculated checksum for any data. It is usually represented in hexadecimal form. The length varies depending on the method.	`d47ef03e79e7eb76b4c9b6deac8b61c1`
Hexadecimal	Representation of character codes based on 16 (0-9, A-F).	`47 61 69 6A 69 6E 2E 61 74`
HTML Entities	Encoding for special characters in HTML pages.	`<Fünf> &excl;= ∆`
Octal	Representation of character codes based on 8 (0-7). Rare, except for Linux file permissions.	`107 141 151 152 151 156 56 141 164`
Punycode	Encoding of non-ASCII characters, mainly for IDNs (Internationalized Domains). Encoded part begins with "xn--".	`obst.xn--pfel-koa.example.com`
Quoted-Printable	Text transmission as 7-bit ASCII characters.	`F=FCnf =D7 drei =3D 15`
ROT13	A simple Caesar substitution cipher that shifts letters by 13 positions. Variants with a different number of shifts also exist. Sometimes used to display hints or answers in a way that isn't immediately legible.	`Tnvwva.ng`
URL-Encoding	Data encoding that can be used as a parameter in a URL.	`https%3A%2F%2Fwww%2Egaijin%2Eat%2F`
Vigenère	Encrypts alphabetic text using a number of different Caesar ciphers based on the letters of a keyword.	`Maqsqa.gt`
XOR	Binary exclusive-or operation of the bit values with an arbitrary key.	`Dbjijm-bw`
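Several examples in the table can be reproduced with Python's standard library. The sample text "Gaijin.at" and the XOR key 0x03 are not stated in the table; they are inferred here by decoding the examples and should be treated as assumptions.

```python
import base64
import codecs

text = "Gaijin.at"  # assumed sample text, inferred from the table examples
data = text.encode("ascii")

base32 = base64.b32encode(data).decode("ascii")
decimal = " ".join(str(b) for b in data)
hexadecimal = data.hex(" ").upper()
rot13 = codecs.encode(text, "rot13")

# XOR of every byte with a single-byte key (0x03 is an inferred key).
xored = bytes(b ^ 0x03 for b in data).decode("ascii")

for label, value in [("Base32", base32), ("Decimal", decimal),
                     ("Hexadecimal", hexadecimal), ("ROT13", rot13),
                     ("XOR", xored)]:
    print(f"{label:12} {value}")
```

Applying ROT13 or the XOR operation a second time restores the original text.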

Date formats and time formats

DOS date / DOS time

Under DOS (FAT-16), a date value and a time value are each stored in 2 bytes (1 WORD or 16 bits). The bits are divided as follows:

DOS date:

Bits	Description
0 - 4	Day of the month (1 to 31)
5 - 8	Month (1 = January, 2 = February, etc.)
9 - 15	Year offset starting from 1980 (0 = 1980, 1 = 1981, etc.)

DOS time:

Bits	Description
0 - 4	Seconds divided by 2 (0 to 29)
5 - 10	Minutes (0 to 59)
11 - 15	Hours (0 to 23)

Possible date range: 1980-01-01 to 2107-12-31 (the 7-bit year offset covers the values 0 to 127)
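The bit layout from the two tables can be decoded with shift and mask operations. In this sketch the function name decode_dos_datetime and the sample values are assumptions chosen for illustration.

```python
from datetime import datetime


def decode_dos_datetime(date_word: int, time_word: int) -> datetime:
    """Decode a 16-bit DOS date and a 16-bit DOS time."""
    day = date_word & 0x1F             # bits 0 - 4
    month = (date_word >> 5) & 0x0F    # bits 5 - 8
    year = 1980 + (date_word >> 9)     # bits 9 - 15
    second = (time_word & 0x1F) * 2    # bits 0 - 4, stored in 2-second steps
    minute = (time_word >> 5) & 0x3F   # bits 5 - 10
    hour = time_word >> 11             # bits 11 - 15
    return datetime(year, month, day, hour, minute, second)


# Build the sample words for 2024-03-15 13:45:30 and decode them again.
date_word = ((2024 - 1980) << 9) | (3 << 5) | 15
time_word = (13 << 11) | (45 << 5) | (30 // 2)
print(decode_dos_datetime(date_word, time_word))  # 2024-03-15 13:45:30
```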

Unix timestamp

The Unix timestamp, also known as POSIX time, Unix epoch, C time, or time_t, was traditionally stored as a signed DWORD (4 bytes or 32 bits) and is used primarily in the UNIX environment; current systems usually use 64 bits. It indicates the number of seconds that have passed since 00:00:00 UTC on January 1, 1970.

Possible date range: 1970-01-01 00:00:00 to 2038-01-19 03:14:07 for non-negative signed 32-bit values
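The upper limit can be verified by converting the largest signed 32-bit value. This is a small sketch; the function name unix_to_datetime is an assumption, and the sample bytes are read as a little-endian signed DWORD.

```python
import struct
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_to_datetime(seconds: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00:00 UTC to a datetime."""
    return UNIX_EPOCH + timedelta(seconds=seconds)


# FF FF FF 7F is the largest signed 32-bit value in little-endian order.
raw = bytes.fromhex("ffffff7f")
(value,) = struct.unpack("<i", raw)
latest = unix_to_datetime(value)
print(value, latest)  # 2147483647 2038-01-19 03:14:07+00:00
```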

Windows FILETIME

The Windows FILETIME is stored as a structure consisting of two DWORD values (32-bit unsigned integers), which together occupy 8 bytes or 64 bits. It corresponds to the number of 100-nanosecond intervals since January 1, 1601 (UTC).

Possible date range: 1601-01-01 00:00:00 to 30828-09-14 02:48:05 (largest positive signed 64-bit value)
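A FILETIME value can be converted by combining the two DWORDs and adding the intervals to the reference date. This sketch assumes the usual little-endian storage with the low DWORD first; the function name filetime_to_datetime is chosen for illustration. Python's datetime only reaches the year 9999 and has microsecond resolution, so the last digit of the value is dropped.

```python
import struct
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(data: bytes) -> datetime:
    """Convert 8 FILETIME bytes (low DWORD first) to a datetime."""
    low, high = struct.unpack("<II", data[:8])
    intervals = (high << 32) | low  # number of 100-nanosecond intervals
    return FILETIME_EPOCH + timedelta(microseconds=intervals // 10)


# 116444736000000000 intervals lie between 1601-01-01 and 1970-01-01.
sample = struct.pack("<Q", 116444736000000000)
print(filetime_to_datetime(sample))  # 1970-01-01 00:00:00+00:00
```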

Windows SYSTEMTIME

The Windows SYSTEMTIME is stored as a structure consisting of 8 WORD values. It contains each date and time component (year, month, day, day of the week, hour, minute, second, and millisecond) as a separate 16-bit integer. The time is specified either in the local time zone or in UTC, depending on the function called.

Possible date range: 1601-01-01 00:00:00.000 to 30827-12-31 23:59:59.999
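The eight WORD values can be read with a single struct call. Note that in the Windows API definition of SYSTEMTIME the day of the week comes before the day of the month. The function name parse_systemtime and the sample values below are assumptions for this sketch.

```python
import struct
from datetime import datetime


def parse_systemtime(data: bytes) -> datetime:
    """Read a 16-byte SYSTEMTIME structure (eight little-endian WORDs)."""
    (year, month, _day_of_week, day,
     hour, minute, second, milliseconds) = struct.unpack("<8H", data[:16])
    return datetime(year, month, day, hour, minute, second, milliseconds * 1000)


# Sample: Friday (5), 2024-03-15 13:45:30.250
sample = struct.pack("<8H", 2024, 3, 5, 15, 13, 45, 30, 250)
print(parse_systemtime(sample))  # 2024-03-15 13:45:30.250000
```

The structure itself carries no time zone, so the sketch returns a datetime without time zone information.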

Mach Absolute Time

Mach Absolute Time, also known as Mac Absolute Time, Apple Cocoa Core Data Timestamp, or CFAbsoluteTime, represents a point in time relative to the absolute reference time, January 1, 2001, 00:00:00 GMT. The time value is often stored as a 32-bit signed integer; Apple's CFAbsoluteTime type itself is a double-precision floating-point number and can also hold fractions of a second. Time is measured as the number of seconds between the reference time and the specified time. Negative values indicate a time before the reference time, positive values indicate a time after the reference time.
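The conversion only differs from the Unix timestamp in its reference date. A minimal sketch; the function name mac_absolute_to_datetime is an assumption.

```python
from datetime import datetime, timedelta, timezone

MAC_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def mac_absolute_to_datetime(seconds: float) -> datetime:
    """Convert seconds relative to 2001-01-01 00:00:00 to a datetime."""
    return MAC_EPOCH + timedelta(seconds=seconds)


# Offset between the two reference dates in seconds.
offset = int((MAC_EPOCH - UNIX_EPOCH).total_seconds())
print(offset)                            # 978307200
print(mac_absolute_to_datetime(-86400))  # 2000-12-31 00:00:00+00:00
```

Adding the offset of 978307200 seconds to a Mac absolute time gives the corresponding Unix timestamp.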

ISO 8601

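The ISO 8601 time format is a text-based representation of a date and time value. A common form is "YYYY-MM-DDThh:mm:ss" (4-digit year, followed by the 2-digit month and day, as well as the hours, minutes, and seconds). The standard separates date and time with the letter "T"; in practice, a space is often used instead ("YYYY-MM-DD hh:mm:ss").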

Possible date range: 0001-01-01 00:00:00 to 9999-12-31 23:59:59
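Python's datetime module can read and write this format. The sketch also shows a practical property of the notation: because the largest unit comes first, sorting the text values also sorts them chronologically. The sample values are arbitrary.

```python
from datetime import datetime

parsed = datetime.fromisoformat("2024-03-15 13:45:30")

with_t = parsed.isoformat()             # separator "T"
with_space = parsed.isoformat(sep=" ")  # separator " "
print(with_t, "|", with_space)

# Text order equals chronological order.
stamps = ["2024-03-15 13:45:30", "1999-12-31 23:59:59", "2024-03-02 08:00:00"]
by_text = sorted(stamps)
by_time = sorted(stamps, key=datetime.fromisoformat)
```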

Slack

File slack, RAM slack and drive slack

The storage space of a data storage device (a so-called block-oriented mass storage device) is divided into sectors. Typically, hard drives use 512 bytes, CDs and DVDs 2048 bytes, and newer hard drives and SSDs (solid-state drives) 4096 bytes per sector.

Clusters are groups of one to 128 sectors. Clusters are used to store files, with a file always starting at the beginning of a cluster.

The following table assumes that a cluster consists of 8 sectors and that the file size is not a multiple of the sector size, so that the file ends within the 6th sector. Since data is written to the storage device sector by sector, the remaining bytes must be filled up to the end of the sector. Previously, random RAM contents were used for this purpose, which is why this area is called "RAM slack". The drive slack is usually not overwritten and typically still contains the data that was stored at this location before the file was saved, that is, data belonging to a file that has since been deleted.

Sectors 1 - 5	Sector 6 (used part)	Sector 6 (remainder)	Sectors 7 - 8
File content	Last part of file content	RAM slack	Drive slack
File content	Last part of file content	File slack (RAM slack + drive slack)

File slack encompasses the area from the end of the file to the end of the last used cluster of the file (RAM slack + drive slack).
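The sizes of the individual slack areas follow from the file size, the sector size and the cluster size. The sketch below uses the values of the example (512-byte sectors, 8 sectors per cluster); the function name slack_sizes and the sample file size are assumptions.

```python
def slack_sizes(file_size: int, sector_size: int = 512,
                sectors_per_cluster: int = 8) -> tuple:
    """Return (RAM slack, drive slack, file slack) in bytes."""
    cluster_size = sector_size * sectors_per_cluster
    # Bytes from the end of the file to the end of its last sector.
    ram_slack = -file_size % sector_size
    # Space reserved for the file: whole clusters, rounded up.
    allocated = -(-file_size // cluster_size) * cluster_size
    file_slack = allocated - file_size
    drive_slack = file_slack - ram_slack
    return ram_slack, drive_slack, file_slack


# A file that fills 5 sectors and 100 bytes of the 6th sector.
ram, drive, total = slack_sizes(5 * 512 + 100)
print(ram, drive, total)  # 412 1024 1436
```

In this case the drive slack consists of the two unused sectors 7 and 8.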

File slack is sometimes referred to as file offset or slack space.

MFT slack

In NTFS, information about a file is stored in a file record. This contains data such as the file name, file times, file size, and position in the file system. At least 1 KB or the cluster size (usually 4 KB) is reserved for a file record. If the file to be stored is not larger than the unused space in the file record, the contents are stored directly in the file record. If the file is enlarged and there is no more space for it in the file record, the file is stored in the data area of the partition. The original file contents can then remain in the file record.

Partition slack

Partition slack is the unused space between two partitions and after the last partition up to the end of the disk. If data was already stored in these areas before partitioning, it may have been retained.

Structure of URLs

A URL (Uniform Resource Locator) contains information about the protocol and the exact location of a resource on the Internet. The URL is structured differently for different protocols. In addition to http and https, ftp, whois, telnet, news, file, and mailto are also commonly used. This section primarily discusses the URL for HTTP(S).

Structure

Parts of a URL:

                authority
        ┌───────────┴────────────┐
https://tester@www.example.com:443/news/articles?page=3&lang=en#top
└─┬─┘   └─┬──┘ └──────┬──────┘ └┬┘└─────┬──────┘ └─────┬──────┘ └┬┘
scheme  userinfo     host      port    path          query     fragment
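The parts shown in the diagram can be extracted with the function urlsplit from Python's urllib.parse module. A short sketch using the URL from the diagram:

```python
from urllib.parse import urlsplit

url = "https://tester@www.example.com:443/news/articles?page=3&lang=en#top"
parts = urlsplit(url)

print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)
print("userinfo: ", parts.username)
print("host:     ", parts.hostname)
print("port:     ", parts.port)
print("path:     ", parts.path)
print("query:    ", parts.query)
print("fragment: ", parts.fragment)
```

urlsplit calls the authority "netloc" and returns the port as a number.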

Scheme

The scheme refers to the protocol and must be specified in a URL. The scheme is followed by a colon and then the scheme-specific parts of the URL.

Examples of other protocols:

mailto:recipient@example.com
ftp://ftp.example.com
news:comp.programming
telnet://example.com
file:///directory/file.txt

User Information (userinfo)

A URL can contain a username and a password separated by a colon. For security reasons, user credentials should no longer be used for HTTP/HTTPS. At the very least, you should avoid specifying the password. A URL with just the username would be, for example, "https://tester@www.example.com". The user credentials are followed by an at sign (@).

Host

The host is the hostname, IPv4, or IPv6 address of the request destination. Specifying the host is required. The hostname itself is structured as follows:

Subdomain . Second-level domain . Top-level domain

The "domain" is usually composed of the second-level domain and the top-level domain, as in "example.com".

In some cases, the second-level domain is a ccSLD (country code second-level domain). For some country-specific top-level domains, there are special second-level domains (suffixes), such as ".co.at", ".or.at", or ".gv.at". Only certain groups or authorities can register third-level domains under these. For example, ".gv.at" is used only for government agencies.

Port

After the host, a port can be specified, separated by a colon. This is usually only required if it differs from the default port for the protocol in question.

Path

Paths begin with a slash. Then come the directories, and finally the filename. The individual parts of the path are separated by a slash. For a directory, the path ends with a slash.

A path doesn't necessarily have to consist of directories and files, even if it appears that way. Various server mechanisms can redirect requests to any file.

Parameters (query)

Parameters can be passed after the path; they are introduced by a question mark. Multiple parameters are separated by an ampersand. Each parameter consists of a name and a value, separated by an equals sign. Example: "?parameter1=value1&parameter2=value2".
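The query can be split into names and values with parse_qs and built with urlencode, both from Python's urllib.parse module. In this sketch the parameter names are taken from the example; the values page and lang are arbitrary.

```python
from urllib.parse import parse_qs, urlencode

query = "parameter1=value1&parameter2=value2"

# Every name maps to a list, because a name may occur more than once.
parameters = parse_qs(query)
print(parameters)

# The reverse direction: build a query string from names and values.
built = urlencode({"page": 3, "lang": "en"})
print("?" + built)  # ?page=3&lang=en
```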

Jump Mark (fragment)

After a hash mark (#), the name of a jump target within the HTML document can be specified. Jump targets are defined in HTML using the "id" attribute. This part is sometimes called the "hash".
