Basics
Some basic topics such as number systems, character sets and encodings are covered here.
Contents
- Number systems
- Data sizes
- Endianness / Byte order
- Character sets and character encodings
- Encoding
- Encoding and Data Format Overview
- Date formats and time formats
- Slack
- Structure of URLs
Number systems
Decimal system
The decimal system is the standard system for representing numbers. Its base is 10, and it uses the digits 0 through 9.
Binary system
The binary system, or dual system, is primarily used in electronic data processing (base 2). The two possible digits in a binary value are 0 and 1. By stringing the digits 0 and 1 together, any number can be expressed:
Number | Binary value |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
12 | 1100 |
64 | 1000000 |
128 | 10000000 |
255 | 11111111 |
Octal system
The octal system uses only the digits 0 through 7 to represent numbers (base 8). The decimal number 8 thus results in the octal value 10. This system is primarily used for file permissions in Linux-based operating systems and FTP programs.
Hexadecimal system
In the hexadecimal system, the digits 0 to 9 and the letters A to F are used to represent numbers (base 16). "A" corresponds to the number 10 and "F" to the number 15. Thus, numbers from 0 to 255, which corresponds to one byte, can be expressed with a two-digit value, from "00" to "FF". Since a hexadecimal number cannot always be distinguished from a decimal number, it is usually marked with a leading "0x" or a trailing "h", for example 0x47 or 47h, and 0x00 or 0h.
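The conversions between the four number systems can be tried out with a few lines of Python. This is only a small sketch; the variable names are chosen for illustration.

```python
# Convert one value between the number systems described above.
value = 255

binary = format(value, "b")       # base 2
octal = format(value, "o")        # base 8
hexadecimal = format(value, "X")  # base 16, uppercase letters

# int() parses a string in a given base back into a number.
assert int(binary, 2) == int(octal, 8) == int(hexadecimal, 16) == value

# With base 0, int() recognizes prefixes such as "0x".
from_prefixed = int("0x47", 0)

print(binary, octal, hexadecimal, from_prefixed)  # 11111111 377 FF 71
```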
Data sizes
Bit
A bit (binary digit) is the smallest possible unit in information processing. A bit can only assume two states - 0 or 1 (see also binary system).
Nibble
A nibble consists of 4 bits (half a byte). However, this size specification is rarely used. A nibble can be represented with just one hexadecimal digit.
Byte
A byte consists of 8 bits and thus contains numbers in the range 0 to 255.
Word
A word consists of two bytes or 16 bits. Thus, numbers from 0 to 65,535 can be stored in a word. For example, the DOS time is stored in this data size to an accuracy of 2 seconds.
DWord
A DWord (Double Word) comprises four bytes or 32 bits. This results in a number range from 0 to 4,294,967,295. In addition to "normal" numbers, date and time information is also stored in this format.
QWord
A QWord (Quad Word) consists of eight bytes or 64 bits. This data size is used to store hard drive capacities or particularly precise time information, for example.
Schematic representations from bit to byte and from byte to QWord:
Bit | Bit | Bit | Bit | Bit | Bit | Bit | Bit |
Nibble | Nibble | ||||||
Byte |
Byte | Byte | Byte | Byte | Byte | Byte | Byte | Byte |
Word | Word | Word | Word | ||||
DWord | DWord | ||||||
QWord |
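The number ranges given for the data sizes follow directly from the bit count. A minimal sketch, assuming unsigned values (the helper name `max_unsigned` is made up for this example):

```python
# Bit widths of the data sizes described above.
SIZES = {"Nibble": 4, "Byte": 8, "Word": 16, "DWord": 32, "QWord": 64}

def max_unsigned(bits: int) -> int:
    """Largest unsigned value that fits into the given number of bits."""
    return (1 << bits) - 1

for name, bits in SIZES.items():
    print(f"{name:6} {bits:2} bits  0 to {max_unsigned(bits):,}")
```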
Endianness / Byte Order
When saving or reading data, the order in which the data is processed is crucial. This can occur from left to right or vice versa. This order is called endianness or byte order. Two types are distinguished: Intel and Motorola byte order.
While DWord values that express a number are usually stored in Intel format, positional information, such as file offsets, is usually expressed in Motorola format.
Intel / LSB / Little Endian
The Intel byte order is also called LSB (least significant byte first) or little endian. In the Intel byte order, the least significant byte is stored first (leftmost position) and the most significant byte last (rightmost position). For example, the stored byte sequence 01 00 results in the value 1 and the byte sequence 00 01 in the value 256.
Motorola / MSB / Big Endian
The Motorola byte order is also called MSB (most significant byte first) or big endian. In Motorola byte order, the most significant byte is stored first (leftmost position) and the least significant byte last (rightmost position). Here, the stored byte sequence 00 01 results in the value 1 and the byte sequence 01 00 in the value 256.
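The difference between the two byte orders can be illustrated by reading the same two bytes both ways. This sketch uses only Python's built-in `int.from_bytes` and `int.to_bytes`; the sample values are arbitrary.

```python
# The same two stored bytes, interpreted in both byte orders.
data = bytes([0x01, 0x00])

little = int.from_bytes(data, "little")  # Intel: first byte is least significant
big = int.from_bytes(data, "big")        # Motorola: first byte is most significant
print(little, big)  # 1 256

# Writing a DWord shows the reversed byte sequence.
dword = 0x12345678
le_bytes = dword.to_bytes(4, "little").hex()
be_bytes = dword.to_bytes(4, "big").hex()
print(le_bytes, be_bytes)  # 78563412 12345678
```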
Character sets and character encodings
A character set specifies which characters can be used. These include letters, numbers, punctuation, and special characters. In a character encoding, each character is assigned a unique number, called a character code. The character code usually starts at 0 (zero).
Types of character sets
Single-byte character sets (SBCS) store a character in one byte. Examples include ASCII, MS-DOS, and Windows code pages, as well as ISO-8859 character sets.
Multibyte character sets (MBCS) store a character in a varying number of bytes. These include UTF-8, UTF-16, and Big5.
Other character sets use a fixed number of bytes per character, such as 2 bytes for UCS-2 or 4 bytes for UTF-32.
UTF-16 does not use a fixed number of bytes: most characters take 2 bytes, but some take 4 bytes.
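How many bytes a character occupies in the different encodings can be checked quickly. The following is a small sketch; the function name `encoded_sizes` and the sample characters are chosen for illustration only.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

for ch in ("A", "ä", "€", "\U0001F9F8"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# A single-byte character set always needs exactly one byte per character.
print(len("ä".encode("iso-8859-1")))  # 1
```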
Common character sets
ASCII
The ASCII character set contains 128 characters. Seven bits are required to store them. The 8th bit is not used in ASCII. In addition to various special characters, it primarily includes uppercase and lowercase letters, numbers, and punctuation marks.
ANSI / 8-bit character sets
8-bit character sets are also known as ANSI character sets. These extend the ASCII character set by also using the 8th bit, allowing a total of 256 characters to be represented. The additional characters are usually used for national special characters and other symbols.
MS-DOS codepages
437 | English |
708 | Arabic (ASMO) |
720 | Arabic (Microsoft) |
737 | Greek |
775 | Baltic |
850 | Western European |
852 | Central European |
855 | Cyrillic |
857 | Turkish |
858 | Western European with Euro |
860 | Portuguese |
861 | Icelandic |
862 | Hebrew |
863 | Canadian French |
864 | Arabic (IBM) |
865 | Nordic |
866 | Russian |
869 | Greek |
Windows codepages
874 | Thai |
932 | Japanese |
936 | Simplified Chinese |
949 | Korean |
950 | Traditional Chinese |
1200 | Unicode UTF-16 LE |
1201 | Unicode UTF-16 BE |
1250 | Central European |
1251 | Cyrillic |
1252 | Western European |
1253 | Greek |
1254 | Turkish |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
1258 | Vietnamese |
12000 | Unicode UTF-32 LE |
12001 | Unicode UTF-32 BE |
65000 | Unicode UTF-7 |
65001 | Unicode UTF-8 |
ISO 8859
ISO 8859-1 | Latin-1, Western European |
ISO 8859-2 | Latin-2, Central European |
ISO 8859-3 | Latin-3, Southern European |
ISO 8859-4 | Latin-4, Northern European |
ISO 8859-5 | Cyrillic |
ISO 8859-6 | Arabic |
ISO 8859-7 | Greek |
ISO 8859-8 | Hebrew |
ISO 8859-9 | Latin-5, Turkish |
ISO 8859-10 | Latin-6, Nordic |
ISO 8859-11 | Thai |
ISO 8859-12 | (not assigned) |
ISO 8859-13 | Latin-7, Baltic |
ISO 8859-14 | Latin-8, Celtic |
ISO 8859-15 | Latin-9, Western European |
ISO 8859-16 | Latin-10, Southeastern European |
Unicode
Unicode is a character set that assigns a unique code point to each character. How many bytes a character occupies depends on the encoding used. Unicode text is primarily stored in the UTF-8 and UTF-16 encodings. UTF stands for Unicode Transformation Format.
UTF-8
UTF-8 uses a sequence of 8-bit numbers (one byte) to store characters. ASCII characters are stored unchanged. This means that letters, numbers, and punctuation marks only require one byte per character in UTF-8 encoding. All other characters, such as national symbols and characters from other writing systems, require two to four bytes per character.
In binary notation, all 1-byte characters begin with a zero. For multi-byte characters, the first byte begins with binary ones, followed by a binary zero. The number of binary ones corresponds to the number of bytes per character.
Examples:
Character | UTF-8 bytes (binary) |
---|---|
Umlaut A (Ä) | 11000011 10000100 |
Euro character (€) | 11100010 10000010 10101100 |
Teddy bear (U+1F9F8) | 11110000 10011111 10100111 10111000 |
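The bit patterns of the examples can be reproduced in Python, and the length of a sequence can be read from its first byte. This is a sketch for illustration; `utf8_bits` and `sequence_length` are hypothetical helper names.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as binary strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))

def sequence_length(first_byte: int) -> int:
    """Derive the number of bytes per character from the leading bits."""
    if first_byte & 0b1000_0000 == 0:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if first_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if first_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if first_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("not a valid first byte of a UTF-8 sequence")

for ch in ("Ä", "€", "\U0001F9F8"):
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```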
If a file saved in UTF-8 is interpreted as ISO 8859-1, UTF-8 characters will not be displayed correctly. For example, the German umlauts "ä ö ü" would be displayed as "Ã¤ Ã¶ Ã¼". If text in ISO 8859-1 is displayed as UTF-8, a question mark or the "�" character is displayed for all invalid characters.
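Both kinds of misinterpretation can be reproduced by encoding text in one character set and decoding it in the other. A short sketch (variable names are illustrative):

```python
text = "ä ö ü"

# UTF-8 bytes read as ISO 8859-1: every umlaut turns into two characters.
misread = text.encode("utf-8").decode("iso-8859-1")
print(misread)  # Ã¤ Ã¶ Ã¼

# ISO 8859-1 bytes read as UTF-8: the bytes are invalid and get replaced
# by U+FFFD, the replacement character.
replaced = text.encode("iso-8859-1").decode("utf-8", errors="replace")
print(replaced)
```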
UTF-16
UTF-16 uses a sequence of 16-bit numbers (two bytes or 1 word) to store characters. All characters from U+0000 to U+FFFF occupy 2 bytes.
The characters U+10000 to U+10FFFF are each stored in 4 bytes. The number 65536 (0x10000) is subtracted from the character code. The resulting 20-bit number is divided into two 10-bit blocks. The first block, with the 10 most significant bits, is preceded by the bit sequence 110110. These 16 bits are called the high surrogate. The second block, with the 10 least significant bits, is preceded by the bit sequence 110111. These 16 bits are called the low surrogate.
20 bits (0 to 19) | |||||||||||||||||||
19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
1st block / higher-order bits | 2nd block / low-order bits |
110110 + bits of the 1st block = high surrogates
110111 + bits of the 2nd block = low surrogates
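The surrogate calculation can be written out step by step. The following sketch uses a hypothetical helper `to_surrogates` and compares its result with Python's own UTF-16 encoder.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need surrogates")
    offset = code_point - 0x10000           # 20-bit number
    high = 0xD800 | (offset >> 10)          # 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # 110111 + lower 10 bits
    return high, low

high, low = to_surrogates(0x1F9F8)          # teddy bear
print(f"{high:04X} {low:04X}")              # D83E DDF8

# Cross-check against the built-in encoder (big endian, high surrogate first).
expected = "\U0001F9F8".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```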
Endianness / Byte Order
The byte order (endianness) determines the position of the 8 bits of an ASCII character within the 16 bits of the UTF-16 word.
With Big Endian (UTF-16BE), a 0 byte is prefixed to the ASCII character.
With Little Endian (UTF-16LE), a 0 byte is appended to the ASCII character.
The high surrogate word always precedes the low surrogate word, regardless of the byte order.
Byte Order Mark
The Byte Order Mark (BOM for short) is located at the beginning of a file and indicates the character encoding used in the subsequent text.
Encoding | Byte Order Mark (hexadecimal) |
---|---|
UTF-8 | EF BB BF |
UTF-16 BE | FE FF |
UTF-16 LE | FF FE |
UTF-32 BE | 00 00 FE FF |
UTF-32 LE | FF FE 00 00 |
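A simple BOM check compares the first bytes of the data with the values from the table. This is a minimal sketch; `detect_bom` is a made-up function name. Note that UTF-32 LE has to be tested before UTF-16 LE, because both begin with FF FE.

```python
import codecs

# Order matters: the 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]

def detect_bom(data: bytes):
    """Return the encoding name indicated by a BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xef\xbb\xbfText"))  # UTF-8
```

Data without a BOM can still be in any of these encodings, so a result of None says nothing about the actual encoding.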
Tip:
The pages "ASCII and ANSI Character Table" and "Unicode Character Tables" contain character tables for ASCII characters and the most important Unicode characters.
Encoding
In certain situations, binary data must be converted into other characters. This can be for technical reasons, such as data transmission over the Internet, or for practical reasons, such as with a hexadecimal editor.
Conversion to another number system
Encoding is usually just the conversion from one number system to another. For example, from a decimal ASCII/ANSI character code to the hexadecimal system.
The decimal character codes of the word "Test" would be represented as follows in different number systems:
Number system | Base | Representation of the word "Test" |
---|---|---|
Decimal | 10 | 84 101 115 116 |
Hexadecimal | 16 | 54 65 73 74 |
Octal | 8 | 124 145 163 164 |
Binary | 2 | 01010100 01100101 01110011 01110100 |
In hexadecimal notation, the separators are sometimes omitted, e.g. "54657374".
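The rows of the table can be generated from the character codes of the word. A short sketch using Python's format specifiers:

```python
codes = "Test".encode("ascii")  # character codes as bytes

decimal = " ".join(str(b) for b in codes)
hexadecimal = " ".join(f"{b:02X}" for b in codes)
octal = " ".join(f"{b:o}" for b in codes)
binary = " ".join(f"{b:08b}" for b in codes)

print(decimal)      # 84 101 115 116
print(hexadecimal)  # 54 65 73 74
print(octal)        # 124 145 163 164
print(binary)       # 01010100 01100101 01110011 01110100

# Hexadecimal notation without separators can be decoded directly.
print(bytes.fromhex("54657374").decode("ascii"))  # Test
```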
Base64
Base64 (B64) is a very common encoding. The base of this encoding is 64, which means it uses 64 characters to represent data. These are the uppercase and lowercase letters, the digits 0 through 9, the plus sign (+), and the forward slash (/). During encoding, every 3 bytes of binary data are converted into 4 encoded characters. If the input length is not a multiple of 3, the output is padded with equals signs (=) to a multiple of 4 characters.
Depending on the application, the Base64 alphabet, especially the order of letters and numbers, can vary.
Example:
This is a Test
As Base64, this results in:
VGhpcyBpcyBhIFRlc3Q=
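The example can be reproduced with the `base64` module from the Python standard library. The second part shows the URL-safe alphabet as one case of a varying alphabet; the two sample bytes are chosen only because they produce the characters that differ.

```python
import base64

encoded = base64.b64encode(b"This is a Test").decode("ascii")
print(encoded)  # VGhpcyBpcyBhIFRlc3Q=

decoded = base64.b64decode(encoded)
print(decoded)  # b'This is a Test'

# The URL-safe variant replaces "+" and "/" with "-" and "_".
standard = base64.b64encode(b"\xfb\xff").decode("ascii")
url_safe = base64.urlsafe_b64encode(b"\xfb\xff").decode("ascii")
print(standard, url_safe)  # +/8= -_8=
```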
Quoted-Printable
Quoted-Printable (QP) is mostly used on the internet, for example, in e-mails. It encodes some ASCII characters (0-127) and all characters from 128-255. An encoded character is written as an equals sign (=) followed by the character's two-digit hexadecimal value. An equals sign at the end of a line marks a line break inserted during encoding, which is removed again during decoding.
Example:
Ein großer Test mit äöü. 5 + 10 = 15 Leerzeichen am Ende
Encoded as Quoted-Printable, this results in:
Ein gro=DFer Test= mit =E4=F6=FC. 5 + 10 =3D 15 Leerzeichen am Ende=20
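Python's standard library provides the `quopri` module for this encoding. The sketch below assumes that the text is stored in ISO 8859-1, which matches the hexadecimal values in the example above; the shortened sample text is chosen for illustration.

```python
import quopri

text = "Ein großer Test = äöü"
raw = text.encode("iso-8859-1")  # assumed single-byte character set

encoded = quopri.encodestring(raw)
print(encoded)

# Decoding restores the original bytes.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")
print(decoded)
```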
URL encoding
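To pass reserved or invalid characters in a URL, they are encoded using a percent sign followed by the two-digit hexadecimal value of the character. Characters outside the ASCII range are usually converted to UTF-8 first, and each resulting byte is encoded separately.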
Example:
.../ein test/äöü/groß/
Resulting in URL encoding:
.../ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/
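The same result is produced by `urllib.parse.quote` from the Python standard library, which encodes non-ASCII characters as UTF-8 by default and leaves the slash untouched. The leading "..." of the example is omitted in this sketch.

```python
from urllib.parse import quote, unquote

path = "/ein test/äöü/groß/"

encoded = quote(path)
print(encoded)  # /ein%20test/%C3%A4%C3%B6%C3%BC/gro%C3%9F/

decoded = unquote(encoded)
print(decoded)  # /ein test/äöü/groß/
```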
Encoding and Data Format Overview
This table contains encodings, data formats, and representations, including a brief description and an example. Based on the format and characters used, the type of encoding can sometimes be determined or at least narrowed down.
Encoding or Format | Description and Usage | Example |
---|---|---|
Base32 | Data transmission as 7-bit ASCII characters. Uses only letters and digits (usually A-Z and 2-7), no special characters apart from the padding. | I5QWS2TJNYXGC5A= |
Base45 | Encodes byte data using a restricted set of symbols. Used in QR codes. | 319VEDZED%%5Q2 |
Base58 | Data transmission as 7-bit ASCII characters. Without the use of similar characters such as I, l, 0, and O. | uhTz6mAX8qMM |
Base62 | Encodes byte data using a limited character set (0-9, A-Z, a-z). | PIqwPVLBEtRk |
Base64 | Data transmission as 7-bit ASCII characters. Usually for binary data. | R2FpamluLmF0Cg== |
Base85 | Also called Ascii85. A notation for encoding arbitrary byte data. It is usually more efficient than Base64. Used in PostScript and PDF. | 7q$+HBl5P3F8 |
Base92 | Encodes byte data using a restricted set of symbols. | ;+1s]]cDsutW |
Binary | Representation of character codes based on 2 (0 and 1). | 01000111 01100001 01101001 01101010 01101001 01101110 |
Decimal | Representation of character code based on 10 (0-9). | 71 97 105 106 105 110 46 97 116 |
Hashes | A mathematically calculated checksum for any data. It is usually represented in hexadecimal form. The length varies depending on the method. | d47ef03e79e7eb76b4c9b6deac8b61c1 |
Hexadecimal | Representation of character codes based on 16 (0-9, A-F). | 47 61 69 6A 69 6E 2E 61 74 |
HTML Entities | Encoding for special characters in HTML pages. | &lt;F&uuml;nf&gt; != &#8710; |
Octal | Representation of character codes based on 8 (0-7). Rare, except for Linux file permissions. | 107 141 151 152 151 156 56 141 164 |
Punycode | Encoding of non-ASCII characters, mainly for IDNs (Internationalized Domains). Encoded part begins with "xn--". | obst.xn--pfel-koa.example.com |
Quoted-Printable | Text transmission as 7-bit ASCII characters. | F=FCnf =D7 drei =3D 15 |
ROT13 | A simple Caesar substitution cipher that shifts letters 13 positions. The number of character shifts may vary. Sometimes used to display hints or answers in a way that isn't immediately legible. | Tnvwva.ng |
URL-Encoding | Data encoding that can be used as a parameter in a URL. | https%3A%2F%2Fwww%2Egaijin%2Eat%2F |
Vigenère | Encrypts alphabetic text using a number of different Caesar ciphers based on the letters of a keyword. | Maqsqa.gt |
XOR | Binary exclusive-or operation of the bit values with an arbitrary key. | Dbjijm-bw |
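Some of the table's examples can be reproduced from the text that the hexadecimal row decodes to. The sketch below assumes a single-byte XOR key of 3; this key is inferred from the example in the table and is not stated in the source.

```python
import codecs

# The hexadecimal row of the table decodes to the sample text.
plain = bytes.fromhex("47 61 69 6A 69 6E 2E 61 74").decode("ascii")
print(plain)  # Gaijin.at

# ROT13 shifts each letter by 13 positions; other characters stay as they are.
rot13 = codecs.encode(plain, "rot_13")
print(rot13)  # Tnvwva.ng

# XOR of every byte with the key (assumption: key = 3).
KEY = 3
xored = bytes(b ^ KEY for b in plain.encode("ascii")).decode("ascii")
print(xored)  # Dbjijm-bw
```

Applying ROT13 or the XOR with the same key a second time restores the original text.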
Date formats and time formats
DOS date / DOS time
Under DOS (FAT-16), a date value and a time value are each stored in 2 bytes (1 WORD or 16 bits). The bits are divided as follows:
DOS date:
Bits | Description |
---|---|
0 - 4 | Day of the month (1 to 31) |
5 - 8 | Month (1 = January, 2 = February, etc.) |
9 - 15 | Year offset starting from 1980 (0 = 1980, 1 = 1981, etc.) |
DOS time:
Bits | Description |
---|---|
0 - 4 | Seconds divided by 2 |
5 - 10 | Minutes (0 to 59) |
11 - 15 | Hours (0 to 23) |
Possible date range: 1980-01-01 to 2107-12-31 (the 7-bit year offset allows values from 0 to 127)
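The bit layout from the two tables can be decoded with shifts and masks. A minimal sketch; the function names and the sample values are chosen for illustration.

```python
def decode_dos_date(value: int) -> tuple:
    """Split a 16-bit DOS date into (year, month, day)."""
    day = value & 0x1F            # bits 0 - 4
    month = (value >> 5) & 0x0F   # bits 5 - 8
    year = 1980 + (value >> 9)    # bits 9 - 15
    return year, month, day

def decode_dos_time(value: int) -> tuple:
    """Split a 16-bit DOS time into (hours, minutes, seconds)."""
    seconds = (value & 0x1F) * 2    # bits 0 - 4, stored divided by 2
    minutes = (value >> 5) & 0x3F   # bits 5 - 10
    hours = (value >> 11) & 0x1F    # bits 11 - 15
    return hours, minutes, seconds

print(decode_dos_date(22639))  # (2024, 3, 15)
print(decode_dos_time(27829))  # (13, 37, 42)
```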
Unix timestamp
The Unix timestamp, also known as POSIX time, Unix epoch, C time, or time_t, is traditionally stored as a signed DWORD (4 bytes or 32 bits) and is used primarily in the UNIX environment. It indicates the number of seconds that have passed since 00:00:00 on January 1, 1970, and is usually expressed in UTC time.
Possible date range: 1970-01-01 00:00:00 to 2038-01-19 03:14:07 for non-negative values on 32-bit systems
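The end of the 32-bit range can be verified with the `datetime` module. This sketch interprets the timestamp as UTC; `from_unix` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def from_unix(seconds: int) -> datetime:
    """Convert a Unix timestamp to a UTC date and time."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(from_unix(0))          # 1970-01-01 00:00:00+00:00
print(from_unix(2**31 - 1))  # 2038-01-19 03:14:07+00:00
```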
Windows FILETIME
The Windows FILETIME is stored as a structure consisting of two DWORD values (32-bit unsigned integers, 8 bytes or 64 bits in total) and corresponds to the number of 100-nanosecond intervals since January 1, 1601 (UTC).
Possible date range: 1601-01-01 00:00:00 to 30828-09-14 02:48:05 (when the value is treated as a signed 64-bit integer)
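A FILETIME can be converted by combining the two DWORD values and adding the intervals to the reference date. This is a sketch with a made-up function name; Python's `datetime` only reaches up to the year 9999 and has microsecond resolution, so the last decimal digit of the value is dropped.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(low: int, high: int) -> datetime:
    """Combine the low and high DWORD and convert to a UTC date and time."""
    ticks = (high << 32) | low  # number of 100-nanosecond intervals
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# 116444736000000000 intervals lie between 1601-01-01 and 1970-01-01.
ticks = 116444736000000000
print(filetime_to_datetime(ticks & 0xFFFFFFFF, ticks >> 32))
```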
Windows SYSTEMTIME
The Windows SYSTEMTIME is stored as a structure consisting of 8 WORD values. It contains each date and time component (year, month, day, day of the week, hour, minute, second, and millisecond) as a separate 16-bit integer. The time is specified either in the local time zone or in UTC, depending on the function called.
Possible date range: 1601-01-01 00:00:00.000 to 30827-12-31 23:59:59.999
Mach Absolute Time
Mach Absolute Time, also known as Mac Absolute Time, Apple Cocoa Core Data Timestamp, or CFAbsoluteTime, represents a point in time relative to the absolute reference time, January 1, 2001, 00:00:00 GMT. The time value is stored as a 32-bit signed integer. Time is measured by the number of seconds between the reference time and the specified time. Negative values indicate a time before the reference time, positive values indicate a time after the reference time.
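The conversion works like the Unix timestamp, only with a different reference time. A minimal sketch, assuming a whole number of seconds as described above (`from_mac_absolute` is a hypothetical name):

```python
from datetime import datetime, timedelta, timezone

MAC_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def from_mac_absolute(seconds: int) -> datetime:
    """Convert seconds relative to 2001-01-01 00:00:00 to a UTC date and time."""
    return MAC_EPOCH + timedelta(seconds=seconds)

print(from_mac_absolute(0))           # 2001-01-01 00:00:00+00:00
print(from_mac_absolute(-978307200))  # 1970-01-01 00:00:00+00:00
```

The second call shows a negative value: 978,307,200 seconds lie between the Unix reference time and the Mac reference time.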
ISO 8601
The ISO 8601 time format is a text-based representation of a date and time value. The format is "YYYY-MM-DDThh:mm:ss" (4-digit year, followed by the 2-digit month and day, the letter T as a separator, and the hours, minutes, and seconds). In practice, the T is often replaced by a space, as in "YYYY-MM-DD hh:mm:ss".
Possible date range: 0001-01-01 00:00:00 to 9999-12-31 23:59:59
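Python's `datetime` type covers the same range of years and can write and read this text format. A short sketch with an arbitrary sample value:

```python
from datetime import datetime

stamp = datetime(2024, 3, 15, 13, 37, 42)

with_t = stamp.isoformat()             # separator "T"
with_space = stamp.isoformat(sep=" ")  # separator " "
print(with_t)      # 2024-03-15T13:37:42
print(with_space)  # 2024-03-15 13:37:42

# Both spellings can be parsed back into the same value.
parsed = datetime.fromisoformat("2024-03-15 13:37:42")
print(parsed == stamp)  # True
```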
Slack
File slack, RAM slack and drive slack
The storage space of a data storage device (a so-called block-oriented mass storage device) is divided into sectors. Typically, hard drives use 512 bytes, CDs and DVDs 2048 bytes, and newer hard drives and SSDs (solid-state drives) 4096 bytes per sector.
Clusters are groups of one to 128 sectors. Clusters are used to store files, with a file always starting at the beginning of a cluster.
The following table assumes that a cluster consists of 8 sectors and that the file contents are not a multiple of the sector size and end within the 6th sector. Since data is written to the storage device sector by sector, the missing bytes must be filled up to the end of the sector. Previously, random RAM contents were used for this purpose, which is why it was called "RAM slack". The Drive Slack is usually not overwritten and usually still contains the data that was located at this location before the file was saved and was previously associated with a file that has since been deleted.
Sectors 1 - 5 | Sector 6 | Sector 7 | Sector 8 | |
---|---|---|---|---|
File content | Last part of file content |
RAM slack | Drive slack | |
File slack |
File slack encompasses the area from the end of the file to the end of the last used cluster of the file (RAM slack + drive slack).
File slack is sometimes referred to as file offset or slack space.
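The sizes of the slack areas follow from the file size, the sector size, and the cluster size. The sketch below uses the values from the table (8 sectors per cluster) and assumes 512-byte sectors and a file size greater than zero; the function name and the sample file size are made up.

```python
def slack_sizes(file_size: int, sector_size: int = 512, sectors_per_cluster: int = 8) -> tuple:
    """Return (ram_slack, drive_slack, file_slack) in bytes."""
    cluster_size = sector_size * sectors_per_cluster
    # Bytes from the end of the file to the end of its last sector.
    ram_slack = -file_size % sector_size
    # Space occupied by all clusters of the file (rounded up).
    allocated = -(-file_size // cluster_size) * cluster_size
    file_slack = allocated - file_size
    # Remaining full sectors up to the end of the last cluster.
    drive_slack = file_slack - ram_slack
    return ram_slack, drive_slack, file_slack

# A file that ends 100 bytes into the 6th sector of its cluster.
print(slack_sizes(5 * 512 + 100))  # (412, 1024, 1436)
```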
MFT slack
In NTFS, information about a file is stored in a file record. This contains data such as the file name, file times, file size, and position in the file system. At least 1 KB or the cluster size (usually 4 KB) is reserved for a file record. If the file to be stored is not larger than the unused space in the file record, the contents are stored directly in the file record. If the file is enlarged and there is no more space for it in the file record, the file is stored in the data area of the partition. The original file contents can then remain in the file record.
Partition slack
Partition slack is the unused space between two partitions and after the last partition up to the end of the disk. If data was already stored in these areas before partitioning, it may have been retained.
Structure of URLs
A URL (Uniform Resource Locator) contains information about the protocol and the exact location of a resource on the Internet. The URL is structured differently for different protocols. In addition to http and https, ftp, whois, telnet, news, file, and mailto are also commonly used. This section primarily discusses the URL for HTTP(S).
Structure
                authority
        ┌───────────┴────────────┐
https://tester@www.example.com:443/news/articles?page=3&lang=en#top
└─┬─┘   └─┬──┘ └──────┬──────┘ └┬┘└─────┬──────┘ └─────┬──────┘ └┬┘
scheme userinfo      host      port    path          query   fragment
Scheme
The scheme refers to the protocol and must be specified in a URL. The scheme is followed by a colon and then the scheme-specific parts of the URL.
Examples of other protocols:
- mailto:recipient@example.com
- ftp://ftp.example.com
- news:comp.programming
- telnet://example.com
- file:///directory/file.txt
User Information (userinfo)
A URL can contain a username and a password separated by a colon. For security reasons, user credentials should no longer be used for HTTP/HTTPS. At the very least, you should avoid specifying the password. A URL with just the username would be, for example, "https://tester@www.example.com". The user credentials are followed by an at sign (@).
Host
The host is the hostname, IPv4, or IPv6 address of the request destination. Specifying the host is required. The hostname itself is structured as follows:
Subdomain . Second-level domain . Top-level domain
The "domain" is usually composed of the second-level domain and the top-level domain, as in "example.com".
In some cases, the second-level domain is a ccSLD (country code second-level domain). For some country-specific top-level domains, there are special second-level domains (suffixes), such as ".co.at", ".or.at", or ".gv.at". Only certain groups or authorities can register third-level domains under these. For example, ".gv.at" is used only for government agencies.
Port
After the host, a port can be specified, separated by a colon. This is usually only required if it differs from the default port for the protocol in question.
Path
Paths begin with a slash. Then come the directories, and finally the filename. The individual parts of the path are separated by a slash. For a directory, the path ends with a slash.
A path doesn't necessarily have to consist of directories and files, even if it appears that way. Various server mechanisms can redirect requests to any file.
Parameters (query)
Parameters can be passed after the path, which must be preceded by a question mark. Multiple parameters are separated by an ampersand. Each parameter consists of a name and a value, separated by an equals sign. Example: "?parameter1=value1&parameter2=value2".
Jump Mark (fragment)
After a hash mark, the name of a jump target can be specified in the HTML document. These are defined in HTML using the "id" attribute. This part is sometimes called the "hash".
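The parts of a URL described in this section can be extracted with `urllib.parse` from the Python standard library. The sketch uses the example URL from the structure diagram.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://tester@www.example.com:443/news/articles?page=3&lang=en#top"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # tester@www.example.com:443 (the authority)
print(parts.username)  # tester
print(parts.hostname)  # www.example.com
print(parts.port)      # 443
print(parts.path)      # /news/articles
print(parts.query)     # page=3&lang=en
print(parts.fragment)  # top

# The query can be split further into names and values.
params = parse_qs(parts.query)
print(params)          # {'page': ['3'], 'lang': ['en']}
```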