4.1 Introduction: what is a byte?

A computer cannot store “numbers” or “letters”. The only thing a computer can store and work with is bits. A bit is binary: it is either a \(0\) or a \(1\). In fact, from a physics perspective, a bit is just a blip of electricity that either is or isn’t there.

In the past the ASCII character set dominated computing. This set defines \(128\) characters, including the digits \(0\) to \(9\), upper- and lower-case letters, and a few control characters such as the new line. Storing these characters requires \(7\) bits, since \(2^7 = 128\), but \(8\) bits were typically used for performance reasons. Table 3.1 gives the binary representation of the first few characters.

Bit representation   Character
\(01000001\)         A
\(01000010\)         B
\(01000011\)         C
\(01000100\)         D
\(01000101\)         E
\(01010010\)         R

Table 3.1: The bit representation of a few ASCII characters.
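These bit patterns can be inspected programmatically. The short Python sketch below (an illustration, not part of the original table) uses the built-in ord function to print the \(8\)-bit representation of the characters in Table 3.1.

```python
# Print the 8-bit representation of a few ASCII characters.
for ch in "ABCDER":
    bits = format(ord(ch), "08b")  # ord() gives the code point; format it as 8 binary digits
    print(bits, ch)                # e.g. 01000001 A
```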

The limitation of having only \(256\) characters led to the development of Unicode, a standard framework that aims to define a single character set for every reasonable writing system. Typically, Unicode characters require sixteen bits (two bytes) of storage.
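As a rough sketch of what this means in practice, the Python snippet below reports how many bytes a single character occupies once encoded: a plain ASCII character needs one byte in UTF-8, whereas UTF-16 stores common characters in sixteen bits (two bytes).

```python
# Bytes needed to store one character under different encodings.
print(len("A".encode("utf-8")))      # 1 byte: ASCII characters fit in 8 bits
print(len("é".encode("utf-8")))      # 2 bytes: accented characters need more room
print(len("A".encode("utf-16-le")))  # 2 bytes: UTF-16 uses 16 bits for common characters
```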

Eight bits make one byte, which is enough to store one ASCII character. So two ASCII characters would use two bytes, or \(16\) bits, and a pure text document containing \(100\) characters would use \(100\) bytes (\(800\) bits). Note that mark-up, such as font information or meta-data, can impose a substantial memory overhead: an empty .docx file requires about \(3,700\) bytes of storage.
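To make the arithmetic concrete, the sketch below (the file name is hypothetical) writes \(100\) ASCII characters to a plain-text file and checks its size on disk: \(100\) bytes, i.e. \(800\) bits.

```python
import os

text = "a" * 100                   # 100 ASCII characters
with open("plain.txt", "w") as f:  # hypothetical file name
    f.write(text)

size_bytes = os.path.getsize("plain.txt")
print(size_bytes, "bytes =", size_bytes * 8, "bits")  # 100 bytes = 800 bits
```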

When computer scientists first started to think about computer memory, they noticed that \(2^{10} = 1024 \simeq 10^3\) and \(2^{20} = 1,048,576 \simeq 10^6\), so they adopted the shorthand of kilo- and mega-bytes. Of course, everyone knew that it was just a shorthand, and that it really referred to a binary power. When computers became more widespread, foolish people like you and me just assumed that kilo actually meant \(10^3\) bytes.

Fortunately the IEEE Standards Board intervened and created conventional, internationally adopted definitions of the International System of Units (SI) prefixes. So a kilobyte (KB) is \(10^3 = 1000\) bytes and a megabyte (MB) is \(10^6\) bytes, or \(10^3\) kilobytes (see Table 3.2). To put the larger prefixes in perspective, a petabyte is approximately \(100\) million drawers filled with text. Astonishingly, Google processes around \(20\) petabytes of data every day.

Factor       Name   Symbol   Origin                       Derivation
\(2^{10}\)   kibi   Ki       Kilobinary: \((2^{10})^1\)   Kilo: \(10^3\)
\(2^{20}\)   mebi   Mi       Megabinary: \((2^{10})^2\)   Mega: \(10^6\)
\(2^{30}\)   gibi   Gi       Gigabinary: \((2^{10})^3\)   Giga: \(10^9\)
\(2^{40}\)   tebi   Ti       Terabinary: \((2^{10})^4\)   Tera: \(10^{12}\)
\(2^{50}\)   pebi   Pi       Petabinary: \((2^{10})^5\)   Peta: \(10^{15}\)

Table 3.2: Data conversion table. Credit: http://physics.nist.gov/cuu/Units/binary.html
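The binary prefixes in Table 3.2 drift further from their decimal counterparts as the factors grow; the short sketch below makes the comparison explicit.

```python
# Compare binary (kibi, mebi, ...) with decimal (kilo, mega, ...) prefixes.
prefixes = ["kilo/kibi", "mega/mebi", "giga/gibi", "tera/tebi", "peta/pebi"]
for i, name in enumerate(prefixes, start=1):
    decimal = 10 ** (3 * i)   # SI prefix, e.g. kilo = 10^3
    binary = 2 ** (10 * i)    # binary prefix, e.g. kibi = 2^10
    print(f"{name}: binary is {binary / decimal:.1%} of the decimal value")
```

For example, a kibibyte is only \(2.4\%\) larger than a kilobyte, but a pebibyte is already about \(12.6\%\) larger than a petabyte.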

Even though there is now an agreed standard for discussing memory, that doesn’t mean that everyone follows it. Microsoft Windows, for example, uses 1MB to mean \(2^{20}\)B. Even more confusingly, the capacity of a \(1.44\)MB floppy disk is a mixture of the two conventions: \(1\text{MB} = 10^3 \times 2^{10}\)B.
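The sketch below spells out the three conventions side by side: the SI definition, the Windows-style binary definition, and the floppy-disk mixture.

```python
# Three different interpretations of "1 MB".
si_mb     = 10 ** 6            # SI definition:  1,000,000 bytes
binary_mb = 2 ** 20            # Windows-style:  1,048,576 bytes
floppy_mb = 10 ** 3 * 2 ** 10  # floppy mixture: 1,024,000 bytes

print(si_mb, binary_mb, floppy_mb)
print("1.44 MB floppy:", int(1.44 * floppy_mb), "bytes")  # 1,474,560 bytes
```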