\myheading{Text strings right in the middle of compressed data} %\myindex{Linux kernel} You can download Linux kernels and find English words right in the middle of compressed data: \begin{lstlisting} % wget https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.10.2.tar.gz % xxd -g 1 -seek 0x515c550 -l 0x30 linux-4.10.2.tar.gz 0515c550: c5 59 43 cf 41 27 85 54 35 4a 57 90 73 89 b7 6a .YC.A'.T5JW.s..j 0515c560: 15 af 03 db 20 df 6a 51 f9 56 49 52 55 53 3d da .... .jQ.VIRUS=. 0515c570: 0e b9 29 24 cc 6a 38 e2 78 66 09 33 72 aa 88 df ..)$.j8.xf.3r... \end{lstlisting} \begin{lstlisting} % wget https://cdn.kernel.org/pub/linux/kernel/v2.3/linux-2.3.3.tar.bz2 % xxd -g 1 -seek 0xa93086 -l 0x30 linux-2.3.3.tar.bz2 00a93086: 4d 45 54 41 4c cd 44 45 2d 2c 41 41 54 94 8b a1 METAL.DE-,AAT... 00a93096: 5d 2b d8 d0 bd d8 06 91 74 ab 41 a0 0a 8a 94 68 ]+......t.A....h 00a930a6: 66 56 86 81 68 0d 0e 25 6b b6 80 a4 28 1a 00 a4 fV..h..%k...(... \end{lstlisting} One of Linux kernel patches in compressed form has the ``Linux'' word itself (not a bad self-referential joke): \begin{lstlisting} % wget https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/patch-4.6-rc4.gz % xxd -g 1 -seek 0x4d03f -l 0x30 patch-4.6-rc4.gz 0004d03f: c7 40 24 bd ae ef ee 03 2c 95 dc 65 eb 31 d3 f1 .@$.....,..e.1.. 0004d04f: 4c 69 6e 75 78 f2 f3 70 3c 3a bd 3e bd f8 59 7e Linux..p<:.>..Y~ 0004d05f: cd 76 55 74 2b cb d5 af 7a 35 56 d7 5e 07 5a 67 .vUt+...z5V.^.Zg \end{lstlisting} Other English words I've found in other compressed Linux kernel trees: \begin{lstlisting} linux-4.6.2.tar.gz: [maybe] at 0x68e78ec linux-4.10.14.tar.xz: [OCEAN] at 0x6bf0a8 linux-4.7.8.tar.gz: [FUNNY] at 0x29e6e20 linux-4.6.4.tar.gz: [DRINK] at 0x68dc314 linux-2.6.11.8.tar.bz2: [LUCKY] at 0x1ab5be7 linux-3.0.68.tar.gz: [BOOST] at 0x11238c7 linux-3.0.16.tar.bz2: [APPLE] at 0x34c091 linux-3.0.26.tar.xz: [magic] at 0x296f7d9 linux-3.11.8.tar.bz2: [TRUTH] at 0xf635ba linux-3.10.11.tar.bz2: [logic] at 0x4a7f794 \end{lstlisting} %\myindex{Apophenia} %\myindex{Pareidolia} %\myindex{Lurkmore} There is a nice illustration of apophenia and pareidolia (human's mind ability to see faces in clouds, etc) in Lurkmore, Russian counterpart of Encyclopedia Dramatica. As they wrote in the article about electronic voice phenomenon\footnote{\url{http://archive.is/gYnFL}}, you can open any long enough compressed file in hex editor and find well-known 3-letter Russian obscene word, and you'll find it a lot: but that means nothing, just a mere coincidence. And I was interested in calculation, how big compressed file must be to contain all possible 3-letter, 4-letter, etc, words? In my naive calculations, I've got this: probability of the first specific byte in the middle of compressed data stream with maximal entropy is $\frac{1}{256}$, probability of the 2nd is also $\frac{1}{256}$, and probability of specific byte pair is $\frac{1}{256 \cdot 256} = \frac{1}{256^2}$. Probability of specific triple is $\frac{1}{256^3}$. If the file has maximal entropy (which is almost unachievable, but \dots) and we live in an ideal world, you've got to have a file of size just $256^3=16777216$, which is 16-17MB. %\myindex{rafind2} You can check: get any compressed file, and use \textit{rafind2} to search for any 3-letter word (not just that Russian obscene one). It took $\approx$ 8-9 GB of my downloaded movies/TV series files to find the word ``beer'' in them (case sensitive). Perhaps, these movies wasn't compressed good enough? This is also true for a well-known 4-letter English obscene word. My approach is naive, so I googled for mathematically grounded one, and have find this question: ``Time until a consecutive sequence of ones in a random bit sequence'' \footnote{\url{http://math.stackexchange.com/questions/27989/time-until-a-consecutive-sequence-of-ones-in-a-random-bit-sequence/27991\#27991}}. The answer is: $(p^{−n}−1)/(1−p)$, where $p$ is probability of each event and $n$ is number of consecutive events. Plug $\frac{1}{256}$ and $3$ and you'll get almost the same as my naive calculations. So any 3-letter word can be found in the compressed file (with ideal entropy) of length $256^3 = \approx 17MB$, any 4-letter word --- $256^4 = 4.7GB$ (size of DVD). Any 5-letter word --- $256^5 = \approx 1TB$. For the piece of text you are reading now, I mirrored the whole \href{https://www.kernel.org/}{kernel.org} website (hopefully, sysadmins can forgive me), and it has $\approx$ 430GB of compressed Linux Kernel source trees. It has enough compressed data to contain these words, however, I cheated a bit: I searched for both lowercase and uppercase strings, thus compressed data set I need is almost halved. This is quite interesting thing to think about: 1TB of compressed data with maximal entropy has all possible 5-byte chains, but the data is encoded not in chains itself, but in the order of chains (no matter of compression algorithm, etc). Now the information for gamblers: one should throw a dice $\approx 42$ times to get a pair of six, but no one will tell you, when exactly this will happen. %\myindex{Rosencrantz \& Guildenstern Are Dead} I don't remember, how many times coin was tossed in the ``Rosencrantz \& Guildenstern Are Dead'' movie, but one should toss it $\approx 2048$ times and at some point, you'll get 10 heads in a row, and at some other point, 10 tails in a row. Again, no one will tell you, when exactly this will happen. Compressed data can also be treated as a stream of random data, so we can use the same mathematics to determine probabilities, etc. If you can live with strings of mixed case, like ``bEeR'', probabilities and compressed data sets are much lower: $128^3=2MB$ for all 3-letter words of mixed case, $128^4=268MB$ for all 4-letter words, $128^5=34GB$ for all 5-letter words, etc. %\myindex{Jorge Luis Borges} Moral of the story: whenever you search for some patterns, you can find it in the middle of compressed blob, but that means nothing else then coincidence. In philosophical sense, this is a case of selection/confirmation bias: you find what you search for in ``The Library of Babel'' \footnote{A short story by Jorge Luis Borges}. Also, a citation from a Marten Gardner's book about randomness of $\pi$: \begin{framed} \begin{quotation} So far pi has passed all statistical tests for randomness. This is disconcerting to those who feel that a curve so simple and beautiful as the circle should have a less-disheveled ratio between the way around and the way across, but most mathematicians believe that no pattern or order of any sort will ever be found in pi's decimal expansion. Of course the digits are not random in the sense that they represent pi, but then in this sense neither are the million random digits that have been published by the Rand Corporation of California. They too represent a single number, and an integer at that. If it is true that the digits in pi are random, perhaps we are justified in stating a paradox somewhat similar to the assertion that if a group of monkeys pound long enough on typewriters, they will eventually type all the plays of Shakespeare. Stephen Barr has pointed out that if you set no limit to the accuracy with which two bars can be constructed and measured, then those two bars, without any markings on them, can communicate the entire Encyclopaedia Britannica. One bar is taken as unity. The other differs from unity by a fraction that is expressed as a very long decimal. This decimal codes the Britannica by the simple process of assigning a different number (excluding zero as a digit in the number) to every word and mark of punctuation in the language. Zero is used to separate the code numbers. Obviously the entire Britannica can now be coded as a single, but almost inconceivably long, number. Put a decimal point in front of this number, add 1, and you have the length of the second of Barr's bars. Where does pi come in? Well, if the digits in pi are really random, then somewhere in this infinite pie there should be a slice that contains the Britannica; or, for that matter, any book that has been written, will be written, or could be written. \end{quotation} \end{framed} ( Martin Gardner -- New Mathematical Diversions\footnote{\url{https://books.google.com/books?id=VE0FEAAAQBAJ}} ) Also: \begin{framed} \begin{quotation} A sequence of six 9's occurs in the decimal representation of the number pi (π), starting at the 762nd decimal place. It has become famous because of the mathematical coincidence and because of the idea that one could memorize the digits of π up to that point, recite them and end with "nine nine nine nine nine nine and so on", which seems to suggest that π is rational. The earliest known mention of this idea occurs in Douglas Hofstadter's 1985 book Metamagical Themas, where Hofstadter states\\ \\ I myself once learned 380 digits of π, when I was a crazy high-school kid. My never-attained ambition was to reach the spot, 762 digits out in the decimal expansion, where it goes "999999", so that I could recite it out loud, come to those six 9's, and then impishly say, "and so on!" — Douglas Hofstadter, Metamagical Themas \end{quotation} \end{framed} ( \url{https://en.wikipedia.org/wiki/Six_nines_in_pi} )