Just wondering if it is possible to create a file which has its md5sum inside it along with other contents too. |
|||||||||||||||||||||
migrated from unix.stackexchange.com May 16 '11 at 7:05. This question came from our site for users of Linux, FreeBSD and other Un*x-like operating systems.
|||||||||||||||||||||
|
Theoretically? Yes. Practically, however, since /any/ change to a file's contents, no matter how minute, causes a drastic change in the checksum (which is how md5 checksums work, after all), you'd need to be able to predict how the checksum will change when you alter the file to include the checksum -- for all intents and purposes this isn't much different from being able to break the md5 hashing algorithm. There's no such thing as "impossible" in cryptography, but the science does acknowledge the concept of "practically undoable" or "statistically improbable" and that's pretty much what you're dealing with here, at the moment. |
|||||||||||||||||||||
|
Consider this: you create a file that contains every member of the set of 16-byte sequences. An MD5 checksum is a 16-byte sequence, so by definition this file contains its own MD5 checksum. Somewhere. |
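A scaled-down sketch of that idea, for anyone who wants to try it: the full construction needs every 16-byte sequence, which is far too large to build, so this toy assumes a digest truncated to 2 bytes. The names `K`, `truncated_md5` and `everything` are made up for the illustration.

```python
import hashlib
from itertools import product

K = 2  # toy digest length in bytes (assumption; real MD5 digests are 16 bytes)

def truncated_md5(data: bytes, k: int = K) -> bytes:
    """First k bytes of the MD5 digest, a toy stand-in for the full digest."""
    return hashlib.md5(data).digest()[:k]

# Concatenate every possible K-byte sequence: 256**K sequences of K bytes each.
everything = b"".join(bytes(seq) for seq in product(range(256), repeat=K))

digest = truncated_md5(everything)
print(len(everything))       # 131072
print(digest in everything)  # True: every K-byte value occurs somewhere

# Size in bytes of the same construction for full 16-byte digests.
full_size = 16 * 256**16
print(full_size == 2**132)   # True
```

The last line shows why the full-size version is only a thought experiment: the file would be 2^132 bytes long.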
|||||||||||||||||||||
|
Cryptographically speaking, the attack you are describing is actually harder than finding a first preimage, maybe even harder than finding a second preimage. This is not possible given today's computing power and today's crypto attacks. Current attacks on MD5 don't even come close to finding preimages; we are talking about something completely different from the various collision attacks that have been demonstrated (and are the reason MD5 is considered somewhat insecure). The attack that would be required to create a file with its MD5 in it has nothing to do with collisions. Because such an attack is, as I mentioned, even harder than a preimage attack, I would say it is very unlikely in our lifetimes.
|||||||||
|
Update: thinking about it again, I found a method that should allow the construction of a file containing its own MD5 much faster than what I was explaining initially. The new cost should be about 2^65 elementary invocations of MD5, i.e. a lot less than the 2^119 I was talking about; it would even be technologically feasible (with a budget counted in millions of dollars -- but not billions). See at the end for a description of the new method.
Original answer:
Let's assume that MD5 is a "perfect" hash function which can be modeled as a random oracle. A random oracle is a function for which you know nothing of the output for a given input before trying it once. For a random oracle, the best method to achieve what you are looking for is hope: you try random input messages until you find one which contains its own hash.
The question is then: what size of input messages should you use? MD5 processes data by adding some bits of padding (at least 65, at most 576) so that the length is a multiple of 512; then data is split into 512-bit blocks. The cost of hashing a message is directly proportional to the number of such blocks. I.e. for a n-bit message, the cost is ceil((n+65)/512). A n-bit message, on the other hand, offers n-127 subsequences of 128 bits. Longer messages make it more probable to succeed at each message (in a linear way) but cost more to process (linearly too). So message length is mostly neutral, except that the overhead implied by the padding is larger when using short messages.
Overall, with large enough random messages (e.g. 8 kB), you will find a message which contains its own MD5 in average cost about 2^119 MD5 elementary evaluation. An elementary evaluation of MD5 uses a few hundred clock cycles on a recent CPU, and 2^119 is totally unachievable with today's technology (and tomorrow's technology, too).
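The arithmetic behind that 119 can be checked with a few lines. This is only a sketch of the cost model described above (random-oracle assumption, ceil((n+65)/512) blocks per message, n-127 candidate windows per message); the function names are made up, and "8 kB" is assumed to mean 8192 bytes.

```python
import math

def md5_blocks(n_bits: int) -> int:
    """Number of 512-bit blocks hashed for an n-bit message: ceil((n+65)/512)."""
    return (n_bits + 65 + 511) // 512

def windows(n_bits: int) -> int:
    """Number of 128-bit subsequences in an n-bit message: n - 127."""
    return n_bits - 127

def log2_expected_cost(n_bits: int) -> float:
    """log2 of the expected number of elementary MD5 evaluations.

    Each window matches with probability 2^-128, so about 2^128 / windows
    messages are needed, each costing md5_blocks(n_bits) evaluations.
    """
    return 128 + math.log2(md5_blocks(n_bits)) - math.log2(windows(n_bits))

n = 8192 * 8  # 8 kB message, in bits (assumed to mean 8192 bytes)
print(md5_blocks(n), windows(n))        # 129 65409
print(round(log2_expected_cost(n), 2))  # about 119
print(log2_expected_cost(128))          # 128.0: a bare 16-byte message is the worst case
```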
(The "big file with all 128-bit sequence" that Graham Lee is talking about is just a special case of this generic method, with a single very large message.)
Now MD5 is widely known to not be a random oracle -- if only because collisions on MD5 can be computed efficiently, something which is not possible with a random oracle. So it is conceivable that shortcuts exploiting weaknesses in MD5 structure exist. However, I am not aware of any attack leading to a message containing its own MD5; this looks like a problem close to preimage resistance, something which is viewed as substantially more difficult than collisions.
New method:
MD5, like most (if not all) hash functions, is streamed: when it processes a long input, it does so in one pass, keeping a small fixed-size running state. For MD5 specifically, the running state has size 128 bits (16 bytes), and data is processed in chunks of 512 bits (64 bytes). An important consequence is the following: if you have inputs m and m||x ("||" denotes concatenation), and you want to compute both MD5(m) and MD5(m||x), then the extra cost needed to compute the second one is proportional to the size of x, but NOT to the size of m. In other words, if you have a 1 gigabyte input m, compute MD5(m), and then want to compute the MD5 of m followed by a 20-byte trailer x, then that second MD5 can reuse much of the work done for the first one, and will be almost free. This leads to the following algorithm for finding a message m that contains its own MD5:
Finding the right "x" value at each step can be done by using a De Bruijn sequence. Use B(2, 128) as the base sequence if each x is a single bit. If you want a byte-oriented solution (the message m must consist of an integral number of bytes, and MD5(m) must appear within m at a byte boundary), then use B(256, 16).
To compute the average number of iterations needed to find a hit, consider that at iteration n, the message m contains n distinct subsequences of 128 bits (or 16 bytes), so the total accumulated number of comparisons will be n(n+1)/2. Assuming MD5 to be a random oracle, then each comparison has probability 2^-128 of being a hit, so n will have, on average, to be such that n(n+1)/2 = 2^128 -- which translates to n = 2^64.5 iterations.
However, each iteration involves computing a MD5(m||x) where x is very small (one bit or one byte), and MD5(m) has been computed; this will usually require only one extra elementary MD5 computation (processing of a single 64-byte block). (If x are bits then only one iteration in 512 will require processing two blocks; if x are bytes then this becomes one iteration in 64.)
Either way, the hard part will be the lookup. Getting all subsequences in an index suitably sorted for fast lookup will require an awful lot of fast RAM, which would probably be way more expensive than computing the 2^64.5 MD5. However, some De Bruijn sequences allow for a fast, storage-free decoding.
Therefore, with this algorithm, we can find a message m that contains its own MD5 for a cost close to 2^65 computations of MD5. The resulting message will have length about 3.3*10^18 bytes, i.e. about one million modern hard disks (eight times as much if we want a byte-oriented solution).
It may be noted that the algorithm can be started with an arbitrary message m, of any size. That starting point will appear at the start of the self-MD5 file that the algorithm produces.
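For readers who want to check the figures, here is a short calculation of the iteration count and message length from n(n+1)/2 = 2^128. It assumes one bit appended per iteration for the first size and one byte per iteration for the second; the variable names are made up.

```python
import math

TARGET = 2**128  # expected number of comparisons before a hit (random-oracle assumption)

# Solve n(n+1)/2 = 2^128; for large n this is n ~= sqrt(2 * 2^128).
n = math.isqrt(2 * TARGET)

print(math.log2(n))              # about 64.5
print(n * (n + 1) // 2 / TARGET) # about 1.0, so n does satisfy the equation

# One bit appended per iteration: the message is n bits long.
bit_oriented_bytes = n / 8
print(f"{bit_oriented_bytes:.2e}")   # about 3.3e18 bytes

# One byte appended per iteration: eight times as long.
byte_oriented_bytes = float(n)
print(byte_oriented_bytes / bit_oriented_bytes)  # 8.0
```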
(In my original answer, the mistake was in this sentence: "Longer messages make it more probable to succeed at each message (in a linear way) but cost more to process (linearly too)." As explained above, longer messages can still be processed very efficiently as long as we generate them by reusing a common prefix, as in my new algorithm.) |
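To make the prefix-reuse idea concrete, here is a toy sketch, not the exact procedure from the answer. It makes two loud simplifications: the digest is truncated to 2 bytes so that a hit arrives in a fraction of a second, and the appended bytes come from a seeded pseudo-random generator instead of a De Bruijn sequence. `find_self_hashing`, `K` and the seed are hypothetical names and values.

```python
import hashlib
import random

K = 2  # toy digest length in bytes (assumption; the real problem needs all 16)

def find_self_hashing(prefix: bytes = b"", max_steps: int = 200_000, seed: int = 0):
    """Grow m one byte at a time until m contains its own truncated MD5.

    Returns (m, digest) on success, or None if max_steps is exhausted.
    """
    rng = random.Random(seed)        # stand-in for a De Bruijn sequence (simplification)
    m = bytearray(prefix)
    state = hashlib.md5(bytes(m))    # running MD5 state for the current m
    for _ in range(max_steps):
        digest = state.digest()[:K]  # digest() leaves the state usable for more updates
        if digest in m:              # plain substring search; fine at toy scale
            return bytes(m), digest
        x = bytes([rng.randrange(256)])
        m += x
        state.update(x)              # hashes only the new byte, not all of m again
    return None

result = find_self_hashing(b"hello")
if result is not None:
    m, digest = result
    print(len(m), digest.hex())
    print(hashlib.md5(m).digest()[:K] == digest)  # True: recomputed from scratch
```

With a 2-byte digest a hit should typically arrive after a few hundred appended bytes, in line with the square-root behaviour described above. The substring search is the part that does not scale, which is the lookup problem the answer mentions.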
||||
(Copying my original comment as answer:) You'd be better off creating one section of the file for the md5 / hash and a separate section for the content. On the other hand, since anyone can recreate the hash part, what security value would you get from this?
|||||
|
You could do it through use of an Alternate Data Stream, though the information may not transfer properly between certain file systems or OSes. Certain applications may also handle (or not handle) these differently. In short, Alternate Data Streams are a form of metadata attached to files in some file systems (NTFS is one) which does not appear readily when viewing a directory's contents. Even with the system set to show "hidden files" and those ever-critical "protected operating system files" you still will not see an ADS "file" in most file managers. Additionally, the "host" file itself will not appear changed at all. The file's size will remain the same, and even the MD5 hash (or any other, for that matter) will be the same. You could even conceivably store an ADS "file" that is larger than its host file, although of course you cannot store one so large that it goes beyond the physical capacity of your drive. In Windows systems with NTFS, ADS files are most easily accessed via the command line. So, for File1.ext, if you want to store the MD5 hash in an ADS, do the following:
Again, ADSes are handled differently by different OSes and file systems, so they're not likely to traverse the Internet (or even some LANs or sneakernets) very well. But it is a way of doing what you seem to want to do. For further details, instructions, or utilities, consult Google.
|||||
|
What you are asking about is the existence of a fixpoint of the composition of two functions: the [...] As the Wikipedia article says, not all functions have a fixpoint. For some trivial functions in [...] Let's for now restrict our set [...] It is also easy to see that, if [...] Now take a look at this Stack Overflow question: Is there an MD5 Fixed Point where md5(x) == x?. In particular, take a look at Adam Rosenfield's answer. In it, we can see that there is a 63.21% probability of [...] The same argument used in that answer can be applied to [...] It is easy to see, as mentioned in that answer, that the same argument will also apply to any file which depends on enough of its md5 output. For files which depend on only a couple of bits of the md5 output, or do not depend on it at all (including the one in @Graham Lee's answer, which depends on 0 bits of the md5 output), the answer will be different.
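The 63.21% figure is numerically 1 - 1/e. Assuming it refers to the chance that a uniformly random function on N values has at least one fixed point (each of the N inputs maps to itself with probability 1/N), it can be reproduced as below; `p_fixed_point` is a name made up for this sketch.

```python
import math

def p_fixed_point(n: int) -> float:
    """Probability that a uniformly random function on n values has a fixed point.

    Equals 1 - (1 - 1/n)**n, computed with log1p/expm1 so that it stays
    accurate when n is as large as 2**128.
    """
    return -math.expm1(n * math.log1p(-1.0 / n))

print(p_fixed_point(10))      # small sets are already close to the limit
print(p_fixed_point(2**128))  # about 0.6321
print(1 - 1 / math.e)         # the limit, 1 - 1/e
```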
|||||||||||||||||
|
In the general case, no, since adding the MD5 sum would modify the file itself and thus its MD5 sum, most of the time... However, for specifically crafted files, it could be possible, using a collision attack. An example of a collision attack, where two PostScript files are designed to have the same MD5 sum, is described here (there are paper references too): http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/
You might be able to use the same approach to generate a second file that would contain the original content, its MD5 sum, and some extra content to make the collision.
|||||||||
|
You could have something like a wrapper format that has the MD5 in one part of the file and the real content in another part of it. This would be useless, because if the attacker can change the content then he can also change the MD5 to match the new content.
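A minimal sketch of such a wrapper, assuming a made-up layout of 16 digest bytes followed by the content (`wrap` and `verify` are hypothetical names). It also shows why the check only catches accidental corruption:

```python
import hashlib

DIGEST_LEN = 16  # MD5 digests are 16 bytes

def wrap(content: bytes) -> bytes:
    """Hypothetical layout: MD5 of the content, followed by the content itself."""
    return hashlib.md5(content).digest() + content

def verify(blob: bytes) -> bool:
    """Check that the stored digest matches the content section."""
    stored, content = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    return hashlib.md5(content).digest() == stored

original = wrap(b"real content")
corrupted = original[:DIGEST_LEN] + b"real c0ntent"  # content changed, digest left alone
forged = wrap(b"attacker content")                   # attacker recomputes the digest

print(verify(original))   # True
print(verify(corrupted))  # False: accidental damage is detected
print(verify(forged))     # True: a deliberate change is not
```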
|||||||||||||||||
|
From my understanding of hashing, this is totally impossible. In order to calculate the hash, the entirety of the data is ingested and used to generate the hash; the hash would then have to be appended to the file, thus changing the data and also the hash. This might be of some use to you: http://www.forensicswiki.org/wiki/Hashing
|||
Thinking a bit outside of the box, you could code a variable to store the md5 of a file (the exact file). While the actual md5 of the file in plain text would not be included in the file itself, if the file were runnable code it could be programmed to store the value of its own md5 in a variable. To further enhance such an idea (in order to give it some sort of usable value), you could store the md5 in a separate file (created upon completion) and secure that second file in such a way that only the first file contains any reasonable method of accessing it and comparing it against the previously calculated md5 variable. The actual usefulness of such an idea is probably limited to checking that the file itself has not been altered and alerting to a security breach that has already happened, rather than stopping one from happening in the first place.
|||
If you do not need the bits to be in sequence, then sure, this is pretty easy. This file will work:
(File is shown in binary.) The md5sum of it is
The md5sum is inside the file (although not in sequence). This is the same file as above, with the md5sum shown as it is, while the other contents are replaced with
Of course, this file would work too:
Its md5sum is
And here is the md5sum as seen within the file:
|
|||||||||
|
protected by Community♦ Oct 9 '15 at 12:39