There are a few easy approaches to the problem I can think of. The one described here is a partial decode.

Each base64 character encodes 6 bits of input, so you can relate them as follows:

Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:   xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee

Say you would like to extract 4 bytes of data, starting with offset 1, like this:

Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH
Data:           yyyyyyyyzzzzzzzzqqqqqqqqwwwwwwww

To decode only the parts that you want, you need to know the bit distances. They are easy to calculate: just multiply your byte distances by 8. Here that gives 32 bits, starting with bit 8.

Now that you know you want 32 bits starting with bit 8, you can find which base64 characters contain your starting and ending bits. To do that, divmod the bit positions of offset and offset+length by 6:

start = bit 8 = char 1 + bit 2
end = bit 40 = char 6 + bit 4

This maps to the scheme above: your span starts after 1 full base64 char and 2 bits, and ends after 6 full base64 chars and 4 bits.

Once you know the exact base64 chars you want, you need to decode them. It makes sense to leverage an existing base64 decoder, so you won't need to deal with the base64 encoding yourself. To do that, you should know that each 4 chars of base64 code correspond to 3 bytes of data.

So, here goes the trick: you can prepend and append gibberish to your extracted base64 code until the base64 and byte boundaries align, and, knowing how much invalid output the base64 decoder will produce, throw out the excess.

How much to prepend depends on the value of the start bit remainder. If the start bit remainder is 0, it means that A and x are aligned, so no changes are required. If the bit remainder is 2, you need to prepend one base64 char and throw out one leading byte after decoding.

For file-type detection, Python's imghdr module for images has an equivalent for audio, sndhdr, and there is also the python-magic project, which lets you pass in a number of bytes to determine a file type.
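The partial-decode trick described above (pad with filler until the base64 and byte boundaries align, decode, discard the excess) might look roughly like this in Python. This is only a sketch: `decode_span` and `FILLER` are hypothetical names, the filler character `A` is an arbitrary choice, and the code assumes standard base64 with no line breaks and a span that lies inside the encoded data. The text only spells out remainders 0 and 2; the remainder-4 case (prepend two chars, drop two bytes) is inferred from the same alignment argument.

```python
import base64

FILLER = "A"  # any valid base64 character works; its decoded bits are discarded


def decode_span(b64_text: str, offset: int, length: int) -> bytes:
    """Return data[offset:offset + length], decoding only the characters needed.

    Hypothetical helper: assumes b64_text is standard base64 without line
    breaks and that the requested span lies inside the encoded data.
    """
    # Byte distances -> bit distances -> (full base64 chars, leftover bits).
    start_char, start_rem = divmod(offset * 8, 6)
    end_char, end_rem = divmod((offset + length) * 8, 6)
    if end_rem:
        end_char += 1  # the span ends partway through this character

    chunk = b64_text[start_char:end_char]

    # Remainder 0 -> prepend 0 chars, 2 -> 1 char, 4 -> 2 chars (inferred);
    # the same number of leading decoded bytes is garbage.
    skip = start_rem // 2
    chunk = FILLER * skip + chunk
    # Append filler until the length is a multiple of 4 (whole 3-byte groups).
    chunk += FILLER * (-len(chunk) % 4)

    decoded = base64.b64decode(chunk)
    return decoded[skip:skip + length]


# The example from the text: 4 bytes starting at offset 1.
encoded = base64.b64encode(b"xyzqwe").decode("ascii")
print(decode_span(encoded, 1, 4))  # b'yzqw'
```

Letting the stock decoder do the work keeps the bit arithmetic down to two `divmod` calls; the cost is decoding at most a few bytes more than requested.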
You can't identify a file type from base64 text without decoding at least part of it, because the bytes that help identify the filetype are spread across the base64 characters, which don't directly align with whole bytes. Each character encodes 6 bits, which means that for every 4 characters, there are 3 bytes encoded.

Identifying a filetype requires access to those bytes in different block sizes. A JPEG image, for example, can be identified from its first two bytes, FF D8 (the file ends with FF D9), but those two bytes share a 4-character block with the third byte that follows, so that byte has to be decoded as well.

What you can do is decode just enough of the base64 string to do your filetype fingerprinting. So you can decode the first 4 characters to get 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by the first 8 bytes, a GIF by the first 6, etc.). Decoding just those bytes from the base64 string is trivial.

Your sample is a PNG image; you can test for image types using the imghdr module (deprecated in Python 3.11 and removed in 3.13):

>>> import base64, imghdr
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> sample = base64.b64decode(image_data[:44])  # 33 bytes / 3 times 4 is 44 base64 chars
>>> imghdr.what(None, h=sample)
'png'

I only used the first 33 bytes from the base64 data, to echo what the imghdr.what() function will read from the file you pass it (it reads 32 bytes, but that number doesn't divide by 3).
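Since imghdr is not available on newer Python versions, the same fingerprinting idea can be sketched with nothing but the base64 module. The names `SIGNATURES` and `sniff_type` are made up for this illustration, and the table only covers the three formats mentioned above; it assumes the input has at least 12 base64 characters (or is a complete, padded base64 string).

```python
import base64

# Leading "magic" bytes for the formats mentioned in the text:
# PNG uses its first 8 bytes, GIF its first 6, JPEG its first 2 (FF D8).
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "jpeg": (b"\xff\xd8",),
}


def sniff_type(b64_text, n_chars=12):
    """Guess a file type from the first n_chars base64 characters.

    n_chars should be a multiple of 4; 12 chars decode to 9 bytes, which is
    enough for the longest signature in the table above.
    """
    head = base64.b64decode(b64_text[:n_chars])
    for name, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return name
    return None


# First 44 characters of the sample image from the text.
sample_b64 = "iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbybl"
print(sniff_type(sample_b64))  # png
```

Because the signatures all sit at offset 0, no filler is needed here: slicing a multiple of 4 characters from the start already lines up with whole bytes.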