binary grep to find mp3s stored in zip

Once upon a time, whilst sorting my mp3 collection, I ran across a little problem. The tags which I had looked up and saved into the files' id3 tags using Quod Libet (my favourite tag editor, better than the related Ex Falso incidentally) were not showing up in Rhythmbox (my favourite actual audio player). Instead, Rhythmbox was seeing the OLD tags.

This is curious, thinks I. And I go off in search of answers…

So I examined the file directly (using less), and eventually noticed it had an interesting structure. It was in fact an mp3 inside a zip file (using the STORE method), with the NEW id3 tags surrounding the zip! End result: ‘file’ sees the id3 tags and assumes mp3 data. Quod Libet was writing the outside layer of id3 data, but Rhythmbox was reading the inside (true mp3) id3 tags. (Note that, on examining backups, the zip data already had the false id3 tags before Quod Libet ever saw it.)

Anyway, fixing was easy. First strip the outside id3 tags using id3v2 (or similar). Then extract the mp3 from the zip file (which in this case had been renamed with an .mp3 file extension), and finally re-tag using the original method.
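For anyone who would rather script the extraction step, here is a minimal Python sketch. It assumes Python's standard zipfile module copes with the junk in front of the archive (it finds the archive from the directory at the end of the file, much as self-extracting zips are read). The function name and the choice to take the first member ending in .mp3 are mine, purely for illustration. The outer id3 tags simply get left behind, because the bytes returned come from inside the zip.

```python
import io
import zipfile


def extract_inner_mp3(blob):
    """Return the bytes of the first .mp3 member of a zip buried inside blob.

    blob is the whole content of the suspicious file, outer id3 tags included.
    """
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".mp3"):
                return archive.read(name)
    raise ValueError("no .mp3 member found in the embedded zip")
```

Read the wrapped file in binary mode, pass its contents to extract_inner_mp3, and write the result out under a new name. Keep the original until you have listened to the result.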

I note that these files had not previously been caught because the zip file stored them using the ‘STORE’ method, i.e. uncompressed. All music players tested (Quod Libet, Rhythmbox, mplayer, mpg321) played the file fine, effectively ignoring the multiple leading (and trailing) id3 tags and PKZIP headers and footers. They just found mp3 data directly as expected and played it. And no, I can’t tell you now which players saw which set of id3 data. I didn’t test this.

Ok, that’s all well and good, but what about the rest of the music collection? Could this have happened elsewhere?

Sure could! And the solution is simple. Grep for the PKZIP file header in EVERY audio file you have.

This is the core of this post, however: as it turns out, none of the standard GNU command-line grep variants (grep, egrep, rgrep, etc) has a straightforward way to grep for an arbitrary binary string given in hex.

commandline FAIL!

Seriously. I was surprised that what seemed like such an obvious thing was unavailable using standard tools!

Fortunately, like all good problems, someone had already scratched this one’s itch, and I quickly found bgrep. (Here is the link again and in the clear, because this is the central point of this blog: http://debugmo.de/?p=100 )

It’s small, simple, compiled in an instant, and did exactly what I wanted: which in this case was to grep for the hex string 504B0304, that being the four-byte signature that starts a PKZIP local file header.
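If you can't build bgrep, the same search is easy to sketch in Python. This is not bgrep's implementation, just a plain byte search; the names ZIP_LOCAL_HEADER and find_signature are made up for this example.

```python
# The hex string 504B0304 as raw bytes: "PK" followed by 0x03 0x04.
ZIP_LOCAL_HEADER = bytes.fromhex("504B0304")


def find_signature(data, signature=ZIP_LOCAL_HEADER):
    """Return every offset at which signature occurs in data."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        # Resume one byte further on so that later matches are found too.
        pos = data.find(signature, pos + 1)
    return offsets
```

Call it on the contents of a file opened in binary mode. Formatting each offset with format(offset, "08x") gives eight-digit hex offsets in the style used in this post.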

So let’s get to the core of it. You want to binary grep your entire music collection for a PKZIP file header, so you’ll want to run something like this (assuming $PWD is your music library directory):

find . -type f -print0 | xargs -n1 -0 bgrep 504B0304

You’ll want to capture the output for sorting, as for each match it gives the matching filename and the hex offset of the match (i.e. what you’d expect, given the context of the original grep).
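As a rough idea of what "capture and sort" can look like, here is a Python sketch that walks a directory, records the first hit in each file, and sorts by offset. The output format is my own invention, not bgrep's, and scan_tree and format_hits are hypothetical names. It reads each file fully into memory, which I am assuming is acceptable for mp3-sized files.

```python
import os

ZIP_LOCAL_HEADER = bytes.fromhex("504B0304")


def scan_tree(root, signature=ZIP_LOCAL_HEADER):
    """Return (offset, path) pairs for the first hit in each file, sorted by offset."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Whole-file read: simple, and fine for files of mp3 size.
            with open(path, "rb") as handle:
                offset = handle.read().find(signature)
            if offset != -1:
                hits.append((offset, path))
    return sorted(hits)


def format_hits(hits):
    """Render hits as lines of eight-digit hex offset followed by the path."""
    return ["{:08x} {}".format(offset, path) for offset, path in hits]
```

Sorting by offset puts the true zip files (offset zero) first, followed by the low-offset hits that deserve a closer look.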

A true zip file will have an offset of 00000000, whilst a zip file with an unwanted id3 header in front will have a small non-zero offset (in my files, approximately in the range 00000500 to 00000700).

Remember, this is only a four-byte binary string, and it is quite likely to occur elsewhere in your data by chance, given a large enough data set. Any hits from this search will still require individual evaluation; genuine cases can then be cleaned up using the steps noted previously.
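One way to speed up that evaluation is to check whether the bytes after a hit look like the rest of a PKZIP local file header, which is 30 bytes long and followed by the file name. The Python sketch below does that. Treating only compression methods 0 (STORE) and 8 (DEFLATE) as plausible is my own simplifying assumption, since other methods exist, and describe_hit is a made-up name.

```python
import struct

# Local file header layout: signature, version, flags, method, time, date,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes, little-endian


def describe_hit(data, offset):
    """Parse a candidate local file header at offset; return None if implausible."""
    chunk = data[offset:offset + LOCAL_HEADER.size]
    if len(chunk) < LOCAL_HEADER.size:
        return None
    (sig, _version, _flags, method, _time, _date,
     _crc, _csize, _usize, name_len, _extra_len) = LOCAL_HEADER.unpack(chunk)
    # Assumption: only STORE (0) and DEFLATE (8) are treated as plausible.
    if sig != b"PK\x03\x04" or method not in (0, 8):
        return None
    name_start = offset + LOCAL_HEADER.size
    name = data[name_start:name_start + name_len]
    if name_len == 0 or len(name) < name_len:
        return None
    return {"method": "STORE" if method == 0 else "DEFLATE",
            "name": name.decode("cp437", errors="replace")}
```

A hit that comes back as STORE with a member name ending in .mp3 is a strong candidate for the problem described in this post. A result of None suggests the four bytes turned up by chance.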

Happy musics everyone!

<edit>
tmbinc – the author of bgrep – pointed out to me that bgrep is recursive by default anyway, so in fact the only command needed from your music directory is:

bgrep 504B0304 *
=)
</edit>
