author | Thomas Voss <mail@thomasvoss.com> | 2023-08-15 14:57:32 +0200 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2023-08-15 14:57:32 +0200 |
commit | d5635e946e9df6f519ec8cf08cebfc35dbe6c788 (patch) | |
tree | 46893cffdf23a2b15f8b7839c69d5df2bcbb8bca /src/prj/mmv/index.html | |
parent | cfa35dcb2d332977e80a5811b6d42e9949bd4814 (diff) |
Add a post on ‘mmv’
Diffstat (limited to 'src/prj/mmv/index.html')
-rw-r--r-- | src/prj/mmv/index.html | 658 |
1 files changed, 658 insertions, 0 deletions
diff --git a/src/prj/mmv/index.html b/src/prj/mmv/index.html new file mode 100644 index 0000000..d13f7c8 --- /dev/null +++ b/src/prj/mmv/index.html @@ -0,0 +1,658 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + m4_include(head.html) + </head> + <body> + <header> + <div> + <h1>Moving Files the Right Way</h1> + m4_include(nav.html) + </div> + + <figure class="quote"> + <blockquote> + <p>I think the OpenBSD crowd is a bunch of masturbating + monkeys, in that they make such a big deal about + concentrating on security to the point where they pretty much + admit that nothing else matters to them.</p> + </blockquote> + <figcaption> + Linus Torvalds + </figcaption> + </figure> + </header> + + <main> + <p> + <em> + You can find the <code>mmv</code> git repository over at + <a href="https://git.sr.ht/~mango/mmv" + target="_blank">sourcehut</a> + or <a href="https://github.com/Mango0x45/mmv">GitHub</a>. + </em> + </p> + + <h2>Table of Contents</h2> + + <ul> + <li><a href="#prologue">Prologue</a></li> + <li><a href="#moving">Advanced Moving and Pitfalls</a></li> + <li><a href="#mapping">Name Mapping with <code>mmv</code></a></li> + <li><a href="#newlines">Filenames with Embedded Newlines</a></li> + <ul> + <li><a href="#0-flag">The Simple Case</a></li> + <li><a href="#e-flag">Encoding Newlines</a></li> + </ul> + <li><a href="#i-flag">Individual Execution</a></li> + <li><a href="#safety">Safety</a></li> + <li><a href="#examples">Examples</a></li> + </ul> + + <h2 id="prologue">Prologue</h2> + <p> + File moving and renaming is one of the most common tasks we + undertake on the command-line. We basically always do this with + the <code>mv</code> utility, and it gets the job done most of the + time. Want to rename one file? Use <code>mv</code>! Want to + move a bunch of files into a directory? Use <code>mv</code>! + How could <code>mv</code> ever go wrong? Well I’m glad you asked! + </p> + + <h2 id="moving">Advanced Moving and Pitfalls</h2> + <p> + Let’s start off nice and simple. 
You just inherited a C project + that uses the sacrilegious + <a + href="https://en.wikipedia.org/wiki/Camel_case" + target="_blank" + >camelCase</a> + naming convention for its files: + </p> + + <figure> + <pre>m4_fmt_code(ls-files.sh.html)</pre> + </figure> + + <p> + This deeply upsets you, as it upsets me. So you decide you want + to switch all these files to use + <a + href="https://en.wikipedia.org/wiki/Snake_case" + target="_blank" + >snake_case</a>, + like a normal person. Well how would you do this? You use + <code>mv</code>! This is what you might end up doing: + </p> + + <figure> + <pre>m4_fmt_code(manual-mv.sh.html)</pre> + </figure> + + <p> + Well… it works I guess, but it’s a pretty shitty way of renaming + these files. Luckily we only had 5, but what if this was a much + larger project with many more files to rename? Things would get + tedious. So instead we can use a pipeline for + this: + </p> + + <figure> + <pre>m4_fmt_code(camel-to-snake-naïve.sh.html)</pre> + </figure> + + <aside> + <p> + The given example assumes your <code>sed</code> + implementation supports ‘<code>\L</code>’ which is a + non-standard <abbr class="gnu">GNU</abbr> extension. + </p> + </aside> + + <p> + That works and it gets the job done, but it’s not really ideal is + it? There are a couple of issues with this. + </p> + + <ol> + <li> + <p> + You’re writing more complicated code. This has the + obvious drawback of potentially being more error-prone, + but also risks taking more time to write than you’d like + as you might have forgotten if <code>xargs</code> + actually has an ‘<code>-L</code>’ option or not (which + would require reading the + <a href="https://www.man7.org/linux/man-pages/man1/xargs.1.html" + target="_blank" ><code>xargs(1)</code></a> manual). + </p> + </li> + <li> + <p> + If you try to rename the file <em>foo</em> + to <em>bar</em> but <em>bar</em> already exists, you end + up deleting a file you may not have wanted to. 
+ </p> + </li> + <li> + <p> + In a similar vein to the previous point, you need to be + very careful about schemes like renaming the + file <em>a</em> to <em>b</em> and <em>b</em> + to <em>c</em>. You run the risk of turning <em>a</em> + into <em>c</em> and losing the file <em>b</em> entirely. + </p> + </li> + <li> + <p> + Moving symbolic links is its own whole can of worms. If + a symlink points to a relative location then you need to + make sure it keeps pointing to the right place. If the + symlink is absolute however then you can leave it + untouched. But what if the symlink points to a file + that you’re moving as part of your batch move operation? + Now you need to handle that too. + </p> + </li> + </ol> + + <h2 id="mapping">Name Mapping with <code>mmv</code></h2> + + <p> + What is <code>mmv</code>? It’s the solution to all your + problems, that’s what it is! <code>mmv</code> takes as its + argument(s) a utility and that utility’s arguments and uses them + to create a mapping between old and new filenames, similar to + the <code>map()</code> function found in many programming + languages. I think to best convey how the tool functions, I + should provide an example. Let’s try to do the same thing we did + previously where we tried to turn camelCase files to snake_case, + but using <code>mmv</code>: + </p> + + <figure> + <pre>m4_fmt_code(camel-to-snake-smart.sh.html)</pre> + </figure> + + <p>Let me break down how this works.</p> + + <p> + <code>mmv</code> starts by reading a series of filenames + separated by newlines from the standard input. Yes, sometimes + filenames have newlines in them and yes there is a way to handle + them but I shall get to that later. The filenames that + <code>mmv</code> reads from the standard input will be referred + to as the <em>input files</em>. 
Once all the input files have + been read, the utility specified by the arguments is spawned; in + this case that would be <code>sed</code> with the argument + <code>'s/[A-Z]/\L_&/g'</code>. The input files are then piped + into <code>sed</code> the exact same way that they would have + been if we ran the above commands without <code>mmv</code>, and + the output of <code>sed</code> then forms what will be referred + to as the <em>output files</em>. Once a complete list of output + files is accumulated, each input file gets renamed to its + corresponding output file. + </p> + + <p> + Let’s look at a simpler example. Say we want to rename 2 files + in the current directory to use lowercase letters, we could use + the following command: + </p> + + <figure> + <pre>m4_fmt_code(mmv-tr.sh.html)</pre> + </figure> + + <p> + In the above example <code>mmv</code> reads 2 lines from + standard input, those being <em>LICENSE</em> + and <em>README</em>. Those are our 2 input files now. + The <code>tr</code> utility is then spawned and the input files + are piped into it. We can simulate this in the shell: + </p> + + <figure> + <pre>m4_fmt_code(tr.sh.html)</pre> + </figure> + + <p> + As you can see above, <code>tr</code> has produced 2 lines of + output; these are our 2 output files. Since we now have our 2 + input files and 2 output files, <code>mmv</code> can go ahead + and rename the files. In this case it will rename + <em>LICENSE</em> to <em>license</em> and + <em>README</em> to <em>readme</em>. For some examples, check + the <a href="#examples">examples</a> section of this page down + below. + </p> + + <h2 id="newlines">Filenames with Embedded Newlines</h2> + + <p> + People are retarded, and as a result we have filenames with + newlines in them. All it would have taken to solve this issue + for everyone was for literally <strong>anybody</strong> during + the early UNIX days to go “<em>hey, this is a bad idea!</em>”, + but alas, we must deal with this. 
Newlines are of course not + the only special characters filenames can contain, but they are + the single most infuriating to deal with; the UNIX utilities all + being line-oriented really doesn’t work well with these files. + </p> + + <p> + So how does <code>mmv</code> deal with special characters, and + newlines in particular? Well it does so by providing the user + with the <code>-0</code> and <code>-e</code> flags: + </p> + + <dl> + <dt><code>-0</code></dt> + <dd> + <p> + Tell <code>mmv</code> to expect its input to not be + separated by newlines (‘<code>\n</code>’), but by NUL + bytes (‘<code>\0</code>’). NUL bytes are the only + characters not allowed in filenames besides forward + slashes, so they are an obvious choice for an + alternative separator. + </p> + </dd> + <dt><code>-e</code></dt> + <dd> + <p> + Encode newlines in filenames before passing them to the + provided utility. Newline characters are replaced by the + literal string ‘<code>\n</code>’ and backslashes by the + literal string ‘<code>\\</code>’. After processing, the + resulting output is decoded again. + </p> + <p> + If combined with the <code>-0</code> flag, then while + input will be read assuming a NUL-byte input-separator, + the encoded input files will be written to the spawned + process newline-separated. + </p> + </dd> + </dl> + + <h3 id="0-flag">The Simple Case</h3> + + <p> + In order to better understand these flags and how they work + let’s go through another example. We have 2 files, one with and + one without an embedded newline, and our goal is to simply + reverse these filenames. In this example I am going to be + displaying newlines in filenames with the “<code>$'\n'</code>” + syntax as this is how my shell displays embedded newlines. 
+ </p> + + <p> + We can start by just trying to naïvely pass these 2 files + to <code>mmv</code> and use <code>rev</code> to reverse the + names, but this doesn’t work: + </p> + + <figure> + <pre>m4_fmt_code(mmv-rev.sh.html)</pre> + </figure> + + <p> + This doesn’t work because, due to the line-oriented + nature of <code>ls</code> and <code>rev</code>, we are actually + trying to rename the files <em>foo</em>, <em>bar</em>, and + <em>baz</em> to the new filenames <em>zab</em>, + <em>rab</em>, and <em>oof</em>. As can be seen in the following + diagram, the embedded newline is causing our input to be ambiguous + and <code>mmv</code> can’t reliably proceed + anymore <x-ref>1</x-ref>: + </p> + + <figure> + <object data="conflict.svg" type="image/svg+xml"></object> + </figure> + + <aside> + <p data-ref="1"> + The reason you get a cryptic “file not found” error message + is because <code>mmv</code> tries to assert that all the + input files actually exist before doing anything. Since + “foo” isn’t a real file, we error out. + </p> + </aside> + + <p> + The first thing we need to do in order to proceed is to pass + the <code>-0</code> flag to <code>mmv</code>. This will + tell <code>mmv</code> that we want to use the NUL-byte as our + input separator and not the newline. We also need <code>ls</code> + to actually provide us with the filenames delimited by NUL-bytes. + Luckily <abbr class="gnu">GNU</abbr> <code>ls</code> gives us the + <code>--zero</code> flag to do just that: + </p> + + <figure> + <pre>m4_fmt_code(mmv-rev-zero.sh.html)</pre> + </figure> + + <p> + So we’re getting places, but we aren’t quite there yet. The + issue we’re getting now is that <code>mmv</code> received 2 + input files from the standard input, but <code>rev</code> + produced 3 output files. Why is that? 
Well let’s try our hand + at a little bit of command-line debugging with <code>sed</code>: + </p> + + <figure> + <pre>m4_fmt_code(sed-debugging.sh.html)</pre> + </figure> + + <p> + If you aren’t quite sure what the above is doing, here’s a quick + summary: + </p> + + <ul> + <li> + The <code>-U</code> flag given to <code>ls</code> tells it + not to sort our output. This is purely to keep this + example clear to the reader. + </li> + <li> + The <code>-n</code> flag given to <code>sed</code> tells it + not to print the input line automatically at the end of the + provided script. + </li> + <li> + The <code>l</code> command in <code>sed</code> prints the + current input in a “visually unambiguous form”. + </li> + </ul> + + <p> + In the <code>sed</code> output, we can see that <samp>$</samp> + represents the end of a line, and <samp>\000</samp> represents + the NUL-byte. All looks good here: we have two inputs separated + by NUL-bytes. Now let’s try to throw in <code>rev</code>: + </p> + + <figure> + <pre>m4_fmt_code(sed-debugging-rev.sh.html)</pre> + </figure> + + <p> + Well wouldn’t you know it? Since <code>rev</code> <em>also</em> + works with newline-separated input, it reversed our NUL-byte + separators and now gives us 3 outputs. Luckily the folks over + at <em>util-linux</em> provided us with the <code>-0</code> flag + here too, so that we can properly handle NUL-delimited input. + Combining all of this together we get a final working product: + </p> + + <figure> + <pre>m4_fmt_code(reverse-embedded-newline.sh.html)</pre> + </figure> + + <h3 id="e-flag">Encoding Newlines</h3> + + <p> + Sometimes we want to rename a bunch of files, but the command we + want to use doesn’t support NUL-bytes as nicely as we would + like. In these cases, you may want to consider encoding your + newline characters into the literal string ‘<code>\n</code>’ and + then passing your input newline-separated to your given command + with the <code>-e</code> flag. 
+ </p> + + <p> + For a real-world example, perhaps you want to edit some + filenames in vim, or whatever other editor you use. Well we can + do this incredibly easily with the <code>vipe</code> utility + from + the <a href="https://joeyh.name/code/moreutils/">moreutils</a> + collection. The <code>vipe</code> command simply reads input + from the standard input, opens it up in your editor, and then + prints the resulting output to the standard output; perfect + for <code>mmv</code>! We do not really want to deal with + NUL-bytes in our text-editor though, so let’s just encode our + newlines: + </p> + + <figure> + <pre>m4_fmt_code(vipe.sh.html)</pre> + </figure> + + <aside> + <p> + Notice how you still need to pass the <code>-0</code> flag + to let <code>mmv</code> know that our input files may have + embedded newlines. + </p> + </aside> + + <p> + When running the above code example, you will see the following + in your editor: + </p> + + <figure> + <pre>m4_fmt_code(vim.html)</pre> + </figure> + + <p> + After you exit your editor, <code>mmv</code> will decode all + occurrences of ‘<code>\n</code>’ back into a newline, and all + occurrences of ‘<code>\\</code>’ back into a backslash: + </p> + + <figure> + <object data="e-flag.svg" type="image/svg+xml"></object> + </figure> + + <h2 id="i-flag">Individual Execution</h2> + <p> + The previous examples are great and all, but what do you do if + your mapping command doesn’t have the concept of an input + separator at all? This is where the <code>-i</code> flag comes + into play. With the <code>-i</code> flag we can + get <code>mmv</code> to execute our mapping command for every + input filename. This means that as long as we can work with a + complete buffer, we don’t need to worry about separators. + </p> + + <p> + To be honest, I cannot really think of any situation where you + might actually need to do this. 
If you can think of one, + please <a href="mailto:mail@thomasvoss.com">email me</a> and + I’ll update the example on this page. Regardless, let’s imagine + that we wanted to rename some files so that their filenames are + replaced with their filename + <a href="https://en.wikipedia.org/wiki/SHA-1" target="_blank"> + SHA-1 hash</a>. + On Linux we have the <code>sha1sum</code> program which reads + input from the standard input and outputs the SHA-1 hash. This + is how we would use it with <code>mmv</code>: + </p> + + <figure> + <pre>m4_fmt_code(sha1sum-long-example.sh.html)</pre> + </figure> + + <p> + Another approach is to invoke <code>mmv</code> twice: + </p> + + <figure> + <pre>m4_fmt_code(sha1sum-short-example.sh.html)</pre> + </figure> + + <p> + If you are confused about why we need to make a call + to <code>awk</code>, it’s because the <code>sha1sum</code> + program outputs 2 columns of data. The first column is our hash + and the second column is the filename where the to-be-hashed + data was read from. We don’t want the second column. + </p> + + <p> + Unlike in previous examples where one process was spawned to map + all our filenames, with the <code>-i</code> flag we are spawning + a new instance for each filename. If you struggle to visualize + this, perhaps the following diagrams help: + </p> + + <figure> + <figcaption>Invoking <code>mmv</code> without <code>-i</code></figcaption> + <object data="without-i-flag.svg" type="image/svg+xml"></object> + </figure> + + <figure> + <figcaption>Invoking <code>mmv</code> with <code>-i</code></figcaption> + <object data="with-i-flag.svg" type="image/svg+xml"></object> + </figure> + + <h2 id="safety">Safety</h2> + <p> + When compared to the standard <code>for f in *; do mv $f …; + done</code> or <code>ls | … | xargs -L2 mv</code> + constructs, <code>mmv</code> is significantly more safe to use. 
+ These are some of the safety features that are built into the + tool: + </p> + + <ol> + <li> + If the number of input- and output files differs, execution + is aborted before making any changes. + </li> + <li> + If an input file is renamed to the name of another input + file, the second input file is not lost (i.e. you can rename + <em>a</em> to <em>b</em> and <em>b</em> to <em>a</em> with + no problem). + </li> + <li> + All input files must be unique and all output files must be + unique. Otherwise execution is aborted before making any + changes. + </li> + <li> + In the case that something goes wrong during execution + (perhaps you tried to move a file to a non-existent + directory, or a syscall failed), a backup of your input + files is saved automatically by <code>mmv</code> for + recovery. + </li> + </ol> + + <p> + Due to the way <code>mmv</code> handles #2, when things do go + wrong you may find that all of your input files have + disappeared. Don’t worry though, <code>mmv</code> takes a + backup of your files before doing anything. If you + run <code>mmv</code> with the <code>-v</code> option for verbose + output, you’ll notice it backing up your stuff in + the <code>$XDG_CACHE_DIR</code> directory: + </p> + + <figure> + <pre>m4_fmt_code(mmv-verbose.sh.html)</pre> + </figure> + + <p> + Upon successful execution + the <code>$XDG_CACHE_DIR/mmv/TIMESTAMP</code> directory will be + automatically removed, but it remains when things go wrong so + that you can recover any missing data. The names of the + backup-subdirectories in the <code>$XDG_CACHE_DIR/mmv</code> + directory are timestamps of when the directories were created. + This should make it easier for you to figure out which directory + you need to recover if you happen to have multiple of these. + </p> + + <h2 id="examples">Examples</h2> + + <aside> + <p> + All of these examples are ripped straight from + the <code>mmv(1)</code> manual page. 
If you + installed <code>mmv</code> through a package manager or + via <code>make install</code> then you should have the + manual installed on your system. + </p> + </aside> + + <p>Swap the files <em>foo</em> and <em>bar</em>:</p> + <figure> + <pre>m4_fmt_code(examples/swap.sh.html)</pre> + </figure> + + <p> + Rename all files in the current directory to use hyphens (‘-’) + instead of spaces: + </p> + <figure> + <pre>m4_fmt_code(examples/hyphens.sh.html)</pre> + </figure> + + <p> + Rename a given list of movies to use lowercase letters and + hyphens instead of uppercase letters and spaces, and number them + so that they’re properly ordered in globs (e.g. rename <em>The + Return of the King.mp4</em> to + <em>02-the-return-of-the-king.mp4</em>): + </p> + <figure> + <pre>m4_fmt_code(examples/number.sh.html)</pre> + </figure> + + <p> + Rename files interactively in your editor while encoding newlines + into the literal string ‘<code>\n</code>’, making use + of <code><a href="https://linux.die.net/man/1/vipe" + target="_blank">vipe(1)</a></code> from <em>moreutils</em>: + </p> + <figure> + <pre>m4_fmt_code(examples/vipe.sh.html)</pre> + </figure> + + <p> + Rename all C source code- and header files in a git repository + to use snake_case instead of camelCase using + the <abbr class="gnu">GNU</abbr> + <code><a href="https://www.man7.org/linux/man-pages/man1/sed.1.html" + target="_blank">sed(1)</a></code> ‘<code>\L</code>’ extension: + </p> + <figure> + <pre>m4_fmt_code(examples/camel-to-snake.sh.html)</pre> + </figure> + + <p> + Lowercase all filenames within a directory hierarchy which may + contain newline characters: + </p> + <figure> + <pre>m4_fmt_code(examples/lowercase.sh.html)</pre> + </figure> + + <p> + Map filenames which may contain newlines in the current + directory with the command ‘<code>cmd</code>’, which itself does + not support NUL-byte-separated entries. 
This only works + assuming your mapping doesn’t require any context outside of the + given input filename (for example, you would not be able to + number your files as this requires knowledge of the input file’s + position in the input list): + </p> + <figure> + <pre>m4_fmt_code(examples/i-flag.sh.html)</pre> + </figure> + </main> + + <hr> + + <footer> + m4_footer + </footer> + </body> +</html> |
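
The `-e` encoding described in the post above (newlines become the literal string `\n`, backslashes become `\\`, and the utility's output is decoded again afterwards) can be sketched roughly as follows. This is only an illustration in Python, not code from the mmv source: the names `encode` and `decode` are hypothetical, and since the post does not say how a backslash followed by any other character is treated, this sketch assumes it is kept verbatim.

```python
def encode(name: str) -> str:
    # Escape backslashes first so that the backslashes introduced for
    # newlines in the second step are not escaped a second time.
    return name.replace("\\", "\\\\").replace("\n", "\\n")


def decode(line: str) -> str:
    # Walk the string once, turning the two known escape sequences back
    # into the characters they stand for.
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "n\\":
            out.append("\n" if line[i + 1] == "n" else "\\")
            i += 2
        else:
            # Assumption: any other character, including a lone or
            # unrecognised backslash, is copied through unchanged.
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    original = "foo\nbar"
    encoded = encode(original)
    # The encoded name fits on a single line, so a line-oriented tool
    # or a text editor can handle it safely.
    print(encoded)
    print(decode(encoded) == original)
```

Escaping backslashes alongside newlines is what keeps the scheme reversible: a filename that already contains a literal backslash followed by `n` encodes to something different from a filename containing a real newline, so decoding can tell the two apart.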