Add a post on ‘mmv’

author: Thomas Voss <mail@thomasvoss.com> 2023-08-15 14:57:32 +0200
committer: Thomas Voss <mail@thomasvoss.com> 2023-08-15 14:57:32 +0200
commit: d5635e946e9df6f519ec8cf08cebfc35dbe6c788 (patch)
tree: 46893cffdf23a2b15f8b7839c69d5df2bcbb8bca /src/prj/mmv/index.html
parent: cfa35dcb2d332977e80a5811b6d42e9949bd4814 (diff)
1 files changed, 658 insertions, 0 deletions
diff --git a/src/prj/mmv/index.html b/src/prj/mmv/index.html
new file mode 100644
index 0000000..d13f7c8
--- /dev/null
+++ b/src/prj/mmv/index.html
@@ -0,0 +1,658 @@
+<!DOCTYPE html>
+<html lang="en">
+	<head>
+		m4_include(head.html)
+	</head>
+	<body>
+		<header>
+			<div>
+				<h1>Moving Files the Right Way</h1>
+				m4_include(nav.html)
+			</div>
+
+			<figure class="quote">
+				<blockquote>
+					<p>I think the OpenBSD crowd is a bunch of masturbating
+					monkeys, in that they make such a big deal about
+					concentrating on security to the point where they pretty much
+					admit that nothing else matters to them.</p>
+				</blockquote>
+				<figcaption>
+					Linus Torvalds
+				</figcaption>
+			</figure>
+		</header>
+
+		<main>
+			<p>
+				<em>
+					You can find the <code>mmv</code> git repository over at
+					<a href="https://git.sr.ht/~mango/mmv"
+					   target="_blank">sourcehut</a>
+					   or <a href="https://github.com/Mango0x45/mmv">GitHub</a>.
+				</em>
+			</p>
+
+			<h2>Table of Contents</h2>
+
+			<ul>
+				<li><a href="#prologue">Prologue</a></li>
+				<li><a href="#moving">Advanced Moving and Pitfalls</a></li>
+				<li><a href="#mapping">Name Mapping with <code>mmv</code></a></li>
+				<li>
+					<a href="#newlines">Filenames with Embedded Newlines</a>
+					<ul>
+						<li><a href="#0-flag">The Simple Case</a></li>
+						<li><a href="#e-flag">Encoding Newlines</a></li>
+					</ul>
+				</li>
+				<li><a href="#i-flag">Individual Execution</a></li>
+				<li><a href="#safety">Safety</a></li>
+				<li><a href="#examples">Examples</a></li>
+			</ul>
+			
+			<h2 id="prologue">Prologue</h2>
+			<p>
+				File moving and renaming is one of the most common tasks we
+				undertake on the command-line.  We basically always do this with
+				the <code>mv</code> utility, and it gets the job done most of the
+				time.  Want to rename one file?  Use <code>mv</code>!  Want to
+				move a bunch of files into a directory?  Use <code>mv</code>!
+				How could <code>mv</code> ever go wrong?  Well, I’m glad you asked!
+			</p>
+
+			<h2 id="moving">Advanced Moving and Pitfalls</h2>
+			<p>
+				Let’s start off nice and simple.  You just inherited a C project
+				that uses the sacrilegious
+				<a
+					href="https://en.wikipedia.org/wiki/Camel_case"
+					target="_blank"
+				>camelCase</a>
+				naming convention for its files:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(ls-files.sh.html)</pre>
+			</figure>
+
+			<p>
+				This deeply upsets you, as it upsets me.  So you decide you want
+				to switch all these files to use
+				<a
+					href="https://en.wikipedia.org/wiki/Snake_case"
+					target="_blank"
+				>snake_case</a>,
+				like a normal person.  Well how would you do this?  You use
+				<code>mv</code>!  This is what you might end up doing:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(manual-mv.sh.html)</pre>
+			</figure>
+
+			<p>
+				Well… it works I guess, but it’s a pretty shitty way of renaming
+				these files.  Luckily we only had 5, but what if this was a much
+				larger project with many more files to rename?  Things would get
+				tedious.  So instead we can use a pipeline for
+				this:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(camel-to-snake-naïve.sh.html)</pre>
+			</figure>
+
+			<aside>
+				<p>
+					The given example assumes your <code>sed</code>
+					implementation supports ‘<code>\L</code>’ which is a
+					non-standard <abbr class="gnu">GNU</abbr> extension.
+				</p>
+			</aside>
+
+			<p>
+				That works and it gets the job done, but it’s not really ideal is
+				it?  There are a couple of issues with this.
+			</p>
+
+			<ol>
+				<li>
+					<p>
+						You’re writing more complicated code.  This has the
+						obvious drawback of potentially being more error-prone,
+						but also risks taking more time to write than you’d like
+						as you might have forgotten if <code>xargs</code>
+						actually has an ‘<code>-L</code>’ option or not (which
+						would require reading the
+						<a href="https://www.man7.org/linux/man-pages/man1/xargs.1.html"
+							target="_blank" ><code>xargs(1)</code></a> manual).
+					</p>
+				</li>
+				<li>
+					<p>
+						If you try to rename the file <em>foo</em>
+						to <em>bar</em> but <em>bar</em> already exists, you end
+						up deleting a file you may not have wanted to.
+					</p>
+				</li>
+				<li>
+					<p>
+						In a similar vein to the previous point, you need to be
+						very careful about schemes like renaming the
+						file <em>a</em> to <em>b</em> and <em>b</em>
+						to <em>c</em>.  You run the risk of turning <em>a</em>
+						into <em>c</em> and losing the file <em>b</em> entirely.
+					</p>
+				</li>
+				<li>
+					<p>
+						Moving symbolic links is its own whole can of worms.  If
+						a symlink points to a relative location then you need to
+						make sure you keep pointing to the right place.  If the
+						symlink is absolute however then you can leave it
+						untouched.  But what if the symlink points to a file
+						that you’re moving as part of your batch move operation?
+						Now you need to handle that too.
+					</p>
+				</li>
+			</ol>
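+			<p>
+				To make the third pitfall concrete, here is a minimal sketch
+				that you can run safely, since it only touches a throwaway
+				temporary directory.  The function name
+				<code>demo_clobber</code> and the files <em>a</em>,
+				<em>b</em>, and <em>c</em> are made up for illustration:
+			</p>

```shell
#!/bin/sh
# Hypothetical demo: we *intend* to rename a -> b and b -> c, but running
# the two mv commands one after the other destroys the original b.
demo_clobber() {
	dir=$(mktemp -d) || return 1
	printf 'A\n' > "$dir/a"
	printf 'B\n' > "$dir/b"

	mv "$dir/a" "$dir/b"   # silently overwrites the original b
	mv "$dir/b" "$dir/c"   # moves what used to be a, not b

	cat "$dir/c"           # shows the contents of the original a
	ls "$dir"              # only c is left; the original b is gone
	rm -rf "$dir"
}

demo_clobber
```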
+
+			<h2 id="mapping">Name Mapping with <code>mmv</code></h2>
+
+			<p>
+				What is <code>mmv</code>?  It’s the solution to all your
+				problems, that’s what it is!  <code>mmv</code> takes as its
+				argument(s) a utility and that utility’s arguments and uses that
+				to create a mapping between old and new filenames — similar to
+				the <code>map()</code> function found in many programming
+				languages.  I think to best convey how the tool functions, I
+				should provide an example.  Let’s try to do the same thing we did
+				previously where we tried to turn camelCase files to snake_case,
+				but using <code>mmv</code>:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(camel-to-snake-smart.sh.html)</pre>
+			</figure>
+
+			<p>Let me break down how this works.</p>
+
+			<p>
+				<code>mmv</code> starts by reading a series of filenames
+				separated by newlines from the standard input.  Yes, sometimes
+				filenames have newlines in them and yes there is a way to handle
+				them but I shall get to that later.  The filenames that
+				<code>mmv</code> reads from the standard input will be referred
+				to as the <em>input files</em>.  Once all the input files have
+				been read, the utility specified by the arguments is spawned; in
+				this case that would be <code>sed</code> with the argument
+				<code>'s/[A-Z]/\L_&/g'</code>. The input files are then piped
+				into <code>sed</code> the exact same way that they would have
+				been if we ran the above commands without <code>mmv</code>, and
+				the output of <code>sed</code> then forms what will be referred
+				to as the <em>output files</em>.  Once a complete list of output
+				files is accumulated, each input file gets renamed to its
+				corresponding output file.
+			</p>
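+			<p>
+				The steps above can be roughly simulated in plain shell without
+				renaming anything.  This is only a sketch of the idea and not
+				how <code>mmv</code> is actually implemented;
+				<code>preview_mapping</code> is a hypothetical helper, and it
+				assumes that none of the filenames contain newlines:
+			</p>

```shell
#!/bin/sh
# Hypothetical helper: print each input name next to the name the given
# utility maps it to, separated by a tab.  Nothing is renamed.
preview_mapping() {
	inputs=$(cat)                               # 1. read all the input files
	outputs=$(printf '%s\n' "$inputs" | "$@")   # 2. pipe them into the utility
	tmp=$(mktemp) || return 1
	printf '%s\n' "$outputs" > "$tmp"
	printf '%s\n' "$inputs" | paste - "$tmp"    # 3. pair Nth input with Nth output
	rm -f "$tmp"
}

# Each output line is "<old name><TAB><new name>".
printf '%s\n' LICENSE README | preview_mapping tr 'A-Z' 'a-z'
```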
+
+			<p>
+				Let’s look at a simpler example.  Say we want to rename 2 files
+				in the current directory to use lowercase letters.  We could use
+				the following command:
+			</p>
+			
+			<figure>
+				<pre>m4_fmt_code(mmv-tr.sh.html)</pre>
+			</figure>
+
+			<p>
+				In the above example <code>mmv</code> reads 2 lines from
+				standard input, those being <em>LICENSE</em>
+				and <em>README</em>.  Those are our 2 input files now.
+				The <code>tr</code> utility is then spawned and the input files
+				are piped into it.  We can simulate this in the shell:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(tr.sh.html)</pre>
+			</figure>
+
+			<p>
+				As you can see above, <code>tr</code> has produced 2 lines of
+				output; these are our 2 output files.  Since we now have our 2
+				input files and 2 output files, <code>mmv</code> can go ahead
+				and rename the files.  In this case it will rename
+				<em>LICENSE</em> to <em>license</em> and
+				<em>README</em> to <em>readme</em>.  For more examples, check
+				the <a href="#examples">examples</a> section further down this
+				page.
+			</p>
+
+			<h2 id="newlines">Filenames with Embedded Newlines</h2>
+
+			<p>
+				People are retarded, and as a result we have filenames with
+				newlines in them.  All it would have taken to solve this issue
+				for everyone was for literally <strong>anybody</strong> during
+				the early UNIX days to go “<em>hey, this is a bad idea!</em>”,
+				but alas, we must deal with this.  Newlines are of course not
+				the only special characters filenames can contain, but they are
+				the single most infuriating to deal with; the UNIX utilities all
+				being line-oriented really doesn’t work well with these files.
+			</p>
+
+			<p>
+				So how does <code>mmv</code> deal with special characters, and
+				newlines in particular?  Well it does so by providing the user
+				with the <code>-0</code> and <code>-e</code> flags:
+			</p>
+
+			<dl>
+				<dt><code>-0</code></dt>
+				<dd>
+					<p>
+						Tell <code>mmv</code> to expect its input to not be
+						separated by newlines (‘<code>\n</code>’), but by NUL
+						bytes (‘<code>\0</code>’).  NUL bytes are the only
+						characters not allowed in filenames besides forward
+						slashes, so they are an obvious choice for an
+						alternative separator.
+					</p>
+				</dd>
+				<dt><code>-e</code></dt>
+				<dd>
+					<p>
+						Encode newlines in filenames before passing them to the
+						provided utility.  Newline characters are replaced by the
+						literal string ‘<code>\n</code>’ and backslashes by the
+						literal string ‘<code>\\</code>’.  After processing, the
+						resulting output is decoded again.
+					</p>
+					<p>
+						If combined with the <code>-0</code> flag, then while
+						input will be read assuming a NUL-byte input separator,
+						the encoded input files will be written to the spawned
+						process newline-separated.
+					</p>
+				</dd>
+			</dl>
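+			<p>
+				To illustrate the encoding that <code>-e</code> is described as
+				performing, here is a small shell sketch.  The functions
+				<code>encode</code> and <code>decode</code> are hypothetical
+				and are not taken from <code>mmv</code> itself.  In particular,
+				leaving unknown escapes such as ‘<code>\x</code>’ untouched
+				when decoding is an assumption on my part:
+			</p>

```shell
#!/bin/sh
# Hypothetical sketch of the -e encoding: backslashes are doubled first,
# then every newline becomes the two characters backslash and n.
encode() {
	printf '%s\n' "$1" |
		sed 's/\\/\\\\/g' |
		awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 } END { printf "\n" }'
}

# Reverse the encoding by scanning left to right, so that an escaped
# backslash followed by an n is not mistaken for an encoded newline.
decode() {
	printf '%s\n' "$1" | awk '{
		out = ""
		n = length($0)
		for (i = 1; i <= n; i++) {
			c = substr($0, i, 1)
			if (c == "\\" && i < n) {
				i++
				d = substr($0, i, 1)
				if (d == "n")       out = out "\n"
				else if (d == "\\") out = out "\\"
				else                out = out "\\" d   # assumption: keep unknown escapes
			} else {
				out = out c
			}
		}
		printf "%s", out
	}'
}

name=$(printf 'foo\nbar')     # a filename with an embedded newline
encoded=$(encode "$name")
printf '%s\n' "$encoded"      # prints: foo\nbar
```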
+
+			<h3 id="0-flag">The Simple Case</h3>
+
+			<p>
+				In order to better understand these flags and how they work
+				let’s go through another example.  We have 2 files — one with and
+				one without an embedded newline — and our goal is to simply
+				reverse these filenames.  In this example I am going to be
+				displaying newlines in filenames with the “<code>$'\n'</code>”
+				syntax as this is how my shell displays embedded newlines.
+			</p>
+
+			<p>
+				We can start by just trying to naïvely pass these 2 files
+				to <code>mmv</code> and use <code>rev</code> to reverse the
+				names, but this doesn’t work:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(mmv-rev.sh.html)</pre>
+			</figure>
+
+			<p>
+			  This doesn’t work because, due to the line-oriented
+			  nature of <code>ls</code> and <code>rev</code>, we are actually
+			  trying to rename the files <em>foo</em>, <em>bar</em>, and
+			  <em>baz</em> to the new filenames <em>zab</em>,
+			  <em>rab</em>, and <em>oof</em>.  As can be seen in the following
+			  diagram, the embedded newline is causing our input to be ambiguous
+			  and <code>mmv</code> can’t reliably proceed
+			  anymore <x-ref>1</x-ref>:
+			</p>
+
+			<figure>
+				<object data="conflict.svg" type="image/svg+xml"></object>
+			</figure>
+
+			<aside>
+				<p data-ref="1">
+					The reason you get a cryptic “file not found” error message
+					is because <code>mmv</code> tries to assert that all the
+					input files actually exist before doing anything.  Since
+					“foo” isn’t a real file, we error out.
+				</p>
+			</aside>
+			
+			<p>
+			  The first thing we need to do in order to proceed is to pass
+			  the <code>-0</code> flag to <code>mmv</code>.  This will
+			  tell <code>mmv</code> that we want to use the NUL-byte as our
+			  input separator and not the newline.  We also need <code>ls</code>
+			  to actually provide us with the filenames delimited by NUL-bytes.
+			  Luckily <abbr class="gnu">GNU</abbr> <code>ls</code> gives us the
+			  <code>--zero</code> flag to do just that:
+			</p>
+
+			<figure>
+			  <pre>m4_fmt_code(mmv-rev-zero.sh.html)</pre>
+			</figure>
+
+			<p>
+				So we’re getting places, but we aren’t quite there yet.  The
+				issue we’re getting now is that <code>mmv</code> received 2
+				input files from the standard input, but <code>rev</code>
+				produced 3 output files.  Why is that?  Well let’s try our hand
+				at a little bit of command-line debugging with <code>sed</code>:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(sed-debugging.sh.html)</pre>
+			</figure>
+
+			<p>
+				If you aren’t quite sure what the above is doing, here’s a quick
+				summary:
+			</p>
+
+			<ul>
+				<li>
+					The <code>-U</code> flag given to <code>ls</code> tells it
+					not to sort our output.  This is purely just to keep this
+					example clear to the reader.
+				</li>
+				<li>
+					The <code>-n</code> flag given to <code>sed</code> tells it
+					not to print the input line automatically at the end of the
+					provided script.
+				</li>
+				<li>
+					The <code>l</code> command in <code>sed</code> prints the
+					current input in a “visually unambiguous form”.
+				</li>
+			</ul>
+
+			<p>
+				In the <code>sed</code> output, we can see that <samp>$</samp>
+				represents the end of a line, and <samp>\000</samp> represents
+				the NUL-byte.  All looks good here, we have two inputs separated
+				by NUL-bytes.  Now let’s try to throw in <code>rev</code>:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(sed-debugging-rev.sh.html)</pre>
+			</figure>
+
+			<p>
+				Well wouldn’t you know it?  Since <code>rev</code> <em>also</em>
+				works with newline-separated input, it reversed our NUL-byte
+				separators and now gives us 3 outputs.  Luckily the folks over
+				at <em>util-linux</em> provided us with the <code>-0</code> flag
+				here too, so that we can properly handle NUL-delimited input.
+				Combining all of this together we get a final working product:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(reverse-embedded-newline.sh.html)</pre>
+			</figure>
+
+			<h3 id="e-flag">Encoding Newlines</h3>
+
+			<p>
+				Sometimes we want to rename a bunch of files, but the command we
+				want to use doesn’t support NUL-bytes as nicely as we would
+				like.  In these cases, you may want to consider encoding your
+				newline characters into the literal string ‘<code>\n</code>’ and
+				then passing your input newline-separated to your given command
+				with the <code>-e</code> flag.
+			</p>
+
+			<p>
+				For a real-world example, perhaps you want to edit some
+				filenames in vim, or whatever other editor you use.  Well we can
+				do this incredibly easily with the <code>vipe</code> utility
+				from
+				the <a href="https://joeyh.name/code/moreutils/">moreutils</a>
+				collection.  The <code>vipe</code> command simply reads input
+				from the standard input, opens it up in your editor, and then
+				prints the resulting output to the standard output; perfect
+				for <code>mmv</code>!  We do not really want to deal with
+				NUL-bytes in our text-editor though, so let’s just encode our
+				newlines:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(vipe.sh.html)</pre>
+			</figure>
+
+			<aside>
+				<p>
+					Notice how you still need to pass the <code>-0</code> flag
+					so that <code>mmv</code> knows that our input files may have
+					embedded newlines.
+				</p>
+			</aside>
+
+			<p>
+				When running the above code example, you will see the following
+				in your editor:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(vim.html)</pre>
+			</figure>
+
+			<p>
+				After you exit your editor, <code>mmv</code> will decode all
+				occurrences of ‘<code>\n</code>’ back into a newline, and all
+				occurrences of ‘<code>\\</code>’ back into a backslash:
+			</p>
+
+			<figure>
+				<object data="e-flag.svg" type="image/svg+xml"></object>
+			</figure>
+
+			<h2 id="i-flag">Individual Execution</h2>
+			<p>
+				The previous examples are great and all, but what do you do if
+				your mapping command doesn’t have the concept of an input
+				separator at all?  This is where the <code>-i</code> flag comes
+				into play.  With the <code>-i</code> flag we can
+				get <code>mmv</code> to execute our mapping command for every
+				input filename.  This means that as long as we can work with a
+				complete buffer, we don’t need to worry about separators.
+			</p>
+
+			<p>
+				To be honest, I cannot really think of any situation where you
+				might actually need to do this.  If you can think of one,
+				please <a href="mailto:mail@thomasvoss.com">email me</a> and
+				I’ll update the example on this page.  Regardless, let’s imagine
+				that we wanted to rename some files so that each filename is
+				replaced with the
+				<a href="https://en.wikipedia.org/wiki/SHA-1" target="_blank">
+					SHA-1 hash</a> of that filename.
+				On Linux we have the <code>sha1sum</code> program which reads
+				input from the standard input and outputs the SHA-1 hash.  This
+				is how we would use it with <code>mmv</code>:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(sha1sum-long-example.sh.html)</pre>
+			</figure>
+
+			<p>
+				Another approach is to invoke <code>mmv</code> twice:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(sha1sum-short-example.sh.html)</pre>
+			</figure>
+
+			<p>
+				If you are confused about why we need to make a call
+				to <code>awk</code>, it’s because the <code>sha1sum</code>
+				program outputs 2 columns of data.  The first column is our hash
+				and the second column is the filename where the to-be-hashed
+				data was read from.  We don’t want the second column.
+			</p>
+
+			<p>
+				Unlike in previous examples where one process was spawned to map
+				all our filenames, with the <code>-i</code> flag we are spawning
+				a new instance for each filename.  If you struggle to visualize
+				this, perhaps the following diagrams help:
+			</p>
+
+			<figure>
+				<figcaption>Invoking <code>mmv</code> without <code>-i</code></figcaption>
+				<object data="without-i-flag.svg" type="image/svg+xml"></object>
+			</figure>
+
+			<figure>
+				<figcaption>Invoking <code>mmv</code> with <code>-i</code></figcaption>
+				<object data="with-i-flag.svg" type="image/svg+xml"></object>
+			</figure>
+
+			<h2 id="safety">Safety</h2>
+			<p>
+				When compared to the standard <code>for f in *; do mv $f …;
+				done</code> or <code>ls | … | xargs -L2 mv</code>
+				constructs, <code>mmv</code> is significantly safer to use.
+				These are some of the safety features that are built into the
+				tool:
+			</p>
+
+			<ol>
+				<li>
+					If the number of input- and output files differs, execution
+					is aborted before making any changes.
+				</li>
+				<li>
+					If an input file is renamed to the name of another input
+					file, the second input file is not lost (i.e. you can rename
+					<em>a</em> to <em>b</em> and <em>b</em> to <em>a</em> with
+					no problem).
+				</li>
+				<li>
+					All input files must be unique and all output files must be
+					unique. Otherwise execution is aborted before making any
+					changes.
+				</li>
+				<li>
+					In the case that something goes wrong during execution
+					(perhaps you tried to move a file to a nonexistent
+					directory, or a syscall failed), a backup of your input
+					files is saved automatically by <code>mmv</code> for
+					recovery.
+				</li>
+			</ol>
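+			<p>
+				As a rough illustration of checks 1 and 3, the following shell
+				sketch validates a mapping before anything is renamed.  It is
+				not <code>mmv</code>’s actual code;
+				<code>validate_mapping</code> is a hypothetical helper that
+				takes two newline-separated lists, and it assumes that the
+				filenames themselves contain no newlines:
+			</p>

```shell
#!/bin/sh
# Hypothetical helper: validate_mapping INPUTS OUTPUTS
# Both arguments are newline-separated lists of filenames.  Returns
# non-zero (having changed nothing) if the mapping looks unsafe.
validate_mapping() {
	inputs=$1
	outputs=$2

	# Check 1: the number of input and output files must match.
	n_in=$(printf '%s\n' "$inputs" | wc -l)
	n_out=$(printf '%s\n' "$outputs" | wc -l)
	if [ "$((n_in))" -ne "$((n_out))" ]; then
		echo "input and output counts differ" >&2
		return 1
	fi

	# Check 3: every name must be unique within its own list.
	for list in "$inputs" "$outputs"; do
		if [ -n "$(printf '%s\n' "$list" | sort | uniq -d)" ]; then
			echo "duplicate filename in mapping" >&2
			return 1
		fi
	done
	return 0
}

# Swapping a and b passes both checks.
validate_mapping "$(printf 'a\nb')" "$(printf 'b\na')" && echo "mapping looks safe"
```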
+
+			<p>
+				Due to the way <code>mmv</code> handles #2, when things do go
+				wrong you may find that all of your input files have
+				disappeared.  Don’t worry though, <code>mmv</code> takes a
+				backup of your files before doing anything.  If you
+				run <code>mmv</code> with the <code>-v</code> option for verbose
+				output, you’ll notice it backing up your stuff in
+				the <code>$XDG_CACHE_HOME</code> directory:
+			</p>
+
+			<figure>
+				<pre>m4_fmt_code(mmv-verbose.sh.html)</pre>
+			</figure>
+
+			<p>
+				Upon successful execution
+				the <code>$XDG_CACHE_HOME/mmv/TIMESTAMP</code> directory will be
+				automatically removed, but it remains when things go wrong so
+				that you can recover any missing data.  The names of the
+				backup-subdirectories in the <code>$XDG_CACHE_HOME/mmv</code>
+				directory are timestamps of when the directories were created.
+				This should make it easier for you to figure out which directory
+				you need to recover if you happen to have multiple of these.
+			</p>
+			
+			<h2 id="examples">Examples</h2>
+
+			<aside>
+				<p>
+					All of these examples are ripped straight from
+					the <code>mmv(1)</code> manual page. If you
+					installed <code>mmv</code> through a package manager or
+					via <code>make install</code> then you should have the
+					manual installed on your system.
+				</p>
+			</aside>
+
+			<p>Swap the files <em>foo</em> and <em>bar</em>:</p>
+			<figure>
+				<pre>m4_fmt_code(examples/swap.sh.html)</pre>
+			</figure>
+
+			<p>
+				Rename all files in the current directory to use hyphens (‘-’)
+				instead of spaces:
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/hyphens.sh.html)</pre>
+			</figure>
+
+			<p>
+				Rename a given list of movies to use lowercase letters and
+				hyphens instead of uppercase letters and spaces, and number them
+				so that they’re properly ordered in globs (e.g. rename <em>The
+				Return of the King.mp4</em> to
+				<em>02-the-return-of-the-king.mp4</em>):
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/number.sh.html)</pre>
+			</figure>
+
+			<p>
+				Rename files interactively in your editor while encoding newlines
+				into the literal string ‘<code>\n</code>’, making use
+				of <code><a href="https://linux.die.net/man/1/vipe"
+				target="_blank">vipe(1)</a></code> from <em>moreutils</em>:
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/vipe.sh.html)</pre>
+			</figure>
+
+			<p>
+				Rename all C source code- and header files in a git repository
+				to use snake_case instead of camelCase using
+				the <abbr class="gnu">GNU</abbr>
+				<code><a href="https://www.man7.org/linux/man-pages/man1/sed.1.html"
+				target="_blank">sed(1)</a></code> ‘<code>\L</code>’ extension:
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/camel-to-snake.sh.html)</pre>
+			</figure>
+
+			<p>
+				Lowercase all filenames within a directory hierarchy which may
+				contain newline characters:
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/lowercase.sh.html)</pre>
+			</figure>
+
+			<p>
+				Map filenames which may contain newlines in the current
+				directory with the command ‘<code>cmd</code>’, which itself does
+				not support NUL-byte-separated entries.  This only works
+				assuming your mapping doesn’t require any context outside of the
+				given input filename (for example, you would not be able to
+				number your files as this requires knowledge of the input file’s
+				position in the input list):
+			</p>
+			<figure>
+				<pre>m4_fmt_code(examples/i-flag.sh.html)</pre>
+			</figure>
+		</main>
+
+		<hr>
+			
+		<footer>
+			m4_footer
+		</footer>
+	</body>
+</html>