html lang="en" {
	head { m4_include(head.gsp) }
	body {
		header {
			div {
				h1 {-Reinvent The Wheel!}
				m4_include(nav.gsp)
			}

			figure .quote {
				blockquote {
					p {=
						You have to do what must be done.  Nobody is going to ask you, “why
						didn’t you make it?”.  It’s either do it or not.  Do not think about
						what you’re feeling, do it no matter what.
					}
				}
				figcaption {-Haroon Khan}
			}
		}

		main {
			h2 #story {-Story of a Software Engineer}
			p {-
				It was your average Wednesday afternoon, and I was working my job.  My
				specific task on this day was quite simple: document our custom Vue
				components that make up most of our product’s m4_abbr(UI).
			}

			p {-
				This should have been a relatively easy task, and for the most part it was, but
				I had an issue.  Some of these components had some @em{-really} obscure
				properties that could influence their behavior, and seeing as much of
				the codebase was written 10 years ago by utter idiots, the code
				implementing these properties is @em{-really} hard to read.
			}

			p {-
				I decided that instead of trying to study the @em{-definitions} of
				these properties, it would be quite a bit easier to study their
				@em{-usage}.  But how do I find those usages?  Our codebase
				is hundreds of thousands of lines of code, and these properties have
				very generic names such as ‘@em{-browser}’.  Additionally, while the
				components are easy to search for, they’re used in hundreds of places
				and such properties may only be used once or twice.
			}

			p {-
				The solution?  I thought it would be the trusty tool in every hacker’s
				toolbelt: @code{-grep}.
			}

			h2 #downfall {-The Downfall of Grep}
			p {-
				I thought that @code{-grep} would be my saviour.  The tool that would
				answer the call to find me the usages I so desired.  So I whipped that
				baby out and went straight to work:
			}

			figure {
				pre {= m4_fmt_code(grep.sh.gsp) }
			}

			p {-
				You can probably tell from the fact I’m writing this post that this did
				not work.  If you’ve ever worked with Vue or something similar, you
				might even be able to figure out why.  For those unfamiliar with the
				frontend (you’re a treasure that must be preserved), allow me to show
				you something that is all too common in a Vue codebase:
			}

			figure {
				pre .vue {= m4_fmt_code(example.vue.gsp) }
			}

			p {-
				The issue here is clear: the property we’re searching for (‘browser’) is
				on an entirely different line from the component we’re searching for
				(‘@code{-<date-input>}’).  It’s not enough to search for just the
				component because it’s used everywhere but only a few rare usages
				interest me, and it’s not enough to search for just the attribute
				because many different components have attributes of the same name (and
				@em{-no} they don’t have the same behavior; the codebase is shit).
			}

			p {-
				What I need is a tool that will let me search for patterns that span
				multiple lines.
			}

			h2 #grab {-Introducing Grab}
			figure .quote {
				blockquote {
					p {=
						The current UNIX® text processing tools are weakened by the built-in
						concept of a line.  There is a simple notation that can describe the
						‘shape’ of files when the typical array-of-lines picture is
						inadequate.  That notation is regular expressions.  Using regular
						expressions to describe the structure in addition to the contents of
						files has interesting applications, and yields elegant methods for
						dealing with some problems the current tools handle clumsily.  When
						operations using these expressions are composed, the result is
						reminiscent of shell pipelines.
					}
				}
				figcaption {-Rob Pike}
			}

			p {-
				That quote is from the abstract of @cite {-Structural Regular
					Expressions}, a paper written by the one and only Rob Pike back in
				1987.  It describes an idea by which we stop assuming that all data is
				organized in lines, and instead use regular expressions to define the
				shapes comprising our data.
			}

			p {-
				I had read this paper some years ago and it had always sat in
				the back of my mind.  I had even toyed around in the past with an
				implementation of @code{-grep} that wasn’t strictly line-oriented, but
				it was very bare-bones, and lacked basic faculties such as reporting the
				positions of matches, something I desperately needed.
			}

			p {-
				So over the following few days I made major changes, rewrote lots of the
				code, and overall turned my tool — @code{-grab} — into a staple part of
				my hacker’s toolbelt.
			}

			h2 #how {-How Grab Finds Text}
			p {-
				If you’re familiar with the UNIX environment, you’re probably used to
				querying text with tools such as @code{-sed} and @code{-awk} using
				regular expressions.  These are the same regular expressions we as
				programmers all know and love, but with one important — yet often
				overlooked — characteristic: you cannot match the newline.
			}

			p {-
				The @code{-grab} utility moves away from this limiting paradigm; the
				newline is treated no differently from any other character you want
				to match.  Want to match an entire paragraph of text?  The pattern is as
				simple as ‘@code{-[^\\n].‌+?(?=\\n\\n|$)}’.  It may look
				complicated if you’re new to regular expressions — m4_abbr(PCRE)s to be
				specific — but it’s really quite simple.  You just match a non-newline
				character, and then as few further characters as needed to reach either
				a double newline or the end of input.
			}
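
			p {-
				If you’d like to poke at that pattern without @code{-grab} at hand,
				here is a minimal sketch using Python’s @code{-re} module.  It assumes
				that @code{-re.DOTALL} is a fair stand-in for the way @code{-grab} lets
				‘.’ match a newline, and the sample text is made up for illustration.
			}

			figure {
				pre .python {=
```python
import re

# The paragraph pattern from the text.  re.DOTALL lets '.' cross newlines;
# treating that as equivalent to grab's behavior is an assumption.
PARAGRAPH = re.compile(r"[^\n].+?(?=\n\n|$)", re.DOTALL)

# Made-up sample input: two paragraphs separated by a blank line.
text = "first line\nsecond line\n\nanother paragraph\n"

paragraphs = PARAGRAPH.findall(text)
for paragraph in paragraphs:
    print(repr(paragraph))
```
				}
			}

			p {-
				The lazy ‘+?’ is what stops each match at the first blank line.  With
				a greedy ‘+’ the first match would run on to the end of the input.
			}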

			p {-
				On its own this isn’t too amazing though.  The great thing about
				@code{-grep} is that it doesn’t just show you matches, but it shows you
				them in the context of a complete line.  @code{-grab} solves this in the
				same way described in Rob Pike’s paper: by chaining operations.
			}

			p {-
				Say we want to iterate not over lines but over paragraphs.  We can use
				the following @em{-pattern}:
			}

			figure {
				pre {= m4_fmt_code(x.pat.gsp) }
			}

			p {-
				Here we’re using the ‘x’ operator.  It iterates over all occurrences of
				the pattern.  In this case we’re iterating over all paragraphs in our
				input.  Maybe we want to see all paragraphs which contain doubled words
				(for example: ‘the the’), a common typo found in text files.  For this
				we can use the ‘g’ operator:
			}

			figure {
				pre {= m4_fmt_code(g.pat.gsp) }
			}

			p {-
				The fundamental difference between the two operators is that the
				‘x’ operator specifies the structure to iterate over.  In the context of
				@code{-grep} these are lines, but in @code{-grab} they can be whatever
				you want.  The ‘g’ operator on the other hand doesn’t modify the
				structure of the matches returned to you at all; it simply acts as a
				filter, keeping only the selections which match the given regular expression.
			}
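
			p {-
				To make the difference concrete, here is a rough model of the two
				operators in Python.  This is not how @code{-grab} is implemented.  The
				function names @code{-x} and @code{-g}, the doubled-word regular
				expression, and the sample text are all assumptions made for the sake
				of the sketch.
			}

			figure {
				pre .python {=
```python
import re


def x(pattern, selections):
    """Model of 'x': replace each selection by the pieces matching pattern."""
    regex = re.compile(pattern, re.DOTALL)
    return [m.group(0) for s in selections for m in regex.finditer(s)]


def g(pattern, selections):
    """Model of 'g': keep each selection whole, but only if pattern matches."""
    regex = re.compile(pattern, re.DOTALL)
    return [s for s in selections if regex.search(s)]


# Made-up input: the second paragraph contains a doubled word.
text = "This paragraph is fine.\n\nThis one has the the typo\nacross two lines.\n"

# Start with the whole input as one selection, then narrow it down.
paragraphs = x(r"[^\n].+?(?=\n\n|$)", [text])
doubled = g(r"\b(\w+)\s+\1\b", paragraphs)

for paragraph in doubled:
    print(paragraph)
```
				}
			}

			p {-
				Note that @code{-x} changes what a selection @em{-is}, while
				@code{-g} only decides which selections survive.
			}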

			p {-
				Here’s an interactive example:
			}

			figure {
				pre {= m4_fmt_code(example-1.sh.gsp) }
			}

			p {-
				This is almost perfect; there’s just one bit missing.  In my interactive
				example I’ve shown how you can use the power of @code{-grab} to find
				paragraphs in your files containing doubled words.  This is really handy
				if you find yourself writing websites, documentation, or other long-form
				written content.
			}

			p {-
				Given my example though, how easily were you able to spot the doubled
				words?  They probably didn’t stick out to you right away, unlike if they
				had been highlighted by some bright flashy color.  It is for this reason
				that the ‘h’ operator exists.  This operator is unique in that it does
				not change the given selections at all.  The same selections made by
				previous occurrences of ‘x’ and ‘g’ will be displayed with or without
				the use of ‘h’.

			p {-
				The ‘h’ operator is purely for the user.  By using this operator you
				can specify a pattern for which matching text must be @em{-highlighted}.
				Let’s apply it to the previous example and see how the doubled words are
				made instantly obvious to the user:
			}

			figure {
				pre {= m4_fmt_code(example-2.sh.gsp) }
			}

			p {-
				There is an obvious problem here: the duplication of the regular
				expression provided to the ‘g’ and ‘h’ operators.  It is @em{-extremely}
				common that you will want to highlight text that was just matched by a
				‘g’ operator.  Like, @em{-really} common.  So common in fact that the
				‘h’ operator supports a shorthand syntax for this exact situation:
				@code {-h//}.  Giving an empty regular expression as an argument to an
				operator is illegal with the exception of the ‘h’ operator.  When this
				operator is given an empty argument, it assumes the regular expression
				of the previous operator:
			}

			figure {
				pre {= m4_fmt_code(example-3.sh.gsp) }
			}
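
			p {-
				For the curious, the idea behind ‘h’ and its empty-pattern shorthand
				can be modelled in a few lines of Python.  This is only a sketch: the
				@code{-highlight} function, the choice of red, and the doubled-word
				regular expression are assumptions, not @code{-grab}’s actual code.
			}

			figure {
				pre .python {=
```python
import re

# ANSI escape codes for red text and for resetting the color.
RED, RESET = "\x1b[31m", "\x1b[0m"


def highlight(selection, pattern, previous_pattern):
    """Wrap matches in color codes; an empty pattern reuses the previous one."""
    regex = re.compile(pattern or previous_pattern, re.DOTALL)
    return regex.sub(lambda m: RED + m.group(0) + RESET, selection)


doubled_words = r"\b(\w+)\s+\1\b"
paragraph = "This one has the the typo."

# Spelling the pattern out twice, versus leaving it empty as in h//.
explicit = highlight(paragraph, doubled_words, doubled_words)
shorthand = highlight(paragraph, "", doubled_words)

print(shorthand)
```
				}
			}

			p {-
				Both calls produce the same string.  The text of the selection is
				untouched apart from the color codes wrapped around the match.
			}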

			h2 #final {-Final Solution}
			p {-
				So… what was the final solution to my problem?  How did I find all the
				@code{-<date-input>} tags in my job’s codebase that were passed the
				‘browser’ attribute?  Well, here’s how:
			}

			figure {
				pre {= m4_fmt_code(answer.sh.gsp) }
			}

			p {-
				Quick, simple, and elegant.  Just the way I like it!
			}

			h2 #more {-Additional Operators}
			p {-
				Here I’ve shown you the 3 main operators: ‘x’, ‘g’, and ‘h’.  These
				aren’t all of them, however!  Each operator also has a capital variant
				(‘X’, ‘G’, ‘H’) which behaves the same, except that instead of working
				on text that matches the given pattern, it works on text which
				@em{-doesn’t} match the given pattern.
			}

			p {-
				These operators make it easy to express exclusions.  For example, a
				pattern to match all numbers which contain a ‘3’ but which aren’t ‘1337’
				could be written as @code{-x/[0-9]+/ g/3/ G/^1337$/}.
			}
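
			p {-
				Here is that chain traced step by step in Python, as a sketch rather
				than a description of @code{-grab}’s internals.  The input string is
				made up, and the sketch assumes that ‘^’ and ‘$’ anchor to the start
				and end of each selection rather than to lines.
			}

			figure {
				pre .python {=
```python
import re

# Made-up input containing the number we want to exclude.
text = "13 1337 42 73 1337 305"

# x/[0-9]+/  selects every run of digits.
numbers = re.findall(r"[0-9]+", text)

# g/3/  keeps the selections that contain a '3'.
with_three = [n for n in numbers if re.search(r"3", n)]

# G/^1337$/  keeps the selections that do NOT match the pattern.
result = [n for n in with_three if not re.search(r"^1337$", n)]

print(result)  # ['13', '73', '305']
```
				}
			}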
		}

		hr{}
		
		footer { m4_footer }
	}
}