m4_define(HL, ‘‘m4_patsubst($1, ‘‘<\([^>]*\)>’’, ‘‘@span .hl-red {=\1}’’)’’)m4_dnl
html lang="en" {
head { HEAD }
body {
header {
div {
h1 {-Reinvent The Wheel!}
INCLUDE(nav.gsp)
}
figure .quote {
blockquote {
p {=
You have to do what must be done. Nobody is going to ask you, “why
didn’t you make it?”. It’s either do it or not. Do not think about
what you’re feeling, do it no matter what.
}
}
figcaption {-Haroon Khan}
}
}
main {
h2 #story {-Story of a Software Engineer}
p {-
It was your average Wednesday afternoon, and I was working my job. My
specific task on this day was quite simple: document our custom Vue
components that make up most of our product’s UI.
}
p {-
This should have been a relatively easy task, and for the most part it
was, but I had an issue. Some of these components had some @em{-really}
obscure properties that could influence their behavior, and seeing as
much of the codebase was written 10 years ago by utter idiots, the code
implementing these properties was @em{-really} hard to read.
}
p {-
I decided that instead of trying to study the @em{-definitions} of
these properties, it would be quite a bit easier to study their
@em{-usage}. But how do I find them? Our codebase is hundreds of
thousands of lines of code, and these properties have very generic
names such as ‘@em{-browser}’. Additionally, while the components are
easy to search for, they’re used in hundreds of places, and such
properties may only be used once or twice.
}
p {-
The solution? I thought it would be the trusty tool in every hacker’s
toolbelt: @code{-grep}.
}
h2 #downfall {-The Downfall of Grep}
p {-
I thought that @code{-grep} would be my saviour. The tool that would
answer the call to find me the usages I so desired. So I whipped that
baby out and went straight to work:
}
figure {
pre { FMT_CODE(grep.sh) }
}
p {-
You can probably tell from the fact I’m writing this post that this did
not work. If you’ve ever worked with Vue or something similar, you
might even be able to figure out why. For those unfamiliar with the
frontend (you’re a treasure that must be preserved), allow me to show
you something that is all too common in a Vue codebase:
}
figure {
pre .vue { FMT_CODE(example.vue) }
}
p {-
The issue here is clear: the property we’re searching for (‘browser’) is
on an entirely different line from the component we’re searching for
(‘@code{-<date-input>}’). It’s not enough to search for just the
component because it’s used everywhere, but only a few rare usages
interest me, and it’s not enough to search for just the attribute
because many different components have attributes of the same name (and
@em{-no}, they don’t have the same behavior; the codebase is shit).
}
p {-
What I need is a tool that will let me search for patterns that span
multiple lines.
}
h2 #grab {-Introducing Grab}
figure .quote {
blockquote {
p {=
The current UNIX® text processing tools are weakened by the built-in
concept of a line. There is a simple notation that can describe the
‘shape’ of files when the typical array-of-lines picture is
inadequate. That notation is regular expressions. Using regular
expressions to describe the structure in addition to the contents of
files has interesting applications, and yields elegant methods for
dealing with some problems the current tools handle clumsily. When
operations using these expressions are composed, the result is
reminiscent of shell pipelines.
}
}
figcaption {-Rob Pike}
}
p {-
That quote is from the abstract of @cite {-Structural Regular
Expressions}, a paper written by the one and only Rob Pike back in
1987. It describes an idea by which we stop assuming that all data is
organized in lines, and instead use regular expressions to define the
shapes comprising our data.
}
p {-
I had read this paper some years ago, and it had always sat in the back
of my mind. I had even toyed around in the past with an implementation
of @code{-grep} that wasn’t strictly line-oriented, but it was very
bare-bones and lacked basic faculties such as reporting the positions of
matches, something I desperately needed.
}
p {-
So over the following few days I made major changes, rewrote lots of the
code, and overall turned my tool — @code{-grab} — into a staple part of
my hacker’s toolbelt.
}
h2 #how {-How Grab Finds Text}
p {-
If you’re familiar with the UNIX environment, you’re probably used to
querying text with tools such as @code{-sed} and @code{-awk} using
regular expressions. These are the same regular expressions we as
programmers all know and love, but with one important — yet often
overlooked — characteristic: you cannot match the newline.
}
p {-
The @code{-grab} utility moves away from this limiting paradigm; the
newline is treated no differently from any other character you want to
match. Want to match an entire paragraph of text? The pattern is as
simple as ‘@code{-[^\\n].+?(?=\\n\\n|$)}’. It may look complicated if
you’re new to regular expressions — PCREs to be specific — but it’s
really quite simple. You just match a non-newline character, and then
keep matching characters until reaching either a double newline or the
end of input.
}
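p {-
For illustration, an invocation using this pattern might look something
like the following sketch (the exact command-line syntax and output
formatting of @code{-grab} may differ slightly from version to version,
but the idea is that each matched paragraph is reported as a whole
rather than line by line):
}
figure {
pre {-grab 'x/[^\\n].+?(?=\\n\\n|$)/' file.txt}
}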
p {-
On its own this isn’t too amazing though. The great thing about
@code{-grep} is that it doesn’t just show you matches; it shows them to
you in the context of a complete line. @code{-grab} solves this in the
same way described in Rob Pike’s paper: chaining operations.
}
p {-
Say we want to iterate not over lines but over paragraphs. We can use
the following @em{-pattern}:
}
figure {
pre { FMT_CODE(x.pat) }
}
p {-
Here we’re using the ‘x’ operator. It iterates over all occurrences of
the pattern. In this case we’re iterating over all paragraphs in our
input. Maybe we want to see all paragraphs which contain doubled words
(for example: ‘the the’), a common typo found in text files. For this
we can use the ‘g’ operator:
}
figure {
pre { FMT_CODE(g.pat) }
}
p {-
The fundamental difference between the two operators is that the
‘x’ operator specifies the structure to iterate over. In the context of
@code{-grep} these are lines, but in @code{-grab} they can be whatever
you want. The ‘g’ operator, on the other hand, doesn’t modify the
structure of the matches returned to you at all; it simply acts as a
filter, selecting matches that match the given regular expression.
}
p {-
Here’s an interactive example:
}
figure {
pre { FMT_CODE(example-1.sh) }
}
p {-
This is almost perfect; there’s just one bit missing. In my interactive
example I’ve shown how you can use the power of @code{-grab} to find
paragraphs in your files containing doubled words. This is really handy
if you find yourself writing websites, documentation, or other long-form
written content.
}
p {-
Given my example though, how easily were you able to spot the doubled
words? They probably didn’t stick out to you right away, unlike if they
had been highlighted by some bright, flashy color. It is for this
reason that the ‘h’ operator exists. This operator is unique in that it
does not change the given selections at all. Any matches made by
previous occurrences of ‘x’ and ‘g’ will be displayed the same with and
without the use of ‘h’.
}
p {-
The ‘h’ operator is purely for the user. By using this operator you
can specify a pattern whose matching text will be @em{-highlighted}.
Let’s apply it to the previous example and see how the doubled words are
made instantly obvious to the user:
}
figure {
pre { HL(FMT_CODE(example-2.sh)) }
}
p {-
There is an obvious problem here: the duplication of the regular
expression provided to the ‘g’ and ‘h’ operators. It is @em{-extremely}
common that you will want to highlight text that was just matched by a
‘g’ operator. Like, @em{-really} common. So common, in fact, that the
‘h’ operator supports a shorthand syntax for this exact situation:
@code{-h//}. Giving an empty regular expression as an argument to an
operator is illegal, with the exception of the ‘h’ operator. When this
operator is given an empty argument, it assumes the regular expression
of the previous operator:
}
figure {
pre { HL(FMT_CODE(example-3.sh)) }
}
h2 #final {-Final Solution}
p {-
So… what was the final solution to my problem? How did I find all the
@code{-<date-input>} tags in my job’s codebase that were passed the
‘browser’ attribute? Well, here’s how:
}
figure {
pre { FMT_CODE(answer.sh) }
}
p {-
Quick, simple, and elegant. Just the way I like it!
}
h2 #more {-Additional Operators}
p {-
Here I’ve shown you the three main operators: ‘x’, ‘g’, and ‘h’. These
are not all of them, however! Each operator also has a capital variant
(‘X’, ‘G’, and ‘H’) which behaves the same, except that instead of
working on text that matches the given pattern, it works on text which
@em{-doesn’t} match the given pattern.
}
p {-
These operators allow for more precise pattern matching. For example, a
pattern to match all numbers which contain a ‘3’ but which aren’t ‘1337’
could be written as @code{-x/[0-9]+/ g/3/ G/^1337$/}.
}
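p {-
As a quick sketch of how that plays out (assuming @code{-grab} takes the
pattern as its first argument and reads standard input when no file is
given), running the pattern over a small list of numbers would select
‘23’ and ‘13’ while skipping ‘45’ and ‘1337’:
}
figure {
pre {-printf '1337\\n23\\n45\\n13\\n' | grab 'x/[0-9]+/ g/3/ G/^1337$/'}
}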
}
footer { FOOT }
}
}