Verifying binaries downloaded from the internet

Frequently we download binaries from the internet. How do we know they are genuine and have not been tampered with? (Genuine as in genuine auto parts.) This post illustrates the process using the JMeter library as an example, downloading on macOS. The steps on Linux should be similar. The download page for JMeter is here. It stresses not once, but twice, that you must verify the integrity of the file.


This is how the process works. Below, it is assumed you have the wget and gpg programs installed on your computer. If you don’t, install them first before moving on. There are three files you will need to download: the binary, the signature (.asc) file, and the KEYS file. The steps below cover each one.

Step 1: Download the binary

➜   wget https://dlcdn.apache.org//jmeter/binaries/apache-jmeter-5.5.tgz
--2022-06-22 10:51:40--  https://dlcdn.apache.org//jmeter/binaries/apache-jmeter-5.5.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 2a04:4e42::644, 151.101.2.132
Connecting to dlcdn.apache.org (dlcdn.apache.org)|2a04:4e42::644|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85476161 (82M) [application/x-gzip]
Saving to: ‘apache-jmeter-5.5.tgz’

apache-jmeter-5.5.tgz             100%[==========================================================>]  81.52M  10.4MB/s    in 8.0s

2022-06-22 10:51:48 (10.2 MB/s) - ‘apache-jmeter-5.5.tgz’ saved [85476161/85476161]

Step 2: Download the signature file

In the case of JMeter it is under the PGP link. Tip: the PGP signature file ends in .asc

➜   wget https://www.apache.org/dist/jmeter/binaries/apache-jmeter-5.5.tgz.asc
--2022-06-22 10:55:41--  https://www.apache.org/dist/jmeter/binaries/apache-jmeter-5.5.tgz.asc
Resolving www.apache.org (www.apache.org)... 2a04:4e42::644, 151.101.2.132
Connecting to www.apache.org (www.apache.org)|2a04:4e42::644|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://downloads.apache.org/jmeter/binaries/apache-jmeter-5.5.tgz.asc [following]
--2022-06-22 10:55:41--  https://downloads.apache.org/jmeter/binaries/apache-jmeter-5.5.tgz.asc
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 853 [text/plain]
Saving to: ‘apache-jmeter-5.5.tgz.asc’

apache-jmeter-5.5.tgz.asc         100%[==========================================================>]     853  --.-KB/s    in 0s

2022-06-22 10:55:42 (407 MB/s) - ‘apache-jmeter-5.5.tgz.asc’ saved [853/853]

Step 3: Download the file containing the public keys

The authenticity of a file is established by verifying the digital signature associated with it. To verify the digital signature we downloaded in the previous step, you also need the public key: the signature was created with the signer’s private key, and the matching public key is what verifies it. The JMeter page says:

The KEYS link links to the code signing keys used to sign the product.

Download the keys:

➜   wget https://downloads.apache.org/jmeter/KEYS
--2022-06-22 10:54:50--  https://downloads.apache.org/jmeter/KEYS
Resolving downloads.apache.org (downloads.apache.org)... 2a01:4f8:10a:201a::2, 135.181.214.104, 88.99.95.219, ...
Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:201a::2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13665 (13K)
Saving to: ‘KEYS.1’

KEYS.1                            100%[==========================================================>]  13.34K  70.5KB/s    in 0.2s

2022-06-22 10:54:51 (70.5 KB/s) - ‘KEYS.1’ saved [13665/13665]

In my case I had previously downloaded some keys into a file named KEYS, so wget saved this one as KEYS.1 on my computer.

Step 4: Import the keys into gpg

➜   gpg --import KEYS.1
gpg: key 5A72D515FE99D106: public key "Mike Stover <mstover1@apache.org>" imported
gpg: key D5334E75B1313DE2: public key "Robert Burrell Donkin (CODE SIGNING KEY) <rdonkin@apache.org>" imported
gpg: key 2CDF847D3A01713B: public key "Michael Stover (APACHE CODE_SIGNING KEY) <mstover1@apache.org>" imported
gpg: key 95E965D25DB8AC0F: public key "Peter Lin (Peter Lin JMeter committer) <woolfel@apache.org>" imported
gpg: key 3FE0C161BCA973AC: public key "Sebastian Bazley (ASF Signing Key) <sebb@apache.org>" imported
gpg: key 7A8860944FAD5F62: public key "Sebastian Bazley (ASF CODE SIGNING KEY) <sebb@apache.org>" imported
gpg: key AC214CAA0612B399: public key "Milamber (ASF) <milamber@apache.org>" imported
gpg: Total number processed: 7
gpg:               imported: 7
gpg: no ultimately trusted keys found

The message gpg: no ultimately trusted keys found means none of the keys are in the web of trust. The web of trust is central to how security works in PGP; read about it in the further reading section.

Step 5: Now verify the signature

➜   gpg --verify apache-jmeter-5.5.tgz.asc apache-jmeter-5.5.tgz
gpg: Signature made Fri Jun 10 09:52:06 2022 PDT
gpg:                using RSA key C4923F9ABFB2F1A06F08E88BAC214CAA0612B399
gpg: Good signature from "Milamber (ASF) <milamber@apache.org>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: C492 3F9A BFB2 F1A0 6F08  E88B AC21 4CAA 0612 B399

What does this mean? The signature is valid, and it comes from Milamber (ASF) milamber@apache.org, whose public key we imported in Step 4. But since that key is not trusted (why? because it’s not in our web of trust), we get the message gpg: WARNING: This key is not certified with a trusted signature! What the warning is saying is that there is no path in our web of trust that leads to milamber. Take LinkedIn as an example. Have you viewed a profile and noticed LinkedIn says it’s a 3rd degree connection? That means you know someone who knows someone who knows the person you are viewing: you -> someone you know (1st degree) -> someone they know (2nd degree) -> the person you are viewing (3rd degree). In the case of milamber there is no chain – no matter how long – that starts from us and ends at milamber. So milamber is not in our web of trust.
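Under the hood, gpg --verify is doing public-key signature verification: the release manager signs with a private key, and anyone can check the signature with the matching public key. The same idea can be sketched in plain Java with a throwaway key pair (the key pair and file contents below are illustrative, not Apache’s real keys):

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignatureDemo {
    public static void main(String[] args) throws Exception {
        // Throwaway RSA key pair, standing in for the signer's real keys
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair kp = kpg.generateKeyPair();

        byte[] file = "pretend this is apache-jmeter-5.5.tgz".getBytes(StandardCharsets.UTF_8);

        // Signer side: sign with the PRIVATE key (done once, at release time)
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(kp.getPrivate());
        signer.update(file);
        byte[] sig = signer.sign(); // conceptually, what ends up in the .asc file

        // Verifier side: check with the PUBLIC key (what gpg --verify does)
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(kp.getPublic());
        verifier.update(file);
        System.out.println(verifier.verify(sig)); // true: file is untouched

        // A tampered file fails verification
        byte[] tampered = "pretend this is apache-jmeter-5.5.tgz!".getBytes(StandardCharsets.UTF_8);
        verifier.initVerify(kp.getPublic());
        verifier.update(tampered);
        System.out.println(verifier.verify(sig)); // false
    }
}
```

The trust question discussed above is separate: the math only proves the signature matches some key pair, not that the key pair belongs to someone you trust.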

Step 6: Verify the fingerprint – Don’t skip this

➜   gpg --fingerprint C4923F9ABFB2F1A06F08E88BAC214CAA0612B399
pub   rsa4096 2010-08-14 [SC]
      C492 3F9A BFB2 F1A0 6F08  E88B AC21 4CAA 0612 B399
uid           [ unknown] Milamber (ASF) <milamber@apache.org>

Verify that the fingerprint you get – C492 3F9A BFB2 F1A0 6F08 E88B AC21 4CAA 0612 B399 – matches the fingerprint printed in Step 5.

We are done now – but read the full article for a caveat emptor. To add the key to our web of trust, we can either sign it ourselves (equivalent to self-endorsement – see this for details) or find someone in our web of trust who trusts that key. This is left as an exercise.

What happens if file is not genuine?

Imagine a hacker were to replace the genuine file with a tampered version in which they have injected some malicious code (e.g., a virus). In that case the command

➜   gpg --verify apache-jmeter-5.5.tgz.asc apache-jmeter-5.5.tgz

will fail, as the signature in the .asc file will no longer be valid. Try it yourself as an exercise: tamper with the file apache-jmeter-5.5.tgz ever so slightly and run the command. What do you get? This is layer 1 of security. So if the hacker somehow manages to replace the genuine file with a tampered version, they also need to update the .asc signature file. Well, what if someone succeeded in that? This is what happens in that case:

$ gpg --verify apache-jmeter-5.5.tgz.asc apache-jmeter-5.5.tgz
gpg: Signature made Fri Jun 10 09:52:06 2022 PDT
gpg:                using RSA key C4923F9ABFB2F1A06F08E88BAC214CAA0612B399
gpg: Can't check signature: No public key

The signature fails to validate again. This is layer 2 of security. So to fully pass the security checks, the hacker has to succeed in modifying all three files on the server – the main file (apache-jmeter-5.5.tgz), the signature file (apache-jmeter-5.5.tgz.asc) and the file containing the public keys (KEYS) – they have to add their public key to the KEYS file. As a best practice, these three files should be hosted on different servers, because if they are all on the same server and the hacker gains access to it with sufficient privileges, it’s not a big challenge to modify all three.

Suppose a hacker is able to modify all three files. In that case, this is your warning that not all may be well. Call it layer 3 of security:

gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.

What this is saying is that the signature is by a person who is not in your web of trust – imagine receiving a signed paper document from someone you don’t know or trust. At this point most people assume everything is well and ignore the warning.

When we ignore the above warning, in essence we are assuming the KEYS file is bullet-proof and cannot be hacked. Is this a good assumption? This is your caveat emptor.

References

Posted in Computers, programming, Software | Tagged , | Leave a comment

The right way to implement producer consumer pattern in Java

The producer consumer pattern can be “easily” implemented in Java using one of the BlockingQueue classes. The producer puts work into the queue and the consumer takes work from the queue. We start two separate threads for the producer and the consumer. After that we just have to wait for the threads to complete. How do we do that? You might think of starting with this:
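A sketch of that naive approach – the lambdas here are placeholders for your real producer and consumer Runnables:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NaiveApproach {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService es = Executors.newFixedThreadPool(2);
        es.execute(() -> System.out.println("producing")); // your producer Runnable
        es.execute(() -> System.out.println("consuming")); // your consumer Runnable
        es.shutdown(); // stop accepting new tasks
        // Block until both tasks complete, the timeout expires, or we are interrupted
        es.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}
```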

Just replace the Runnables with your producer and consumer. This answer has been vetted by SO and has a whopping 144 votes. What could go wrong? Well, this code works fine as long as we stick to the happy path. By happy path I mean the situation where there are no runtime exceptions and things happen as expected.

But think for a moment about what this code will do when the unhappy path is encountered – as often happens not during local testing but once the code is deployed to production. Imagine the producer (thread) crashes midway. The consumer (thread) will keep waiting indefinitely, because it doesn’t know the producer has crashed and it is waiting for the final message from the producer signaling there is no more work to be done. The call to es.awaitTermination will never return (assuming your timeout was infinite). Yes, that is how ugly it can get. awaitTermination blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first. Remember that an exception in the consumer is not the same as an exception in the thread that runs awaitTermination.

You might think that if awaitTermination could somehow be modified so that it blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or any task encounters an unhandled exception, whichever happens first, that might solve it. No, it won’t. Want to know why? In that case, when the producer crashes, the call to awaitTermination will return and, let’s say, your main thread exits. In Java, your program can keep on running even if the main thread has exited! So the consumer thread still keeps running, waiting indefinitely for work to arrive from the producer, and keeps your program alive.
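The claim that a Java program can outlive its main thread is easy to demonstrate: any live non-daemon thread keeps the JVM alive. A small self-contained example:

```java
public class MainExitsDemo {
    public static void main(String[] args) {
        // A non-daemon thread: the JVM will not exit until it finishes.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(500); // simulate a consumer stuck waiting for work
            } catch (InterruptedException e) {
                return;
            }
            System.out.println("still alive after main returned");
        });
        worker.start();
        System.out.println("main returning now");
        // main returns here, but the program keeps running ~500 ms longer
    }
}
```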

These are not artificially concocted cases. They are pathological for sure, but I have seen them happen first-hand. The difficulty in implementing the producer consumer pattern lies in handling all these exceptional cases properly. What should we do if the consumer crashes? What if the producer crashes?

There are so many ways to get it wrong that I will just give the final answer on how to do it right. Below, Work is a class that you write which encapsulates the work to do. When the consumer receives Work.EMPTY, that tells it there is no more work to be done. We make use of an ArrayBlockingQueue, which takes a lot of the pain away in implementing the producer-consumer pattern; it is the very heart and soul of the pattern. We only create a single producer and a single consumer in the code below, but think about how you would modify it for 100 producers and 100 consumers.
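The Work class itself is yours to write; a minimal sketch might look like this (the payload field is illustrative – the only thing the driver code relies on is the EMPTY sentinel):

```java
// Minimal sketch of a Work item with an EMPTY sentinel (a "poison pill").
public class Work {
    public static final Work EMPTY = new Work(null); // "no more work" marker

    private final String payload; // whatever a unit of work means for you

    public Work(String payload) {
        this.payload = payload;
    }

    public String getPayload() {
        return payload;
    }

    public boolean isEmpty() {
        return this == EMPTY; // the sentinel is compared by identity
    }

    public static void main(String[] args) {
        System.out.println(EMPTY.isEmpty());         // true
        System.out.println(new Work("x").isEmpty()); // false
    }
}
```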

private static Throwable waitAny(CompletableFuture<?>... cfs) {
    try {
        CompletableFuture.anyOf(cfs).join();
    } catch (Exception e) {
        e.printStackTrace();
        return e;
    }
    return null;
}

public static void main(String[] args) {
        try {
            BlockingQueue<Work> sharedQueue = new ArrayBlockingQueue<>(10);
            Producer producer = new Producer(sharedQueue, producerArgs);
            Consumer consumer = new Consumer(sharedQueue, consumerArgs);
            ExecutorService es = Executors.newCachedThreadPool();
            CompletableFuture<?>[] futures = new CompletableFuture<?>[2];
            CompletableFuture<?> producerTask = CompletableFuture.runAsync(producer, es);
            CompletableFuture<?> consumerTask = CompletableFuture.runAsync(consumer, es);
            futures[0] = producerTask;
            futures[1] = consumerTask;            
            Throwable t = waitAny(futures); // wait until one of the tasks completes
            if (producerTask.isDone()) {
                if (producerTask.isCompletedExceptionally()) { // Returns true if this CompletableFuture completed
                                                               // exceptionally, in any way. This includes cancellation
                                                               // of the task.
                    // unhappy path
                    // we want to wait for consumer to finish
                    // consumer will not finish until it gets an empty Work
                    if (!consumerTask.isDone()) {
                        sharedQueue.put(Work.EMPTY);
                    }
                }
                consumerTask.get(); // wait for consumer to finish
            } else {
                // in this case the consumer finished first
                assert consumerTask.isDone();
                if (consumerTask.isCompletedExceptionally()) {
                    // unhappy code path
                    // if consumer has failed, no point in letting the producer run
                    // if we don't cancel the producer, it will keep waiting indefinitely
                    // for sharedQueue to become available so it can queue more work for the consumer
                    // (the producer does not know that the consumer has crashed)
                    producerTask.cancel(true);
                }
                producerTask.get(); // wait for producer to finish
            }
            if (t != null) {
                // do you understand why we need to do this?
                System.exit(1);
            }
        } catch (Exception e) {
            e.printStackTrace();
            // without this, the program can still keep running even though main thread has
            // exited. Its Java.
            System.exit(1);
        }
}

Your producer and consumer classes should implement the Runnable interface. e.g.:

class Consumer implements Runnable {

    @Override
    public void run() {
    }
}

In the run method you will call the put and take methods of BlockingQueue (the producer calls put to put Work into the queue and the consumer calls take to take Work from the queue); both can throw the checked InterruptedException. Do not swallow the exception! Re-throw it:

class Consumer implements Runnable {

    @Override
    public void run() {
        try {
            ...
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}

And that’s how I learned to do it right. This post does not detail all the mistakes and hundreds of ways I did it wrong before arriving at the correct solution. Believe me, it was hard to get this right. And it would be 10x harder if we did not have the BlockingQueue classes. Once again, most of the time we only test the happy path (as exemplified in the SO answer at the beginning of the post) and thus never discover the hidden flaws in our code. I urge you to try the unhappy path in your software applications and see for yourself if your code can handle it.

UPDATE (2022/08/11): Even the code above has an undesirable behavior, as I found out later. It is this: the main thread exits, but the program still keeps running for a minute in an idle state before finally exiting. Upon investigation I found it is because of the 60s default keep-alive time of newCachedThreadPool. To fix this, substitute the call to newCachedThreadPool with newFixedThreadPool(2) – and do not forget to shut down the ExecutorService, otherwise the program will keep running forever.
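The fix, sketched in isolation (the empty lambdas stand in for the real producer and consumer):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FixedPoolFix {
    public static void main(String[] args) throws InterruptedException {
        // newFixedThreadPool(2): one thread each for producer and consumer,
        // with no 60-second keep-alive like newCachedThreadPool has.
        ExecutorService es = Executors.newFixedThreadPool(2);
        es.execute(() -> { /* producer */ });
        es.execute(() -> { /* consumer */ });

        // Without shutdown(), the pool's non-daemon threads would keep
        // the JVM alive forever once main returns.
        es.shutdown();
        boolean done = es.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("terminated cleanly: " + done);
    }
}
```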


Using WSL – Windows Subsystem for Linux

I keep forgetting it so it deserves a post of its own. To access the Ubuntu directory from Windows Explorer type:

\\wsl$

Also run:

mklink /d c:\wsl2-home \\wsl$\Ubuntu\home\me

to create a directory on C:\ from which you can access the WSL filesystem (refer this).

How to know if you are running WSL1 or WSL2

$ uname -r
5.10.16.3-microsoft-standard-WSL2

How to access C drive from WSL?

/mnt/c

How WSL works

The way I understand it, WSL2 runs a build of Linux on a lightweight VM in Windows. The takeaway is that it is really running Linux, not a translation or proxy layer.

Keyboard Shortcuts

Scroll Up/Down: Ctrl+Shift+Page Up/Down

Windows Terminal

Windows Terminal is separate from the cmd.exe utility that comes with Windows. When you open a WSL command prompt using Windows Terminal, it actually executes the following under the covers:

C:\WINDOWS\system32\wsl.exe -d Ubuntu

Enabling systemd on WSL2

$ cat /etc/wsl.conf
[boot]
systemd=true

Increase memory available to WSL2

cat %UserProfile%/.wslconfig
[wsl2]
memory=16GB

Limit the size to which WSL2 can grow

# Goes in the [wsl2] section of .wslconfig.
# Can be used to limit the maximum size that a distribution file system is allowed to take up.
defaultVhdSize=230GB

Remember to shut down and restart WSL2 after that:

wsl --shutdown

verify there are no running instances

wsl --list --running

Danger Zone!

Once I had to reset my PC, as I was getting a blue screen and Windows wasn’t booting up. Even though I selected the Keep My Files option, all the WSL2 data was deleted. So beware of this.

All WSL2 data is stored in this file: c:\users\siddj\AppData\Local\Packages\CanonicalGroupLi...\LocalState\ext4.vhdx

 Get-Item "$env:LOCALAPPDATA\Packages\CanonicalGroupLimited.Ubuntu_*\\LocalState\\ext4.vhdx" | Select Name,Length

Name            Length
----            ------
ext4.vhdx 177508188160

Other Notes

Over time I have begun to like WSL because it avoids all the surprising gotchas that happen when you develop on Mac but deploy on Linux, esp. when doing C++ development. E.g., installing NGINX is easy on WSL. Contrast this with Mac: NGINX has no official installer on Mac, and the Homebrew software has caused me more pain than pleasure.

Some software just doesn’t work on Mac. Some examples: 1, 2

Another bonus is that with WSL I was able to test a program I would deploy with systemd. With Mac you can’t do that as there is no systemd.

Yet another bonus (not related to WSL specifically, but an issue with Mac): with Mac I was once recording the screen and, to my surprise, there is no way to pause the recording (just search on Google, and let me know if you find an answer)! With Windows I can do that – pause and resume using the screen recorder that comes with the Snipping Tool.

One tip when using WSL: if you run any program from Windows (/mnt/c) it will run slowly. The fix is to treat WSL as a separate machine and install a separate copy of the software on WSL. E.g., see this.


Over time I have completely switched over from Mac to WSL2 and am liking it so far. What do you prefer?

Monitoring Network Activity

Try installing mitmproxy and run [1]:

mitmproxy --mode local

Also see this for a quick and dirty approach, and this. Wireshark is the other tool.

The catch with mitmproxy is that, for it to intercept requests, all traffic has to be routed through it. It seems there is no good way to do this. A hack is to set the following environment variables:

# In ~/.bashrc or ~/.zshrc:
export http_proxy=http://<windows-ip>:8080
export https_proxy=http://<windows-ip>:8080
export no_proxy=localhost,127.0.0.1

It’s a hack because these variables are not a standard, but many programs respect them [1].

Using `ss -tpln` you can get a list of processes with open sockets. To get the name of the actual process, run (1520952 is the PID below):

$ ps -fp 1520952
UID PID PPID C STIME TTY TIME CMD
siddjain 1520952 1518245 33 13:16 pts/2 00:00:32 /home/siddjain/.vscode-server/bin/6f17636121051a53c88d3e605c491d

Yep — that PID is a VS Code Server child.

Your ps shows the binary under the remote VS Code install:

/home/siddjain/.vscode-server/bin/6f17636121051a53c88d3e605c491d...

So the Node process listening on *:3761 was spawned by VS Code Server (likely an extension host, debug adapter, or a tool the extension started), not your own app.

Here’s how to see exactly what it is and why it’s on 3761:

  1. Get the full command line & executable
ps -o pid,ppid,user,lstart,cmd -p 1520952
readlink -f /proc/1520952/exe
tr '\0' ' ' < /proc/1520952/cmdline; echo

  2. See which socket it opened and from where
lsof -Pan -p 1520952 -iTCP -sTCP:LISTEN
ls -l /proc/1520952/cwd

  3. Walk the parent chain to confirm VS Code Server
ps -o pid,ppid,cmd -p 1520952
ps -o pid,ppid,cmd -p 1518245     # parent shown in your ps
pstree -sap 1520952               # if pstree is installed

  4. Quick “who’s on this port?” one-liner
lsof -iTCP:3761 -sTCP:LISTEN -n -P

If it turns out to be an extension:

  • Disable it temporarily in VS Code and reconnect (Command Palette → “Developer: Show Running Extensions” to see which extensions are running).
  • Or kill the server session: Command Palette → “Remote: Kill VS Code Server on Host”.
  • Worst case, nuke that specific server build and let it reinstall: rm -rf ~/.vscode-server/bin/6f17636121051a53c88d3e605c491d*

Copy contents of a file to the clipboard in WSL

cat file.txt | clip.exe

Protect against out of disk space

If you are running Docker (I hate it!) it can quickly cause you to run out of disk space and cause a catastrophic failure where WSL2 or even Windows cannot start. To protect yourself against this, create a 1GB file that you can delete when you run out of disk space – it’s like keeping 1GB in reserve that no one can touch. Here are the steps, courtesy of ChatGPT:

That “airbag” file is a tiny trick that can save you when a filesystem gets 100% full and basic tools start failing.

What it is

# create a 1 GiB reserve file in your home (same FS that fills up)
fallocate -l 1G ~/.space_rescue || dd if=/dev/zero of=~/.space_rescue bs=1M count=1024

  • fallocate -l 1G …: asks the filesystem to pre-allocate 1 GiB of real disk blocks for that file (fast, no zeroing).
  • || dd …: fallback if fallocate isn’t supported; dd actually writes zeros to fill 1 GiB (slower, but universal).

Either way, you end up with a normal file that really consumes space now.

How it helps

When the disk later hits 100%:

  • You can instantly free space by deleting or truncating this one file – no need to hunt for big folders or run rm -rf on trees:

    # fastest: just unlink it
    rm -f ~/.space_rescue
    # or keep the file but free its blocks
    : > ~/.space_rescue   # truncates to 0 bytes
  • That immediately gives you ~1 GiB of breathing room so that:
    • logs/services can write again,
    • package managers and shells stop crashing,
    • you can run du, ncdu, apt-get clean, etc., to do a proper cleanup.

Important details

  • Put it on the same filesystem that fills up. Verify with:

    df -h ~ ~/.space_rescue

    In WSL2, that means inside your Ubuntu ext4 (not under /mnt/c).
  • Name & location: your home (~/.space_rescue) is fine; don’t put it in /var where cleaners might delete it automatically.
  • Size: 1–2 GiB is plenty. Create larger if your workloads spike hard:

    fallocate -l 2G ~/.space_rescue
  • Recreate it after cleanup so you’re protected next time.
  • Doesn’t help Windows C: directly. It only frees space inside the WSL filesystem. To reflect the savings in Explorer, you still do:
    1. sudo fstrim -av (inside WSL),
    2. wsl --shutdown (Windows),
    3. compact the VHDX (DiskPart / Hyper-V – see the compacting steps below).

Extras & cautions

  • On copy-on-write or quota’d filesystems behavior can differ, but on WSL’s ext4 it works as expected.
  • If you accidentally place it on a different mount (/mnt/c), it won’t help when Ubuntu’s root (/) is full.

TL;DR: it’s a pre-reserved chunk of disk you can delete in one command to instantly recover from a “disk full, everything breaking” situation.


SOS: Compacting WSL2 VHDX when you run out of disk space

Prerequisites: HyperV Platform and HyperV Management Tools must be installed

  1. Free up space inside WSL:

rm -f ~/.space_rescue   # or whatever path you used
sync

  2. Still inside WSL2, tell ext4 to discard the freed blocks (important):

sudo fstrim -av

  3. Shut down WSL (releases the VHDX lock). From Windows (PowerShell):

wsl --shutdown

  4. Compact the VHDX. Option A (best): Optimize-VHD (requires the Hyper-V feature / module):

# Run in an elevated PowerShell
Optimize-VHD -Path "C:\path\to\ext4.vhdx" -Mode Full

Use WinDirStat to inspect the windows filesystem and find out which files are taking up most space.

Cleaning up WSL2

rm -rf ~/.vscode-server/data/CachedExtensionVSIXs
rm -rf ~/.vscode-server/data/User/workspaceStorage
npm cache verify
pip cache purge
du -h ~/.cache/pypoetry/virtualenvs -d 1 | sort -hr | head -n 30

Ubuntu Terminal Profile

Sometimes the profile gets messed up, so here is a copy.

For Icon choose File ms-appx:///ProfileIcons/{9acb9455-ca41-5af7-950f-6bca1bc9722f}.png


How to process data in Google BigQuery?

In this post we discuss some of the ways data in BigQuery can be processed. First, for those unfamiliar with it: what is BigQuery? BigQuery is Google’s OLAP (Online Analytical Processing) database. In my previous company we used Hive – maybe you have used that. If so, BQ is very similar. Its implementation is different, but it serves the same purpose. From the Hive documentation:

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

BQ is just that. It does not use HDFS, and it separates compute from storage (one key innovation behind its performance). This also allows Google to charge separately for storage vs. compute. Its storage is based on Google’s proprietary Colossus file system, the successor to GFS and the underlying technology behind GCS. Below is a graph comparing BQ to Hive (as well as Amazon Redshift and Azure Synapse) on Stack Overflow tag trends:

We can see more companies are now using BQ while usage of Hive is slowing down (BQ launched in 2010, btw). This is because BQ is a Platform as a Service, which means you don’t have to maintain and fine-tune any infrastructure. Google handles all of that for you, and you continue to benefit from continuous improvements to the software.

BQ is fully compliant with the ANSI SQL:2011 standard and adds additional features on top of that. The BQ SQL dialect is also known as ZetaSQL. Anyway, once you have data in BQ, how do you process it further? Below is a summary of the options, IMO:

Option 1: Process the data in BigQuery itself

Pros: Processing is fast, no additional technology to be used

Cons: All computation has to be expressed in SQL. Writing, refactoring and modularizing SQL is hard. Not everything can be expressed in SQL, and just because you can do something in SQL doesn’t mean you should.

When to Use: You are comfortable with SQL. All data is in BQ. dbt is a popular tool that can help with running the SQL code, auto-persisting the query results and other tasks.

Option 2: Console Application

Pros: Hybrid approach: you decide what you want to do in SQL and what not. You don’t have to jump to advanced technologies like Spark or Dataflow, which take time to learn. This option is familiar to developers who have experience querying, say, MySQL from PHP – it’s the same design. BQ provides client libraries in 7 languages: Java, Python, Node.js, Go, C#, PHP and Ruby.

Cons: IMO there is little in terms of cons, because you decide how much to do in SQL vs. Java or Python – i.e., how much computation to push to the database and how much to do on the client. With Option 1, all computation is pushed to the database.

When to Use: Not all data is in BQ (in which case you can’t process it in BQ itself). Or use it when SQL is not your forte and you want to build a traditional application written in Java/Python etc. with a CI/CD pipeline.

Option 3: Dataflow

Pros: Massively parallel distributed computation

Cons: Another technology to learn and master. Performance might not be better than if you had done your processing in BigQuery itself.

When to Use: When a console app will not be able to give the desired performance, or when not all data is in BQ.

Let me know what option you are using in your organization and what you think.


StackOverflow Developer Survey 2022

The SO Developer Survey for 2022 is out. Some things that struck me as odd:

I just can’t believe that people hate MATLAB – I love it. Similarly, I was surprised to find that R and PHP are dreaded languages; I actually like them. For me, Java would be a dreaded language.

The next thing that caught my attention was this:

I hate Docker. It is so buggy. And difficult to use. But there is no alternative so I use it.

I have never used PostgreSQL, but it is interesting to see that it is the most loved DB.

Among professional developers, MySQL and Postgres are actually tied in terms of usage.

Another oddity: Oracle is one of the most dreaded databases per the SO developer survey, with 2 people voting it down for every thumbs up (a 2:1 dreaded:loved ratio). However, it is the #1 database in the db-engines.com rankings.

I think it just goes to show that rankings can often be misleading and should be taken with a grain of salt.


Query Google BigQuery from a Scala console app

Objective

How can we query Google BigQuery from a Scala console app?

Step 1 – Create a project on Google Cloud, enable billing, install the Google Cloud SDK, initialize and authenticate

Refer to the online documentation for how to do all this. You need to have a project with billing enabled to use BigQuery. You may also need to enable the BigQuery API. To initialize the SDK and obtain an access token, run:

$ gcloud init
$ gcloud auth application-default login 

gcloud init will set the project for you. If you want to change it later, you can do so using:

$ gcloud config set project PROJECT_ID

You can also create a service account if you want. The code to do that from a bash script is as follows (credit to Lak):

gcloud iam service-accounts create ${ACCTNAME} --description="My Service Account"
gcloud iam service-accounts keys create keyfile.json \
  --iam-account ${ACCTNAME}@${PROJECT}.iam.gserviceaccount.com

for role in roles/bigquery.dataEditor roles/bigquery.jobUser; do
  gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${ACCTNAME}@${PROJECT}.iam.gserviceaccount.com --role ${role}
done

This will create a service account and a file keyfile.json on your computer.
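For illustration, here is a minimal sketch of how an application might pick up that key file. GOOGLE_APPLICATION_CREDENTIALS is the standard environment variable Google's client libraries look for; the exact path is an assumption — point it at wherever you saved keyfile.json:

```shell
# Point Google Cloud client libraries at the service-account key created above.
# The path is an assumption; adjust it to where keyfile.json actually lives.
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/keyfile.json"
echo "Using credentials: $GOOGLE_APPLICATION_CREDENTIALS"
```

With this variable set, client libraries that use Application Default Credentials will authenticate as the service account instead of your user account.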

Step 2: Install sbt and all its dependencies (JVM etc.)

Details are outside the scope of this article; refer to the online documentation.

Step 3: Create an empty folder and initialize it with a build.sbt

Create an empty folder and add a build.sbt file to it with the following contents:

name := "Sid's project"

scalaVersion := "2.12.16"

libraryDependencies ++= Seq(
	// If you use organization %% moduleName % version rather than organization % moduleName % version (the difference is the double %% after the organization), sbt will add your project’s binary Scala version to the artifact name.
	// https://www.scala-sbt.org/1.x/docs/Library-Dependencies.html
	"org.apache.spark" %% "spark-sql" % "3.3.0",
	// https://github.com/GoogleCloudDataproc/spark-bigquery-connector
	"com.google.cloud.spark" % "spark-bigquery-with-dependencies_2.12" % "0.25.1"
)

Step 4: Write Scala code to query BigQuery

A minimal example is as follows. Save it in a file named SimpleApp.scala:

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.google.cloud.spark.bigquery._

object SimpleApp {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("Simple Application").config("spark.master", "local").getOrCreate()
    val df = spark.read.option("parentProject", "XXX").bigquery("bigquery-public-data.samples.shakespeare")
    df.show()
    val x = df.groupBy("word").agg(sum("word_count").alias("count")).orderBy(col("count").desc)
    x.show(100)
    spark.stop()
  }
}

Replace XXX with your project ID. df is a Spark SQL DataFrame. In the above we are querying the table shakespeare from the public dataset samples under the project bigquery-public-data. Ideally we shouldn't have to run Spark, since our objective is to develop a console application, but the Scala library I found for reading from BQ is coupled to Spark. Assignment: decouple it from Spark. After reading the data we compute the count of every word and sort in descending order of the count.

Step 5: Run sbt

We are now ready to run the code. Start by launching the sbt server:

$ sbt

You should get an output like below:

[info] welcome to sbt 1.6.2 (Oracle Corporation Java 11.0.14)
[info] loading project definition from /Users/XXX/bigquery-scala/project
[info] loading settings for project bigquery-scala from build.sbt ...
[info] set current project to Sid's project (in build file:/Users/XXX/bigquery-scala/)
[info] sbt server started at local:///Users/XXX/.sbt/1.0/server/93733b8d4d1bbdc577aa/sock
[info] started sbt server

Step 6: Compile the code

Run

sbt:Sid's project> compile
[info] compiling 1 Scala source to /Users/XXX/bigquery-scala/target/scala-2.12/classes ...
[success] Total time: 4 s, completed Jun 19, 2022, 12:15:58 PM

Step 7: Run the code

sbt:Sid's project> runMain SimpleApp

Truncated output:

+------+-----+
|  word|count|
+------+-----+
|   the|25568|
|     I|21028|
|   and|19649|
|    to|17361|
|    of|16438|
|     a|13409|
|   you|12527|
|    my|11291|
|    in|10589|
|    is| 8735|
|  that| 8561|
|   not| 8395|
|    me| 8030|
|   And| 7780|
|  with| 7224|
|    it| 7137|
|   his| 6811|
|    be| 6724|
|  your| 6244|
|   for| 6154|
|  this| 5803|
|  have| 5658|
|    he| 5411|
|   him| 5407|
|  thou| 4890|
|  will| 4737|
|    as| 4516|
|    so| 4319|
|   her| 4131|
|   The| 4070|
|   but| 3994|
|   thy| 3876|
|   all| 3681|
|    To| 3592|
|    do| 3433|
|  thee| 3370|
| shall| 3282|
|   are| 3268|
|    by| 3165|
|    on| 3027|
|    no| 2934|

Verify against BQ:

This finishes the reading from BQ part. What if you want to store results to BQ?

Writing to BigQuery

I wasn’t able to get it to work. In theory the code to write to BQ is:

x.write.format("bigquery").option("writeMethod", "direct").save("sample.scala_word_count")

However this gave me the following error:

[error] java.lang.IllegalArgumentException: Either temporary or persistent GCS bucket must be set

It looks like the library we are using first stages the data to GCS (Google Cloud Storage) and then loads it into BQ, i.e., option("writeMethod", "direct") does not work. There are a number of steps you have to do (at least that I tried doing) to get it to work:

First, create a GCS bucket and add this code to SimpleApp.scala:

val conf = spark.sparkContext.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
val bucket = spark.sparkContext.hadoopConfiguration.get("fs.gs.system.bucket")
spark.conf.set("temporaryGcsBucket", "your-gcs-bucket")

Then add the following dependency to build.sbt:

"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.7"

Even with these changes, I got the following error, after which I gave up:

22/06/19 17:24:29 WARN FileSystem: Failed to initialize fileystem gs://my-gcs-bucket/.spark-bigquery-local-1655684190889-861d7ac8-c4c6-4e20-a2b0-c0f356445c72: java.io.IOException: Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
Caused by: java.net.SocketTimeoutException: connect timed out

I then tried using another version of the library based on what I read here.

The Spark 2.4 dedicated connector supports writing directly to BigQuery without first writing to GCS, using the BigQuery Storage Write API to write data directly to BigQuery. In order to enable this option, please set the writeMethod option to direct

Here are the details:

"org.apache.spark" %% "spark-sql" % "2.4.0",
	// https://github.com/GoogleCloudDataproc/spark-bigquery-connector
	// https://repo1.maven.org/maven2/com/google/cloud/spark/spark-2.4-bigquery/0.25.1-preview/spark-2.4-bigquery-0.25.1-preview.pom
	"com.google.cloud.spark" % "spark-2.4-bigquery" % "0.25.1-preview"

and this gave me this. At that point it was clearly too much. To see the internals of the write implementation refer to this.

Running from Scala shell (REPL)

You can also run the code from the interactive shell, which is useful when prototyping. To do that run:

$ sbt console

This will launch the Scala interpreter (REPL) instead of the sbt server. See this for the difference between scala and sbt console.

SBT is tied to a specific project defined by a build.sbt file in a way that sbt console will load up the same REPL environment as scala but with the addition of all of the project code and dependencies defined in the build available for import. Also, it will use the version of Scala defined by build.sbt

What I learnt

It is possible to write a console app that queries BQ from Scala, but the code is coupled to Spark, which is undesirable (or maybe that's the only way to get the data as a DataFrame), and I wasn't able to write to BQ. Scala is a powerful language but difficult to learn and not for the average developer. The difficulty is compounded by the fact that there are fewer examples than you can find for languages like Python or Java, and a smaller community to help you out. If you are bent on using Scala for something, you can find libraries, but I would describe them as indie (as in indie movies; the term is used to mean "independent", referring to movies, music and video games created without financial backing from major companies) – use at your own risk, short on examples, not feature complete, possibly buggy. Let me know what you think.

Posted in Computers, programming, Software | Tagged , | Leave a comment

Principles of Engineering Excellence

What is Engineering Excellence? EE to me is the passionate pursuit of perfection; it's really about taking pride in what we build. EE is about code quality, developer experience, technical documentation, instrumentation, performance and scalability. It is not about maximizing the # of commits we make every day. If you measure your team velocity by the # of commits / day, that's the wrong metric. Team velocity should be measured by the time it takes to make a release. EE avoids taking shortcuts and incorporates security best practices. Because of this, it can be at odds with release velocity. Below is a checklist you can use to evaluate where you stand when it comes to EE:

  • Do you have a code repository?
  • Do you have an efficient code-review process? What is your check-in policy? I have a simple code-review policy: code is assumed to pass review if the # of approvers > the # of rejects. Developers are prickly when it comes to reviews, and requiring approval from everyone leads to nasty quarrels.
  • Are PRs linked to work items in a task tracking system like JIRA etc.?
  • Do you have policies in place that protect your public branches (master, release etc.) from malicious pushes (commits)?
  • Do you follow trunk based development?
  • Do you have CI? To me, CI comprises 4 things:
    • A PR is automatically rejected if it's not on top of the latest code in the master or main branch (--ff-only)
    • Every PR triggers an automated build and associated tests whose results can be viewed by reviewers
    • Every commit triggers an automated build and other validation tests
    • Bonus: Bad commits are automatically rolled back
  • Do you squash PRs (git merge --squash --ff-only) and delete PR branches?
  • Do you have a deployment pipeline? (Azure Pipelines, Bitbucket Pipeline, Jenkins etc.). Deployments don’t have to be automated but is there a one-click deployment process?
  • Do you have a deployment dashboard where deployments can be rolled back?
  • Do you have 3 deployment environments – dev, qa, prod?
  • Do you use federated IAM? (the defining characteristic of federated IAM is that your application never sees the user’s password)
  • If not, do you salt and hash passwords?
  • Do you use service accounts / managed identity?
  • Do you use a secret manager / Key Vault for storing passwords, private keys, tokens and other secrets?
  • Are your secrets periodically rotated?
  • Are you checking in passwords / secrets into source control? (answer should be no)
  • Do your developers have write privileges to production databases?
  • Do you have Role Based Access Control (RBAC)?
  • Do you mask sensitive data (PII, restricted, HIPAA etc.)?
  • Do you have data backup, disaster recovery, geo-replication and data loss prevention? Have you tested it?
  • Do you have audit-logs to investigate unauthorized access to data?
  • Do you have a troubleshooting runbook (SOS manual) that can be relied upon during on-call? Do you keep it up-to-date? Is it written in Markdown?
  • Do you have a README.md in your repo that explains to a new developer how to install all dependencies and pre-requisites, get permissions, build, run, debug, test and deploy the code?
  • Do you use Microsoft Word and Sharepoint for technical documentation? (answer should be no)
  • Do your developers spend more time in Powerpoint and less time in Markdown and Graphviz? (answer should be no)
  • Do you test your unhappy code path (the code in the catch block)?
  • Have you tested all possible error conditions (things that can go wrong)?
  • Bonus: Do you perform chaos testing?
  • Are your applications instrumented? Do you have an application performance monitoring dashboard where you can monitor resource usage, traffic, latency etc.?
  • Are you testing your webapps using webpagetest.org or another such tool? 
  • Are you logging and alerting on 5xx errors and unexpected exceptions?
  • Are your APIs gated behind a rate-limiter?
  • Do you perform any static code analysis (SonarQube)?
  • Do you check your dependencies for vulnerabilities?
  • If using mocks, are you aware that code that is mocked is effectively not tested and should therefore be excluded from code-coverage? This is why I hate mocking. Mocking hides problems in the code and gives a false sense of security.
  • Do you have an E2E test environment where there is absolutely no mocking?
  • Bonus: are you doing test driven development?
  • Bonus: are you doing automated UI testing (selenium)?
  • Are you disposing of resources properly? Do you understand the dispose pattern in C# / Java? The difference between managed and unmanaged resources?
  • Do you use connection pooling for database access?
  • Are your REST endpoints / APIs formally documented?
  • Are you using HTTP-JSON for public facing APIs and gRPC / Thrift etc. for internal APIs (microservices)? Public API = an endpoint called by a web or mobile application in response to user-interaction. Internal API = an API called by a program e.g., data ingestion pipelines.
  • Do you have a fragmented architecture making use of disparate technologies or an integrated ecosystem?
  • Instead of thinking of Performance! Performance! Performance! think Tests! Tests! Tests!

Further Reading

However, on a large team in an enterprise environment I often do not recommend artificially keeping a linear history – TOP VOTED ANSWER

However I still maintain my position of keeping a linear git commit history. It is easier to understand, analyze and troubleshoot with git bisect. Prior to git we used TFS. It automatically rejected a commit if the developer's code was not synced to the latest. Yes, there is a downside: you always need to sync to the latest and test that your changes are still good – there is extra overhead, but I think it's worth it.

Posted in Computers, Software | Leave a comment

Git: Back to the Basics

Sample ~/.gitconfig file:

[user]
	name = Siddharth Jain
	email = sid@gmail.com

[init]
	defaultBranch = master

[core]
  editor = code --wait

[diff]
  tool = vscode

[difftool "vscode"]
  cmd = code --wait --diff $LOCAL $REMOTE

[merge]
  tool = vscode

[mergetool "vscode"]
  cmd = code --wait $MERGED

[pager]
        branch = false
        config = false

[alias]
    lg = lg1
    lg1 = lg1-specific --all
    lg2 = lg2-specific --all
    lg3 = lg3-specific --all

    lg1-specific = log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(auto)%d%C(reset)'
    lg2-specific = log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset)%C(auto)%d%C(reset)%n''          %C(white)%s%C(reset) %C(dim white)- %an%C(reset)'
    lg3-specific = log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset) %C(bold cyan)(committed: %cD)%C(reset) %C(auto)%d%C(reset)%n''          %C(white)%s%C(reset)%n''          %C(dim white)- %an <%ae> %C(reset) %C(dim white)(committer: %cn <%ce>)%C(reset)'

What it does: it sets VS Code as the diff and merge tool. The commands under alias are useful for visualizing merge history. See this, this, this.

Understand merging and branching

git merge is used to merge branches. The command is run on the target branch. You can implement different merge strategies using the --ff, --no-ff, --ff-only and --squash options. Atlassian explains them over here. The settings that I like are as follows:

  • Use git merge --no-ff to merge changes from release or staging branches into the main or master branch. If you are following trunk based development (recommended), it actually advises against merging from release into master; instead it recommends cherry-picking commits from master into release. See this.
  • Use git merge --squash --ff-only to merge feature branches into the main or master branch
source: https://trunkbaseddevelopment.com/branch-for-release/#cherry-picks-from-the-trunk-to-branch-only

Feature branches are meant to be temporary and deleted after the PR is merged. In contrast, release or staging branches are meant to be permanent. git merge --no-ff will let you restore a release branch if you delete it accidentally, because it preserves the complete history of changes. We'll see how later in the article.

The --squash option does not preserve the history of changes and will not show any branching when you visualize the commit history. This is what is referred to as a non-merge commit, e.g., in this article. It is also known as a fast-forward commit.

It's called a non-merge commit because if you visualize the commit history, you won't see two branches (the feature and master) merging into one. What you will see is a linear commit history, as if the developer had made changes on top of master without checking out a feature branch. This is what we want when developers are merging their changes as part of the normal workflow. We are not interested in seeing all the tiny work-in-progress changes they made, nor in keeping track of when they cut their feature branch from master.

The --ff-only option will not allow a developer to merge their changes (i.e., the PR will not succeed) unless they are synced to the latest code in master. To sync to the latest, the developer should:

  • checkout master branch (git checkout master)
  • run git pull to pull latest code from the remote master
  • switch to their feature branch (git checkout feature-branch)
  • run git rebase master on their feature branch
  • resolve conflicts. More on how to do this later.
  • Test the new changes!

They should then push their changes back to the remote PR branch, and now the --ff-only option will not complain when we try to merge the PR.
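The steps above can be sketched end-to-end in a scratch repo. This is an illustrative sketch (branch and file names are made up; run it in an empty directory):

```shell
# Scratch-repo sketch of the sync-then-merge workflow described above.
set -e
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/master   # ensure the initial branch is named master
git config user.email dev@example.com && git config user.name Dev
echo a > foo.txt && git add foo.txt && git commit -q -m "initial"
git checkout -q -b feature-branch
echo b >> foo.txt && git commit -q -a -m "feature work"
git checkout -q master
echo c > bar.txt && git add bar.txt && git commit -q -m "someone else's commit"
# The feature branch has now diverged from master; rebase it on top...
git checkout -q feature-branch
git rebase -q master
# ...and the fast-forward-only merge now succeeds:
git checkout -q master
git merge --ff-only feature-branch
```

After the rebase, the merge is a pure fast-forward: the history stays linear, with no merge commits.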

NEVER run git rebase on the master or other protected branches. See the golden rule of rebasing.

The golden rule of git rebase is to never use it on public branches.

This is because git rebase rewrites the commit history, and you most likely don't want that for your master or release branches. It's always the feature branches that are supposed to be rebased on top of the master branch.

If you are using GitHub, the default merge option on GitHub corresponds to git merge --no-ff.

For completeness, we next cover what git merge --ff does – this is the default, btw, if you don't specify any options to git merge. If the source branch has not diverged from the master (target) – meaning that if you start at the HEAD of the source branch and trace the parent commit recursively, you will arrive at the commit that is the HEAD of the target branch – then --ff creates a non-merge commit similar to --squash. If you visualize the history, you won't see any branching and merging; you will see a linear commit log. The difference from --squash is that you will see all the commits. If the source branch has diverged from the master, then --ff will create a merge commit. So --ff means fast-forward if you can. Diverged equivalently means that the developer is not synced to the latest – the language I used earlier to explain the effect of --ff-only.

The --no-ff option will always create a merge commit even if source and target have not diverged. I feel this is useful when merging release branches onto master.

Below is what a merge commit looks like when source and target have not diverged. This is what you will get if you used the --no-ff option to do the merge:

With the --ff option this would appear as a linear series of commits in the git history, without any branching and merging, as if the developer had made their changes directly on top of master. The HEAD of master is just fast-forwarded to the latest commit:

* 6372e95 - Thu, 26 May 2022 14:17:28 -0700 (58 seconds ago)
|           changes 6 - Siddharth Jain
* 0c287bd - Thu, 26 May 2022 14:17:06 -0700 (80 seconds ago)
|            changes 5 - Siddharth Jain
* 6878843 - Thu, 26 May 2022 14:12:48 -0700 (6 minutes ago)
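The contrast can be sketched concretely in a scratch repo (illustrative names; run in an empty directory): even when a plain merge would fast-forward, --no-ff forces a two-parent merge commit.

```shell
# --no-ff creates a merge commit even when the branches have not diverged.
set -e
git init -q demo-noff && cd demo-noff
git symbolic-ref HEAD refs/heads/master
git config user.email dev@example.com && git config user.name Dev
echo a > foo.txt && git add foo.txt && git commit -q -m "initial"
git checkout -q -b feature
echo b >> foo.txt && git commit -q -a -m "feature work"
git checkout -q master
# A plain `git merge feature` would just fast-forward here; --no-ff does not:
git merge --no-ff -m "merge feature" feature
```

Afterwards git log --graph shows the feature branch joining master, and the tip commit has two parents.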

This is what a divergence looks like:

The developer checked out their feature branch off of commit 256ad63, but while they were working on it, someone committed 35da6ba onto master. Now the developer is no longer synced to the latest. The --ff-only option will not let commit bf4fcf9 be merged onto master. You will get something like below if you try it:

$ git merge mychanges --ff-only
fatal: Not possible to fast-forward, aborting.

Pop Quiz: What is the parent of commit 46494e4? Is it bf4fcf9 or 35da6ba or undefined? You can get the answer by running:

$ git checkout HEAD~1
Previous HEAD position was 46494e4 merge
HEAD is now at 35da6ba master branch changes

So HEAD~1 resolves to 35da6ba: HEAD~1 follows the first parent, which is the tip of the branch the merge was run on (master). The convention is somewhat arbitrary – in Jon Loeliger's book he says:

A merge commit has more than one parent

Version Control with Git, p. 76

This is a good working definition of a merge commit. A merge commit preserves history: you can restore state bf4fcf9 even after you delete the feature branch, by running git checkout bf4fcf9. Your HEAD will then be in a detached state.

Below, then, is a summary of the options that can be used with git merge:

  • --ff (Default): Make a fast-forward (aka non-merge) commit if source branch has not diverged from target, otherwise make a merge commit
  • --no-ff: Do not make a fast-forward commit even if source branch has not diverged from target
  • --ff-only: Reject the merge if the source branch has diverged from the target branch. If not, a fast-forward commit will be made. You will never run into merge conflicts with this option.
  • --squash: consolidate all commits – the changes – in the source branch into one and make a fast-forward commit. Same as --squash --ff.
  • --squash --ff-only: Reject the merge if source branch has diverged from the target. Otherwise, do the same thing as --squash would do.

Note that --squash --no-ff is not a valid combination since --squash by its nature always results in a fast-forward commit.

Fast-forward commit = Non-merge commit = Linear commit history = Single parent = it looks as if developer made changes directly on top of the target branch
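The --squash behavior can be sketched the same way in a scratch repo (illustrative names; run in an empty directory): all the work-in-progress commits collapse into one single-parent commit on master.

```shell
# --squash stages the combined diff of the source branch; committing is up to you.
set -e
git init -q demo-squash && cd demo-squash
git symbolic-ref HEAD refs/heads/master
git config user.email dev@example.com && git config user.name Dev
echo a > f.txt && git add f.txt && git commit -q -m "initial"
git checkout -q -b feature
echo b >> f.txt && git commit -q -a -m "wip 1"
echo c >> f.txt && git commit -q -a -m "wip 2"
git checkout -q master
git merge --squash feature
git commit -q -m "add feature"   # one linear commit holding both wip changes
```

The resulting history contains only two commits, neither of which is a merge commit, yet the file holds all the feature changes.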

The above discussion covers only a subset of git merge. Refer to the online documentation for a list of all the options.

Resolving conflicts while rebasing or merging

Follow these steps to resolve conflicts while rebasing:

  • run git mergetool
  • if you are using the .gitconfig in this article, VS Code should open (or otherwise whatever mergetool you have configured should open)
  • Use VS Code to resolve conflicts
  • save the file(s)
  • close the file(s)
  • switch back to terminal
  • run git rebase --continue
  • One of two things will happen:
    • all conflicts are resolved and VS Code opens again with a commit message. Edit the message as appropriate, save and close the file. The rebase should show as completed on the terminal (command-line).
    • the rebase process encounters more conflicts, in which case you repeat the above steps all over again

The commands to resolve conflicts while merging are a little different: you run git commit instead of git rebase --continue. E.g.:

ᐅ  git status
On branch master
All conflicts fixed but you are still merging.
  (use "git commit" to conclude merge)

Changes to be committed:
	modified:   foo.txt

Trunk Based Development

This is the development workflow recommended and followed by many companies these days. There is a main branch and optional release branches. Chances are you are doing it without realizing it. In this model:

  • You may cut out release branches or release directly from master, e.g., if you have daily releases the team can tag the main trunk at the end of the day as a release commit
  • The main branch is assumed to always be stable, without issues, and ready to deploy
  • Feature flags are used to contain code that is under progress and should not be turned on until complete
  • Instead of merging bugfixes from release into master, it recommends committing the bugfix into master and then cherry-picking it into release. This rule for Trunk Based Development remains difficult to accept, even within teams practicing everything else about Trunk-Based Development (ref).

 if you are fixing bugs on the release branch and merging them down to the trunk you are doing it wrong…Bugs should be reproduced and fixed on the trunk, and then cherry-picked to the release branch

See this to make sure you are not doing it wrong.
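The fix-on-trunk, cherry-pick-to-release flow can be sketched in a scratch repo (branch and file names are illustrative; run in an empty directory):

```shell
# Fix bugs on the trunk (master), then cherry-pick them onto the release branch.
set -e
git init -q demo-cherry && cd demo-cherry
git symbolic-ref HEAD refs/heads/master
git config user.email dev@example.com && git config user.name Dev
echo v1 > app.txt && git add app.txt && git commit -q -m "1.0 code"
git branch release-1.0
echo v2 > app.txt && git commit -q -a -m "new feature on trunk"
echo fixed > hotfix.txt && git add hotfix.txt && git commit -q -m "bugfix on trunk"
# Bring only the bugfix (the tip of master) onto the release branch:
git checkout -q release-1.0
git cherry-pick master
```

The release branch picks up hotfix.txt but not the unrelated feature work – app.txt still holds the 1.0 code.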

The flowchart below shows the code check-in process visually:

The code-checkin process. Boxes colored grey indicate activity that happens on BitBucket, GitHub etc. Boxes colored white represent activity on local computer. source.

Compare it to this if you like. You should continuously check whether you are rebased on top of the latest. To keep the flowchart simple, I haven't added additional boxes to reflect that. E.g., even before pushing your changes to the remote branch, it's a good idea to check whether you are synced to the latest. The source code of the flowchart can be found here.

More

Posted in Computers, Software | Tagged | Leave a comment

React File Picker

We start with functional requirements:

  • a React control that allows user to select multiple files for uploading to a server
  • user should be able to order the files (i.e., the order of files matters)
  • user should be able to delete (remove) files from the selection

Bonus:

  • the application can set a limit on the number of files user can select for uploading
  • the application can set a limit on the maximum individual file size. user cannot select file more than max file size
  • the application can predefine the file types user can select (e.g., only png and jpg images)

This seems like a common enough scenario, but surprisingly there is no generic control in HTML with the above features. HTML comes with the <input type="file"> element, which we will use as the foundation of the control we are developing. Familiarize yourself with this HTML element before proceeding.

Ready? Let’s get started.

Step 0: Install Node.js and CRA (create-react-app)

Step 1: Run npm init to create a new Node.js project

A note: even though we are creating a Node.js project, the React control we are developing is going to run in a browser. Remember, Node.js is a server-side technology. Your browser (Chrome) does not run Node.js; it provides a JavaScript execution environment, but that is not Node.js. The distinction is important to remember.

Step 2: The React code

If you are new to React, first understand the difference between classic React class components (legacy) and the newer functional components (recommended). If you browse code on the web, you are likely to come across both. The two use different programming styles, so it is easy for a newbie to get confused. Also, functional components are NOT stateless; they are called functional because they don't use classes. We will create a functional component.

The heart of the component is a list object, which is nothing but the list of files the user has selected. The order of items in this list matters. The list is used to render an HTML table which displays metadata about each file (such as filename, size, file type, etc.) together with buttons to reorder items in the list or remove them. A picture is worth a thousand words, so see the example below:

React will react to changes in this list and automatically update the HTML table without us having to do anything. That's why it's called React, btw – because it reacts to changes in variables. Such variables are known as observables, and they remind me of an old library, KnockoutJS, that I used long before React appeared. This is known as the Observer pattern in programming.

Here is the skeleton of our React component:

import React, { useState } from "react";

const FileInput = ({ onChange, maxFileSize, accept, maxFileCount }) => {
  const [list, setList] = useState([]);

  const handleUp = (e, i) => {
    // handle up button and re-order list accordingly
  };

  const handleDown = (e, i) => {
    // handle down button and reorder list accordingly
  };

  const handleDelete = (e, i) => {
     // remove item from the list    
  };

  const validate = (file) => {
    // validate that file does not exceed predefined maxFileSize
  };

  const renderHtmlTable = () => {
    // render the list as an HTML table
  };

  const renderFileInput = () => {
    // render <input type="file"> HTML element which allows user to add items to the list
  };

  return (
    <>
      {renderHtmlTable()}
      {renderFileInput()}
    </>
  );
}

export default FileInput;

Completing the methods is left as an exercise for the reader. Explanation of the arguments:

  • onChange: an event handler that is called whenever the file list changes (addition, deletion or change in order of items in the list)
  • maxFileSize: a number, in bytes. The user is not allowed to select a file whose size is greater than maxFileSize.
  • accept: same as the accept attribute of <input type="file">. A string that defines the file types the file input should accept.
  • maxFileCount: used to limit the number of files the user can select.

The useState method is how functional components access state in React. Familiarize yourself with it in case you don’t know about it.

The component requires the react and react-dom dependencies to function. These dependencies should be declared under the peerDependencies section in package.json to avoid having two copies of React in the final application, which causes a runtime error.
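For illustration, a minimal sketch of what that section of package.json might look like (the version ranges here are assumptions; match them to the React version you develop against):

```json
{
  "peerDependencies": {
    "react": ">=17.0.0",
    "react-dom": ">=17.0.0"
  }
}
```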

Step 3: Building and packaging the code

JSX code has to be compiled to JavaScript for browsers to understand it. For this we use Babel. The component in this case is small (just one file), so we don't need any fancy packaging, but for bigger projects we can use webpack or rollup. webpack can call Babel for you as shown below (this code goes in webpack.config.js):

// webpack.config.js
const path = require('path');
const pkg = require('./package.json');

module.exports = {
    entry: "./src/FileInput.js",
    output: {
      path: path.resolve(__dirname, 'dist'),
      filename: "main.js",
      library: pkg.name,
      libraryTarget: "umd",
      umdNamedDefine: true
    },
    module: {
      rules: [
        {
          // run .js and .jsx files (outside node_modules) through Babel
          test: /\.(js|jsx)$/,
          exclude: /node_modules/,
          use: {
            loader: "babel-loader"
          }
        }
      ]
    }
};
The Babel and webpack dependencies should go under devDependencies in package.json, since you only need them during the build. The built code (i.e., the code that you release) does not need them.

If you are writing in TypeScript then just add

"jsx": "react-jsx"

under the compilerOptions section of your tsconfig.json. The TS compiler will take care of compiling JSX to JS.

Step 4: Testing the code

To test the module before publishing to npm, pack it as a tarball:

$ npm pack

This will pack all the files listed under the files field in package.json and create a tarball (.tgz file). Then copy this tarball to your test project and install it by running:

$ npm i siddjain-react-bootstrap-file-input-1.0.0.tgz

Some online docs suggest using npm link to test a package locally before publishing to npm. In my experience, do NOT use npm link, otherwise you will run into the invalid hook call error.

Posted in Computers, Software | Tagged , | Leave a comment

M1 Mac Mini Review

I bought an M1 Mac mini after reading the rave reviews on YouTube and elsewhere. E.g., over here an enthusiastic person writes:

My 2021 MBA M1, 8GB, 256GB just works waaaay faster than my 2018 MacMini i7 (6-core 3.2GHz), 32GB, 500GB ssd

But I am not impressed with it:

  • In my tests it performs the same as a Beelink PC I have with an 8th-generation Intel i5 processor – which I believe is a mid- to even low-end processor, btw. For the test I used the program I use to make the digital paintings I sell on sidstick.com. The Beelink has twice the RAM and SSD and costs less than the Mac. I am not suggesting that an Intel i5 PC is faster than an M1 Mac mini, but that a Mac mini is not always faster than a PC.
  • No way to change the scroll direction of the mouse from Natural to "un-natural" (what a Windows user would expect) [1].

The pros are:

  • Yes, it looks nice, like all Apple products, which emphasize form over function. E.g., it has all the ports on the back, which gives it a clean look from the front, but it's very inconvenient to have to reach around to the back to swap cables.
  • It has a premium feel to it, not necessarily because it's better but because you paid more for it
  • You can impress your friends. Apple products naturally appeal to the vain types. After all, why else would anyone pay 2x for the same thing?
  • macOS for some reason is snappier than Windows. It boots faster. Applications also start up faster.
  • Very quiet. No fan noise.

Posted in Computers | Tagged , | Leave a comment