
BigQuery & Progressive Offline-First Web Apps

15–20 years back, people thought of normalising and recombining data in terms of MapReduce: parallel mapper functions that take scalar input and output it in a different form, feeding a reducer function that combines the mappers’ outputs.
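
A minimal sketch of that mental model, in plain Python with no actual distribution; the function names are illustrative:

from functools import reduce

# Mapper: takes a scalar input and outputs it in a different form.
def mapper(x: int) -> int:
    return x * x

# Reducer: recombines the mapped outputs into one result.
def reducer(acc: int, value: int) -> int:
    return acc + value

records = [1, 2, 3, 4]
mapped = map(mapper, records)        # conceptually runs in parallel
result = reduce(reducer, mapped, 0)  # the normalise/recombine step
print(result)                        # 30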

How do you use it? Simple, they said: split in-memory jobs up across machines.

So they would spawn server nodes and, instead of local IPC, rely on actor-like message passing or an MPI model, wrapping it with a protocol like ZeroMQ to transfer batch outputs from worker to worker.
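
A minimal sketch of that worker-to-worker handoff using pyzmq’s PUSH/PULL sockets; the port and batch contents are illustrative, and the two functions would run in separate processes:

import zmq

def producer():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)        # actor-like: fire batches downstream
    sock.bind("tcp://*:5557")          # port is illustrative
    for batch in ([1, 2], [3, 4]):
        sock.send_json(batch)          # batch output handed to the next worker

def worker():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect("tcp://localhost:5557")
    while True:
        batch = sock.recv_json()       # blocks until a batch arrives
        print(sum(batch))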

And masters would work till failure. They would use all kinds of techniques, round robin, interrupt-based notifications, to transfer loads and monitor logs. Soon these outfits doing data crunching, coupling R/Python libraries on top of SQL queries, started to sound cool. But it was a pain in the butt when, at 4 in the morning, someone’s job failed because of temporary Heroku resource maintenance, a memory overload, or dead nodes. Not every stack is a telecom stack running on BEAM supervision trees. But they all wanted to emulate supervision trees that would come back after failure and continue abusing memory storage, because hey, memory’s cheap!

It took online interactive analytics meeting big data, with 100 MBps hard-drive transfer rates and 3200 cores on a 330 gigabit-per-second network, to change that. Analysing small-to-mid-scale datasets of read-only, in-place nested data, by combining multi-level execution trees working in parallel (not logarithmically) over a columnar data layout, at interactive response times was unheard of 15 years back. That too on a single query?!

Each SQL call can split either relational tables, as a union of rows, or structured NoSQL data, as columns, recursively down an execution tree, with the query rewritten at each step. The queries reach the leaves, which scan the tablets in parallel and build back up the tree with partial aggregations.
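
A toy sketch of that build-back step, assuming a two-level tree and a simple SUM; the function names and fan-out are illustrative, not Dremel’s actual internals:

# Leaves scan their tablets (in parallel in the real engine; sequentially
# here for clarity) and emit partial aggregates; upper levels merge them.
tablets = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def leaf_scan(tablet):
    # Leaf server: scan columnar data in place, return a partial SUM.
    return sum(tablet)

def intermediate(partials):
    # Intermediate/root server: merge partial aggregations going up.
    return sum(partials)

partials = [leaf_scan(t) for t in tablets]
print(intermediate(partials))   # 45, same as a flat SUM over all rows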

Google’s BigQuery runs on Dremel, which is one such query engine.

How can a simple data-munging job over a few meagre columns go south, sour, and slow? If you need to go from, say, a web app to the query engine and back multiple times, you’re probably using the wrong tool or in need of a different context. Do one such query, done right, only once per API call.

WITH
  subquery1 AS (
    -- src stands in for the scanned table, which the original omits
    SELECT a, STRUCT(b, ARRAY_AGG(DISTINCT c IGNORE NULLS) AS c) AS bc
    FROM src
    GROUP BY a, b),
  subquery2 AS (
    SELECT AVG(a) AS avg_a, bc.b AS b
    FROM subquery1
    GROUP BY 2)
SELECT t.d, s.avg_a, s.b
FROM t  -- t is assumed to be an existing table with columns d and avg_ct
JOIN subquery2 AS s ON t.avg_ct = s.avg_a

For instance, querying like this should stay 1:1 in proportion to API requests: one consolidated query per call.

Such requests should ideally be non-blocking: kick the query off, return instantly, and notify the user later when the results are ready to view.
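
A minimal sketch of that fire-and-check pattern with the google-cloud-bigquery Python client; the SQL and job handling are illustrative, and default credentials are assumed:

from google.cloud import bigquery

client = bigquery.Client()

# Kick off the query; this returns a job handle without blocking
# on the results themselves.
job = client.query("SELECT 1 AS x")

# The web handler can respond immediately with the job id and let
# the client (or a background task) check back later.
print(job.job_id, job.state)

# Later: job.done() polls the state; job.result() blocks until rows
# are available, so call it only when you are ready to render.
if job.done():
    for row in job.result():
        print(row.x)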

Usually such setups cache query results for faster lookup, so they can show something immediately and later update the view through a shadow DOM/DocumentFragment swap.
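
On the BigQuery side, result caching is a one-flag affair in the Python client; a sketch, assuming the same client setup as above:

from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(use_query_cache=True)  # this is the default

job = client.query("SELECT 1 AS x", job_config=config)
rows = list(job.result())

# cache_hit reports whether BigQuery served this from its 24-hour
# query results cache instead of rescanning the data.
print(job.cache_hit)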

“When used with something like MapReduce or time-series analytics batch jobs, Hadoop can act like an offline warehousing engine.”

To be continued…
