Tree Browser - Massive Indefinitely Recursive Data presented through LiveView
First of all, please don't judge me or any of the participants in this project. We're not experts in building libraries for general consumption and have a varied skillset as well. We're brought together by having projects with large, complex, recursive data that we somehow need to present to our users via LiveView in a way that doesn't break the bank either for the users or our servers.
Nesting and nested data is everywhere; HTML and its DOM is fundamentally nested even though several of its primitives aren't. Even the most basic master-detail database structure is nested, and in HTML we've always had to choose at what level we show only the editable master, when we show the master with a summary of the detail, when we show a summary of the master and an editable list of the details, and when we show a single detail record with minimal reference back to the master it falls under. As the web matured, different practices came and went, and if there had been a clear winner in that regard it would long since have been reflected in the sample code Phoenix generates for us. But there is no one size fits all. That's nesting - we know the entire structure in advance and shaped the database accordingly, so now it's just a matter of shaping the user interface to match. Not always simple, but done a million times over.
Recursion is a different animal. When data is recursive, we do not know the whole structure in advance. Or rather, we know only that it recurses and where, but not how many times that will happen. It could theoretically be an infinite number of recursions. Each level of recursion looks identical to the previous one, except when there is no next level and the recursion ends for that particular branch, even though other branches might still carry on.
This project aims to create an abstraction of recursive data which encapsulates as many of its confounding characteristics as possible to end up with a simple abstraction that can be applied at will, even multiple times, overlapping with each other.
It's easy to draw up trivial test data for a recursive structure, but trivial test data is unsuitable for testing at scale. It's also fairly easy to generate large amounts of data if you make use of text and number generators, but it is extremely hard to validate that what you end up seeing on the screen is correct before and after you've manipulated it, for example by dragging and dropping large sections of data from one place to another. For that purpose, I built a data seeding algorithm specific to the project which is controlled by a few input parameters but results in a variety of datasets ranging from small, to enough to fill a database, and anywhere in between.
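To give a feel for what "controlled by a few input parameters" means, here is a deliberately simplified sketch of a parameterised recursive seeder. The module, schema and parameter names (Node, parent_id, depth, branching) are placeholders, not the project's actual seeding code, which also generates verifiable content for each node.

```elixir
defmodule TreeBrowser.Seeds do
  # Hypothetical sketch: the real seeder is project-specific and richer.
  alias TreeBrowser.{Repo, Node}

  @doc """
  Seeds a tree `depth` levels deep, giving every node `branching` children.
  Names encode the node's path so what appears on screen can be checked by eye.
  """
  def seed_tree(parent_id \\ nil, path \\ "root", depth \\ 4, branching \\ 5)

  def seed_tree(_parent_id, _path, 0, _branching), do: :ok

  def seed_tree(parent_id, path, depth, branching) do
    for i <- 1..branching do
      name = "#{path}.#{i}"
      node = Repo.insert!(%Node{name: name, parent_id: parent_id})
      seed_tree(node.id, name, depth - 1, branching)
    end

    :ok
  end
end
```

Even these modest defaults produce 5 + 25 + 125 + 625 = 780 nodes; pushing the two parameters up quickly yields enough data to fill a database.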
Since the aim is to end up with code that can be used more than once on the same page to reflect linked but separate recursive relationships, the data I generate is structured accordingly.
I've built the test application (and its data) around the notion of a disk filled with file archives, where each of the archives has its own folders and files within. At face value this might appear to be a problem already solved in the non-web world of desktop applications which we're trying (like so many have done before) to transplant into the browser domain, but there is a material difference involved here.
Most trees we're used to - file and email folders, outlines, tables of contents, etc - are displayed as tall and narrow strips next to the main content window. Whether it shows lines like in days gone by or just a simple mark indicating a collapsed or expanded state, the principle is the same: the "parent" is shown on top and its children as a list below it, indented one "unit". It's a neat compromise when space is at a premium and the tree is secondary information. But when the recursive data belongs directly in the primary content window, and the goal is to see as many "archives" and their contents together on the screen as there's room for, that side-on tree view is no longer the right tool for the job. The "trees" being browsed in this project should be conducive to being displayed in any orientation and using different HTML primitives.
There have been and still remain a great many challenges in this project. I've mentioned one or two, and met a few more.
Regardless of how the data is presented, it will most likely end up on a rectangular (area of a) display. Recursive data isn't rectangular by any stretch of the imagination. It usually spreads out from a root or a small set of main trunks to a great many branches. When loading data from a straight-up list of records (a table) using the SQL and Ecto primitives limit and offset, it is trivial to get a rectangular subset of records for a pagination routine. But when you're dealing with recursive data, snipping a rectangle out of the data with surgical precision (i.e. not accessing even one record more or less than what it takes to square up the selection) is not a trivial exercise. When the data gets really large, and in unpredictable places, any extra data, such as loading thousands of children when only a few are needed, can ruin performance rather quickly.
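For contrast, the flat case really is as simple as the paragraph above claims; a hedged sketch, assuming a Node schema and the usual Repo:

```elixir
defmodule TreeBrowser.FlatPagination do
  import Ecto.Query
  alias TreeBrowser.{Repo, Node}

  # Flat data: a rectangular "page" of rows is nothing more than limit + offset.
  def page(page_number, page_size \\ 50) do
    Node
    |> order_by(asc: :id)
    |> limit(^page_size)
    |> offset(^((page_number - 1) * page_size))
    |> Repo.all()
  end
end
```

There is no equivalent one-liner once the rectangle has to be cut out of a tree rather than a list.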
The algorithm was challenging enough on its own. It had to start from an anchor node and, given some window and buffer sizes, work out the rest of the data in all four directions around the anchor so that it fills the window but also has enough buffered on all sides to allow for immediate visual scrolling while additional data is being fetched from the server. Turning that elusive algorithm into code was even harder. It is in essence de-normalising data, so pure SQL was never going to be up to the task and it needed a procedural approach. Recursive Common Table Expressions are "gateways" between SQL and procedural code, so my first attempts, and first success, used those. But by the time that worked, the recursive CTE queries were massive and impossible to follow, let alone maintain and improve. They also leaned heavily on lateral joins, which I discovered were unsupported by Ecto at the time (they are now). I was also plagued by previous attempts at generating CTEs through Ecto's query interface when field aliases are involved, and I have plenty of those in my real project, so I wasn't keen on making all the effort of building CTEs only to not be able to use them after all. Passing parameters to CTEs, and even to views, became another sticky problem: you have to rely on the query optimiser to recognise which of the where-clause conditions address the initial query, which the recursive query and which the overall query, or else it goes ahead and calculates the entire recursive result for every record in the table before handing you the one you wanted. With data as large as I was testing with, that often meant having to stop and restart the database because it would never get back to me otherwise.
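For readers who haven't run into them, the basic shape of a recursive CTE expressed through Ecto looks like the sketch below, adapted to this project's vocabulary. Node, parent_id and the "tree" name are placeholders, and the real windowing queries were vastly larger than this:

```elixir
defmodule TreeBrowser.DescendantsQuery do
  import Ecto.Query
  alias TreeBrowser.{Repo, Node}

  # All descendants of one node via a recursive CTE: an initial query unioned
  # with a query that joins back onto the CTE itself.
  def descendants(root_id) do
    initial = from(n in Node, where: n.id == ^root_id)

    recursion = from(n in Node, join: t in "tree", on: n.parent_id == t.id)

    tree = union_all(initial, ^recursion)

    Node
    |> recursive_ctes(true)
    |> with_cte("tree", as: ^tree)
    |> join(:inner, [n], t in "tree", on: n.id == t.id)
    |> Repo.all()
  end
end
```

Multiply that by lateral joins and window calculations in four directions and the readability problem becomes apparent.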
In the end, I changed tack completely and took to implementing the procedural logic directly in PL/pgSQL, knowing full well that it is by far the least portable way of doing it, and the hardest to keep in check since the code for creating the functions and procedures has to go verbatim into a migration file. But I needed to find a workable solution first (and in this case workable also means reliably fast) before I could look at portability and sticking to a single programming environment. At the moment, my pagination routine calculates and loads a square window of records in a time between O(1) (constant) and O(log2 n) (the complexity of indexed lookups in the DB), meaning that it is for the most part decoupled from the size of the underlying database, which I consider worth the hassle of functions, procedures and views at this stage.
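Keeping those routines in check means the PL/pgSQL source travels inside migrations via execute/1. A minimal sketch, with a hypothetical tree_window/3 function standing in for the real, far more involved routines:

```elixir
defmodule TreeBrowser.Repo.Migrations.AddTreeWindowFunction do
  use Ecto.Migration

  # The PL/pgSQL source has to go into the migration verbatim.
  def up do
    execute("""
    CREATE OR REPLACE FUNCTION tree_window(anchor_id bigint, width int, height int)
    RETURNS SETOF nodes
    LANGUAGE plpgsql STABLE AS $$
    BEGIN
      -- Placeholder body: the real function walks outward from the anchor
      -- in all four directions until the window and its buffers are filled.
      RETURN QUERY SELECT * FROM nodes WHERE id = anchor_id;
    END;
    $$;
    """)
  end

  def down do
    execute("DROP FUNCTION IF EXISTS tree_window(bigint, int, int)")
  end
end
```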
Which brings me to the built-in support for infinite scroll. Not only was I never really able to get the example implementation from the documentation and guides to work smoothly, there was a major problem in that I wanted and needed both X and Y scrolling while the built-in support is exclusively for Y scrolling. So I took what I could recognise from the standard JS hooks and adapted it for my purposes.
For my way of presenting trees, scrolling in the X direction (horizontally) is closest to standard infinite scroll. Scrolling in the Y direction (vertically) in tree data would render what you're looking at invisibly small or unworkably large very quickly unless you zoom as you scroll. For that reason I chose to implement Y scrolling as a kind of coordinated zoom and pan, like an aeroplane using coordinated ailerons and rudder to stay in perfect balance. Each level you scroll "up" not only scrolls but zooms in, so that the new top visible row of nodes is shown at the same scale as the row now in second place was when it was at the top. The second row becomes wider, wide enough to fit centered beneath all its children, and every row below that gets the same treatment. Scrolling down it happens in reverse: scrolling and zooming out so that nodes whose parents are no longer shown at the top are shown at minimum size. The intention is that users will be able to change zoom levels manually as well, which will affect how big the nodes are shown and thus how many nodes there is room for vertically and horizontally, but that manual zoom will be a separate button and/or pinch-to-zoom interface, whereas the horizontal pan and the coordinated zoom and vertical pan respond to the X and Y scroll events.
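On the server side, either kind of scroll ends up as an event pushed from the hook. A minimal sketch of how that might be handled, assuming a hypothetical "tree_scroll" event plus shift_anchor/3 and load_window/3 helpers wrapping the pagination routine (none of these names are settled API):

```elixir
# Hypothetical server-side counterpart of the custom scroll hook.
def handle_event("tree_scroll", %{"dx" => dx, "dy" => dy}, socket) do
  %{anchor: anchor, window: window, buffer: buffer} = socket.assigns

  # Horizontal deltas pan within the current rows; vertical deltas re-anchor a
  # level up or down, which is what drives the coordinated zoom on the client.
  new_anchor = shift_anchor(anchor, dx, dy)

  {:noreply,
   assign(socket, anchor: new_anchor, page: load_window(new_anchor, window, buffer))}
end
```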
I've done more JS programming in my life and projects than I care for, which is why I prefer Phoenix and LiveView's approach of writing everything in a single language and letting others worry about the required JS, but in this case, writing some of my own alternative JS and adding some PL/pgSQL was a compromise I had to make, at least in the short term. With some hard work and good fortune perhaps both can be eradicated from the project in due course.
Right at the top of the list of objectives for this project is the overarching need to ensure that the vast and unruly datasets from which applications like mine draw their recursive data don't overburden the servers, the user's computer or the bandwidth of the networks that connect them. Phoenix and LiveView are fantastically efficient and elegant, but not to an unlimited extent. Everything has limits. One central concept I've been heading towards putting to the test is to split the data into two separate components: one part that is exclusively about how the data is structured hierarchically, and specifically how that hierarchy maps onto the rectangle it is being shown in, stored using IDs only, while the heavy-duty data with the actual content is handled as closely as possible to the way Phoenix and LiveView cater for. My idea, despite the protestations of some, is to see if I can make that second part work as it should and could using streams.
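Sketched in LiveView terms, the split might look something like this, where the structural part is a plain assign of IDs and positions and the node content travels as a stream. layout_for/1, nodes_in/1 and the assign and stream names are mine for illustration, not a settled design:

```elixir
# Hypothetical split between structural assigns and streamed node content.
def mount(_params, _session, socket) do
  window = initial_window()

  {:ok,
   socket
   # Lightweight structure: which node IDs sit where in the visible rectangle.
   |> assign(:layout, layout_for(window))
   # Heavyweight content: the actual records, sent and updated as a stream.
   |> stream(:nodes, nodes_in(window))}
end
```

The corresponding template would render the streamed nodes inside a container marked phx-update="stream", which lets the server avoid holding the full collection in memory between renders.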
The version in the initial upload doesn't use streams yet and uses only the variant of the pagination routine that assumes nothing is loaded yet and starts from scratch. I'm showing a rather small subset of data at a time right now while I'm working on the basic mechanisms, event handling and layout, so it's not that surprising that, even with a complete reload for every scroll event in any direction, reloading the assigns is fast enough to create a smooth scrolling experience without loading bars appearing. But that is nothing to go by once multi-user server loads and non-local network bandwidth and latencies come into play, so the current acceptable performance is merely a testament that the loading routine has enough grunt to not be the bottleneck in live scenarios.