#announcements
juanlu
10/20/2022, 10:06 AM
New blog post 🦆 Analyzing 4.6+ million mentions of climate change on Reddit using DuckDB https://www.orchest.io/blog/sql-on-python-part-1-the-simplicity-of-duckdb You can download the Orchest project and inspect the notebooks here https://github.com/astrojuanlu/orchest-duckdb happy hacking!
🦆 1
Rick Lamers
10/20/2022, 8:53 PM
This is a really good read if you like Python and SQL. What DuckDB accomplishes in terms of single node performance is nothing short of astonishing 🤯
Jerome Montino
10/21/2022, 3:52 AM
I know I just moved my entire ETL from Pandas to Polars, and I know the Polars team is working right now on something similar to DuckDB (streaming, etc.), but the barrier to entry with DuckDB is so much lower for non-coder analysts (I've been talking to our analysts, whom I'm also training in SQL and introducing to Python). Really excited for DuckDB.
🙌🏼 1
👏 1
Rick Lamers
10/21/2022, 10:17 AM
Polars and DuckDB are definitely the projects to watch here
10:17 AM
If Polars can cleanly handle larger than memory dataframes it will be a game changer
10:18 AM
Vertical scaling (large nodes, e.g. 128 GB RAM + 64 vCPU, 8 TB disk) with Polars will beat cluster approaches (Ray, Spark, Dask) due to sheer simplicity
juanlu
10/21/2022, 10:18 AM
> If Polars can cleanly handle larger than memory dataframes it will be a game changer
😎 https://twitter.com/RitchieVink/status/1579827660142051328
Jerome Montino
10/21/2022, 10:19 AM
Yeah that was the tweet by Ritchie that I read! He coded it right after the duckdb thing. So I guess soon enough we'll get streaming.
Rick Lamers
10/21/2022, 10:20 AM
> Tweet: more streaming queries
> Rick: If Polars can cleanly handle
I guess this is the essential bit. He's found a way to do it in principle, but will it be too limited (only for certain operations) or feel awkward? If he can nail it Polars is the king of the hill
Jerome Montino
10/21/2022, 10:20 AM
I agree on the simplicity approach, I myself process some geo data, not that big, but definitely much better if I can just throw my 128GB RAM at it rather than go Dask or something.
Rick Lamers
10/21/2022, 10:21 AM
> better if I can just throw my 128GB RAM at it rather than go Dask or something
Buy/rent a bigger machine >>>> a more complex setup with many moving parts
Jerome Montino
10/21/2022, 10:21 AM
I'm really hoping the awkward bit is solved. Not to make any further digs against Pandas but it's come a long way if we're talking about awkward.
Rick Lamers
10/21/2022, 10:21 AM
💯 rooting for Ritchie here 😄
Jerome Montino
10/21/2022, 10:22 AM
It's painful is what it is. Imagine having an i9, 128GB RAM, and a 1TB SSD dedicated only to crunching CSVs, then you feed it that 2GB geotemporal CSV file and get killed by Pandas. 🪦
Rick Lamers
10/21/2022, 10:23 AM
Yeah that feels very backwards. I had hopes for https://pypi.org/project/vaex/ too but I'm failing to like its API and it has had some rough edges lately (packaging/installation)
juanlu
10/21/2022, 10:25 AM
Yeah, I don't think Vaex can overcome the current excitement around Polars and DuckDB, to be honest. They chose HDF5 as the baseline for historical reasons, but Arrow was the true game changer here. In my view, it will remain a domain-specific tool for Astronomy.
Rick Lamers
10/21/2022, 10:38 AM
> remain a domain-specific tool for Astronomy
That seems likely.
> chose HDF5 as the baseline
This feels like an Achilles' heel; the Arrow format brings many modern improvements to data formats. It's like a tool based on Virtual Machines vs Containers. The paradigm that will carry the next 10 years is clear.
Jerome Montino
10/22/2022, 10:06 AM
Even Wes recognized that Arrow was going to be the game-changer in his blog a while back, which was a good read when I started doing heavier Python stuff. Generally you'd expect a generous 1:2 or 1:3 ratio between your dataset size and the RAM allocation, but Pandas was consuming 5-10 times the dataset size, as Wes pointed out himself. And with data stubbornly growing larger than is sometimes necessary, I'd hate for it to become a battle of RAM sticks (God forbid, we've already gone through a battle for GPUs against them crypto boys 😅). Ultimately, I think Arrow really takes the cake. It helps that Rust has a native implementation of it, which makes Polars that much more attractive.
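You can measure this ratio yourself with pandas' own tooling; a small illustrative sketch (made-up data, and using `category` dtype as one stand-in for a compact columnar representation like Arrow's):

```python
import pandas as pd

# Object-dtype string columns are where pandas' memory blowup bites:
# deep=True counts the boxed Python string objects themselves.
df = pd.DataFrame({"word": ["climate", "change", "reddit"] * 100_000})

obj_bytes = df["word"].memory_usage(deep=True)
cat_bytes = df["word"].astype("category").memory_usage(deep=True)

print(f"object dtype:   {obj_bytes / 1e6:.1f} MB")
print(f"category dtype: {cat_bytes / 1e6:.1f} MB")
```

Running `memory_usage(deep=True)` on your real frames is a quick way to see how far off the 1:2 or 1:3 budget a workload actually is.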
Rick Lamers
10/23/2022, 10:29 AM
Hard agree here 💯 The 5-10x RAM overhead isn't compatible with the general desire to do "right sizing": choosing the node size that fits your workload
9:32 AM
And so we have entered the endgame. 😅
juanlu
11/01/2022, 9:32 AM
🔥
9:32 AM
Polars + Out of core will be huge