I am curious - from a pure pipeline workflow POV, how is this conceptually unique and better?
For example, I could take each of the files in your quickstart, combine them into one large notebook, and run that with the same outcome, no? A Jupyter notebook is all batch and runs sequentially too.
Your homepage says your main features are: 1. a visual pipeline editor (sure, node-based is better UX, but it doesn't change the workflow itself), 2. code in notebooks (same as regular Jupyter), 3. jobs (you can already run ipynb files with shell scripts and cron), and 4. environments (not much need for local usage). Anything major I'm missing?
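(For concreteness, the monolithic alternative described above amounts to the sketch below: a single notebook executing strictly top to bottom. Step names and functions are purely hypothetical.)

```python
# Toy sketch of the "one large sequential notebook" workflow: every step
# runs in order, one after another, with no branching or parallel steps.

def get_data():
    return "raw"

def clean_data(x):
    return x + "->clean"

def train_model(x):
    return x + "->model"

# A monolithic notebook is effectively this: top-to-bottom execution.
result = train_model(clean_data(get_data()))
print(result)  # raw->clean->model
```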
10/03/2021, 8:11 PM
The key points of differentiation are:
• Graphical editor/viewer for pipelines
• Native integration of notebooks as pipeline steps to avoid monolithic notebooks
• An abstraction over containers that keeps dependency management simple yet provides reliable, isolated, deployable units (containers) that can be deployed flexibly in clusters (transparently to the user)
The point of these features is not so much to enable fundamentally new things, but to integrate that experience into a shared (multi-user) environment that standardizes many of the decisions that, in a manual setup, each data scientist/ML engineer/data analyst might make differently.
Another thing worth mentioning about jobs is that they snapshot the code and configuration (environment variables, parameters), which makes them inspectable by you and your collaborators.
Not mentioned on the website, but a pretty big feature is the concept of services, which expands the pipeline conceptual model with long-running services such as a Streamlit app server, all integrated into a single declarative file that can be versioned.
Another use of services is to integrate with more complex authentication flows, such as OAuth 2.0, that require endpoints to be available. That helps typical data pipelines that integrate with other SaaS tooling.
10/05/2021, 11:03 AM
Thanks for clarifying the points above. I had similar thoughts to Kay. Another question: why use this platform over building a pipeline with AWS Lambda? A lot of the processes there are also standardised, and you get reliable and isolated deployable units. Pardon me if it's a noob question, I'm still early into Data Science.
10/05/2021, 1:22 PM
That is a great question. I think AWS Lambda can serve you well if you have relatively few steps that have a well known specification ahead of time. Orchest is more suited for iterative development workflows where you might want to explore many different ideas over the course of a couple of weeks. In that sense it’s more Data Science than Data Engineering.
In addition, the interaction model is pretty different. Orchest has a more “workbench/IDE/GUI” style workflow vs. a YAML config/CLI driven one.
10/06/2021, 12:07 PM
Thanks for clarifying, Rick. Could you please expand on the concept of services a bit more, maybe through an example? A broader question: how do you ensure people use your tool as opposed to leveraging cloud functions once they gain confidence in what they are doing? I'm trying to weigh learning two tools (Orchest and Lambdas) against learning just one. Hope that makes sense and you can see where I'm coming from.
10/06/2021, 3:41 PM
Totally understand. One perspective on learning cloud functions vs. a cloud-agnostic tool is how confident you are in building directly on top of a cloud-specific set of abstractions. It's similar to learning generic K8s vs. AWS Fargate, and it obviously has strong implications for lock-in. Orchest takes the perspective of building OSS that can run anywhere: on-prem, on any cloud vendor, or on your own device.
Services can basically be thought of as the ability to run containerized (Dockerized) applications as part of your pipeline, whether it's Streamlit, a generic HTTP server, a MongoDB server, or TensorBoard. It allows you to take any containerized application and integrate it directly with your data pipelines.
Makes sense! I like the point on vendor lock-in very much. Did you personally find yourself struggling with cloud-specific abstractions, and did the friction hinder iteration? Is that the motivation for starting Orchest?
Lastly, a friend who works at a robotics startup in SF mentioned that their team does everything with Databricks notebooks and PySpark. Orchest provides subtle additional benefits like services. However, barring this and the ability to create pipelines with a graphical editor, the other features you outline above seem in large part to be addressed by Databricks as well. Am I right in understanding, then, that the beauty of Orchest stems from the additional concept of services?
To be transparent (since the above can feel quite blunt, like an attack on Orchest, which it is not), I'm merely trying to ascertain industry solutions and why people have chosen some over others. The stars on GitHub evidence the platform's success.
10/07/2021, 12:54 PM
No worries, these are great questions. It helps everyone build a more informed perspective.
I think a key difference between Databricks and Orchest is the focus of Databricks on Spark and the JVM ecosystem.
We generally recommend that users use modern PyData-oriented alternatives to Spark, such as Dask or Ray, for scale-out processing.
Another significant difference is that the Databricks platform isn't open source, so you can't self-host it in the environments you want. It's only available as a service at various cloud providers.
I think we take a pretty different approach to orchestration with Orchest as well. While Databricks does support jobs and visual representations of those jobs as DAGs, the process of creating jobs with interdependent steps is fundamentally different between the two tools.
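As a rough illustration of what a pipeline of interdependent steps means (a toy sketch, not Orchest's or Databricks' actual engine): each step becomes runnable only once all of its upstream dependencies have finished, and a topological order of the DAG gives a valid execution sequence. The step names here are hypothetical.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline DAG: each key lists its upstream dependencies.
deps = {
    "clean-data": {"get-data"},
    "featurize": {"clean-data"},
    "train-model": {"featurize"},
    "evaluate": {"train-model"},
}

# A valid execution order: every step appears after its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['get-data', 'clean-data', 'featurize', 'train-model', 'evaluate']
```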
We find that our approach is a bit more accessible and amenable to quick iteration. This is particularly helpful during development.
Note, if anyone feels that I’m misrepresenting anything about Databricks please do correct me. I only have a surface level understanding of the intricacies of their solution (which is always evolving too, of course).