ChatGPT can basically explain Pachyderm

What AI can tell us about open source software

Jimmy Whitaker
Dec 12, 2022 · 5 min read
Cover by author and DALL·E 2

ChatGPT is impressive.

You don’t have to look very far to find even nontechnical people doing amazing things with it, which is a testament to the tool itself.

Here, I want to share a little journey I took with ChatGPT to see what it knows about data processing, and more specifically about Pachyderm, an open source solution for data processing.

Data Processing

Let’s start with some questions about data processing and see what we get. We’ll just ask for some general examples and see what ChatGPT knows.

This is pretty straightforward. It’s a structured data transform that operates on CSV files, and it even gives me a nice description of what it does.
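
To make that concrete, here’s a minimal sketch of that kind of transform: a script that reads a CSV, cleans up one column, and writes a new CSV. The file names and the "name" column are placeholders of my own, not what ChatGPT produced.

import csv

# A minimal sketch of a structured CSV transform: read rows, normalize one
# column, and write the result back out. File and column names are placeholders.
def transform_csv(input_path, output_path):
    with open(input_path, newline="") as infile, open(output_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example transformation: trim whitespace and lowercase the "name" column
            row["name"] = row["name"].strip().lower()
            writer.writerow(row)

if __name__ == "__main__":
    transform_csv("input.csv", "output.csv")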

In Pachyderm, data processing scripts can be run by packaging them up as Docker containers. Let’s see what it thinks about containerizing our data processing code.

Overall, this looks pretty good. It’s a reasonable Dockerfile that shows how to package up a Python script, and it even roughly explains how to build the container along with the benefits of doing so.
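
As a rough illustration, a Dockerfile along those lines, assuming a transform.py script and a requirements.txt for its dependencies (both placeholder names), looks something like this:

# Sketch of a Dockerfile that packages a Python transform script.
# The base image and file names are assumptions, not ChatGPT's exact output.
FROM python:3.10-slim

WORKDIR /app

# Install the script's dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the data processing script into the image
COPY transform.py .

# Run the transform when the container starts
ENTRYPOINT ["python", "transform.py"]

Building it with docker build -t my-transform:latest . gives us an image that a Pachyderm pipeline can reference.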

Does ChatGPT know about Pachyderm?

Now for the real test. How much does ChatGPT know about Pachyderm? Because Pachyderm is an open source project and a lot of details are readily available, it’s likely that ChatGPT knows something at least. But let’s see how much.

Let’s ask ChatGPT to create a pipeline definition to run our data processing code.

At first glance, this doesn’t seem too crazy. Let’s look at the explanation.

It looks like ChatGPT definitely knows some things about Pachyderm.

  • It specifically mentions data repositories.
  • It gets some of the details about our pipeline spec wrong (e.g. the output field, and the idea that our repos have types).
  • It mentions /pfs in the transformation code (albeit in the path to the transformation file).
  • It knows that Pachyderm pipelines are Docker containers.
  • It also knows that output data is written to data repositories (though it gets the name wrong).
  • It knows that pachctl is the CLI to create data repositories and pipelines.
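
For comparison, a minimal hand-written Pachyderm pipeline spec looks roughly like this (the repo, pipeline, image, and command are placeholders tied to the earlier sketches); note that there is no explicit output field, and the input repo is referenced by name under a pfs block:

{
  "pipeline": {
    "name": "transform"
  },
  "input": {
    "pfs": {
      "repo": "raw_data",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "my-transform:latest",
    "cmd": ["python", "/app/transform.py"]
  }
}

The important part is the shape: a pipeline name, an input repository with a glob pattern, and a transform that points at a container image.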

Pachyderm Data

Let’s find out what it actually knows about data repositories.

It gets a little repetitive, but we can tell it knows some basic concepts, even if they’re not articulated very clearly. This is likely an area where we want to clarify our documentation to make it super simple (my assumption is that the source dataset includes our docs and, to some degree, our source code).
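
For reference, creating and filling a data repository is done with pachctl; the repo and file names below are placeholders, and exact flags can vary slightly between Pachyderm versions:

# Create a data repository and add a file to it
pachctl create repo raw_data
pachctl put file raw_data@master:/input.csv -f input.csv

# List repositories and the files in a commit
pachctl list repo
pachctl list file raw_data@master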

Now let’s ask a very specific, leading question about the Pachyderm file system.

This isn’t exactly right.

It doesn’t seem to understand that a pipeline’s output repository is created automatically with the same name as the pipeline. But everything else is pretty accurate. It seems to understand the file system convention, or at least is able to summarize it.
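
For context, the convention is that each input repository is mounted inside the job’s container under /pfs/<repo name>, and anything written to /pfs/out ends up in the pipeline’s output repository. A minimal sketch of pipeline code following that convention (raw_data is a placeholder repo name):

import os
import shutil

# Input repos appear under /pfs/<repo name>; anything written to /pfs/out
# lands in the pipeline's automatically created output repository.
INPUT_DIR = "/pfs/raw_data"  # placeholder input repo name
OUTPUT_DIR = "/pfs/out"

for filename in os.listdir(INPUT_DIR):
    src = os.path.join(INPUT_DIR, filename)
    dst = os.path.join(OUTPUT_DIR, filename)
    # A real pipeline would transform the data here; this sketch just copies it through.
    shutil.copy(src, dst)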

Let’s try to correct ChatGPT and see what it does.

It seems happy with the correction and is much closer to understanding the Pachyderm file system. However, it now seems to think that the input repository is also created automatically, so it’s shifted a little too far in the opposite direction. Maybe it’s a little too suggestible at this stage.

Let’s see what happens if I let it try again with this correction. Here I just re-ran the same question to see what type of variety we get.

This time we got a little variation, and the understanding is correct: the output goes to /pfs/out, and the input data comes from a data repository that we create.
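
That matches how it works in practice: the input repository is created and filled by hand (as sketched earlier), while the output repository comes for free when the pipeline is created. Assuming the placeholder spec from before is saved as transform.json:

# Creating the pipeline automatically creates an output repository named
# after the pipeline; there is no separate step to create it by hand.
pachctl create pipeline -f transform.json

# Once a job has run, the results land in that output repository
pachctl list file transform@master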

Pachyderm Pipeline Specs

Let’s go back to the pipeline specification and see if the corrections are incorporated beyond the single chat response.

It still knows something about the data repositories, but what happens with the pipeline definition…

Hmm… not too different from the original.

The pipeline definition itself hasn’t changed much from the original, but the additional comment in the output portion contains a note showing it has “learned something”:

# (which will be automatically created with the same name as the pipeline)

I’m impressed.

Let’s try to correct it again. I don’t have a YAML version of a pipeline handy, so let’s see what it can do with JSON while still asking for YAML.

This looks much closer.

Not too bad. Let’s try to correct it once again to see if we can get it a little closer.

I’m giving it a subtly different pipeline with corrected indentation, along with an explanation of what’s changed.

This all looks right now! It changed the indentation, and it looks like it picked up the glob pattern issue as well. It seems happy to learn new things.
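
For reference, the YAML equivalent of the placeholder JSON spec from earlier looks roughly like this (names and the glob pattern are still stand-ins):

# Roughly the YAML equivalent of the earlier JSON sketch; pipeline, repo,
# and image names are placeholders.
pipeline:
  name: transform
input:
  pfs:
    repo: raw_data
    glob: "/*"
transform:
  image: my-transform:latest
  cmd:
    - python
    - /app/transform.py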

Conclusion

ChatGPT knows a good bit about Pachyderm. It’s able to explain how the system works in general, but it does get a few details wrong. Honestly, this is really impressive overall. I was especially interested in how it took criticism and corrected its understanding; however, it’s unclear whether this actually updates the underlying GPT model or is just temporary state within the current chat. I suspect the latter, for now.
