Don’t Eat the Pickle!

This blog was created by Ataf Fazledin Ahamed, Software Security Engineer, OpenRefactory and edited by Charlie Bedard.


For the last month and a half, I have been reading the source code of some of the most popular open source software projects as part of my job. The task sounds simple. We scan the projects with various SAST (Static Application Security Testing) tools, and if any bugs or vulnerabilities are reported, we manually triage them to see whether they are real. We started with some of the most popular open source software written in Python. As a result, I had to go through many Python source files to understand what was happening inside.

What Did I Find?

One of the issues most frequently reported by the SAST tools involved the Python library called pickle. You may already be familiar with it, or may even have used it to save your machine learning models as binary files. If so, then as designed, the next time you want to run your model, you read the file and load the model from it. The pickle module does exactly what it was built to do: create a byte-stream representation of a Python object. The process of converting data into a storage-friendly format so that it can be read back later is called “serialization”. The opposite process, where the stored data is read and converted back into the original Python object, is called “de-serialization”. There is a particular type of vulnerability called “Unsafe/Insecure Deserialization”, where user-controlled data is de-serialized in an unsafe manner. Through it, an attacker can perform malicious activities by way of the serialized data.
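As a minimal sketch of the round trip described above, the snippet below serializes a small object and reads it back. The `model` dictionary is a made-up stand-in for a trained model, not something from the projects we scanned.

```python
import pickle

# Hypothetical stand-in for a trained model; any picklable object behaves the same.
model = {"name": "tiny-cnn", "weights": [0.12, -0.98, 3.5], "epochs": 10}

# Serialization: Python object -> bytes (pickle.dump would write them to a file).
blob = pickle.dumps(model)

# De-serialization: bytes -> a new, equal Python object.
restored = pickle.loads(blob)

print(type(blob).__name__)  # bytes
print(restored == model)    # True
```

With trusted data this is all pickle ever does. The trouble starts when the bytes come from someone else.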

What About It, Anyway?

Python’s pickle module converts a Python object into a byte-stream representation. The byte stream is then written to a file, often called a “pickle file”. Later, the pickle file is read, possibly by another user, and the Python object is loaded back into its original form. So if someone creates a malicious pickle file and sends it to someone else, the recipient can fall victim to an “Unsafe/Insecure Deserialization” attack simply by de-serializing the file.

How Does It Happen?

During our source code analysis, we found that many existing Python open source modules use the pickle library. However, not all of them were vulnerable. Whether the software was vulnerable depended on the attack surface, that is, on how the pickle library was used. Python classes can define a special method called __reduce__. When an object is pickled, pickle calls this method, and it returns a callable together with the arguments to call it with. That pair is stored in the byte stream, and when the data is de-serialized/unpickled, the Python interpreter calls the callable to rebuild the object. A hostile attacker could override the __reduce__ method so that unpickling tells the interpreter to run an rm command that removes our /home/ directory. Now let’s see two examples of how this vulnerability can be exploited.
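The mechanism can be sketched with a deliberately harmless payload. The class name `Payload` is made up for illustration, and `os.getcwd` stands in for whatever an attacker would really run (for example `os.system` with a destructive command).

```python
import os
import pickle


class Payload:
    """Illustrative class; the name is an assumption of this sketch."""

    def __reduce__(self):
        # Called while *pickling*. The returned (callable, args) pair is stored
        # in the byte stream, and the callable is invoked while *unpickling*.
        # Harmless stand-in: os.getcwd with no arguments.
        return (os.getcwd, ())


malicious_bytes = pickle.dumps(Payload())

# The victim only calls pickle.loads, yet os.getcwd() runs on their machine.
result = pickle.loads(malicious_bytes)

print(isinstance(result, Payload))  # False: we got the callable's return value
print(result == os.getcwd())        # True
```

Note that the victim never needs the `Payload` class itself. The byte stream only names the callable to import and the arguments to pass to it.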

Scenario #1

Let’s say I am taking a class: “CSE 472: Machine Learning Sessional”. As in many of my other classes, I was asked to submit my trained CNN (Convolutional Neural Network) model as a “pickle file”, upload it to the cloud (Google Drive, OneDrive, etc.), and then share the link in the Moodle LMS along with any necessary Python script(s). Let’s say that, for some reason, I was so busy with my department picnic that I forgot to complete the assignment. So I searched Google and GitHub, found a pickle file for a trained CNN model, and downloaded it.
Before submission, I wrote a short script that loads the file with pickle to check whether the model works correctly. After running the script, I couldn’t find my home directory!

The file was basically a very simple virus. When I loaded it with the pickle library, it deleted my home directory. Sigh.
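One way to look at a downloaded pickle before loading it is to list its opcodes with the standard `pickletools` module, which parses the byte stream without executing anything. The sketch below is a rough heuristic, not a security guarantee: the opcode set and the helper name `suspicious_opcodes` are choices made for this example, and legitimate pickles of class instances use the same opcodes, so a hit only means "needs a closer look".

```python
import os
import pickle
import pickletools

# Opcodes that import a name or call something during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def suspicious_opcodes(data):
    """Return the names of import/call opcodes found in a pickle byte string."""
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in SUSPICIOUS_OPCODES
    ]


class Payload:
    """Harmless stand-in for a malicious object (hypothetical name)."""

    def __reduce__(self):
        return (os.getcwd, ())


plain = pickle.dumps({"weights": [0.1, 0.2], "epochs": 3})
crafted = pickle.dumps(Payload())

print(suspicious_opcodes(plain))    # no import/call opcodes for plain data
print(suspicious_opcodes(crafted))  # includes an import opcode and REDUCE
```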

Scenario #2

Suppose that, in the current Computer Science teaching session, the first and second-year students are struggling to cope with the curriculum. The requirements for the Java-based networking class have been changed so that they can now do it using Python. The networking environment, of course, needs a server and multiple clients. One of the basic ways of working with a server is by using sockets.

Let’s say your server listens on port 6666. Anyone on the same computer can connect to it through localhost. If the computer is connected to a local network and the server is bound to the network interface rather than only to localhost, anyone on another computer can also connect using the server machine’s local IP address. Let’s say your server unpickles whatever data arrives, and now I want to do to you what happened to me in the first scenario.
I write a script that sends a malicious pickle payload to your computer. If your server is running and accepts the connection, your home directory will be removed.
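The original listings are not reproduced here, so the following is only a sketch of what such a server and attack script might look like. The names `VulnerableHandler`, `Payload`, and `send_pickle` are made up, the payload is the harmless `os.getcwd` instead of a delete command, and the server binds to an OS-chosen port on 127.0.0.1 rather than port 6666 so the demo cannot clash with anything else.

```python
import os
import pickle
import socket
import socketserver
import threading


class VulnerableHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # The mistake: unpickling bytes that arrive from the network.
        obj = pickle.load(self.rfile)
        self.wfile.write(repr(obj).encode("utf-8"))


class Payload:
    def __reduce__(self):
        # Harmless stand-in for the attacker's command.
        return (os.getcwd, ())


def send_pickle(host, port, obj):
    """Attacker side: send a pickled object and return the server's reply."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(pickle.dumps(obj))
        conn.shutdown(socket.SHUT_WR)
        with conn.makefile("rb") as reader:
            return reader.read().decode("utf-8")


server = socketserver.TCPServer(("127.0.0.1", 0), VulnerableHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# The reply is the server's working directory: the payload ran on the server.
reply = send_pickle(host, port, Payload())
print(reply)

server.shutdown()
server.server_close()
```

The client never sends a command in any protocol the server understands. Calling `pickle.load` on the incoming bytes is enough to run the attacker's chosen callable inside the server process.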

Now, Python’s socketserver module is used to implement server-side sockets. Its stream request handlers expose two file-like objects called rfile and wfile. Data arriving on the server socket is read from the rfile object, and any data to be sent is written to the wfile object.

The thing is, just because socketserver exposes rfile and wfile as file-like objects, it doesn’t mean that we should treat them like files on disk. To send and receive complex data, you used the pickle module blindly, and that’s where you made the mistake. You could have, and should have, used the other methods provided by Python’s file interface (read, write, and their byte-level variants) to pass data through the socket.
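As one possible alternative, the handler can read a line of bytes from rfile and parse it as JSON, which yields only plain data and never calls anything. This is a sketch under assumptions: the one-message-per-line protocol, the size cap, and the names `JsonLineHandler` and `exchange` are all invented for this example.

```python
import json
import socket
import socketserver
import threading

MAX_LINE_BYTES = 64 * 1024  # arbitrary cap chosen for this sketch


class JsonLineHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read raw bytes with the ordinary file interface, then parse as data.
        line = self.rfile.readline(MAX_LINE_BYTES)
        try:
            message = json.loads(line.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            reply = {"ok": False, "error": "invalid JSON"}
        else:
            reply = {"ok": True, "echo": message}
        self.wfile.write(json.dumps(reply).encode("utf-8") + b"\n")


def exchange(host, port, raw_line):
    """Client side: send one line of bytes and return the decoded JSON reply."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(raw_line + b"\n")
        with conn.makefile("rb") as reader:
            return json.loads(reader.readline())


server = socketserver.TCPServer(("127.0.0.1", 0), JsonLineHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

good = exchange(host, port, json.dumps({"command": "status"}).encode("utf-8"))
bad = exchange(host, port, b"\x80\x04 not json at all")
print(good)
print(bad)

server.shutdown()
server.server_close()
```

Bytes that are not valid JSON, including a pickle stream, are rejected with an error reply instead of being executed.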

Is This All?

You might be wondering whether the main weakness behind these two attacks was the pickle module or the __reduce__ method. The answer is neither. Both pickle and the __reduce__ method did what they’re intended to do.

The main reason these two attacks succeeded was the way the pickle library was used. In the first scenario, I shouldn’t have de-serialized/unpickled untrusted data from the internet. In the second scenario, accepting connections through sockets is the basic functionality of a server, but the socket receiver shouldn’t have used the pickle library to handle the incoming data.

What Can We Do?

Actually, there’s not much to be done with the pickle library itself. If your Python object is simple (e.g. it can be represented using dictionaries and lists), consider using the JSON, YAML, or TOML format to save it to a file. If your data is complex, such as the weights of a neural network, then you can continue using pickle. But you have to be sure the files are safe before using pickle files from untrusted sources.
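If pickle has to stay, one documented way to narrow the risk is to subclass `pickle.Unpickler` and override `find_class` so that only an allow-list of names can be imported, following the “Restricting Globals” pattern in the Python documentation. The sketch below uses a tiny allow-list chosen for illustration. A real model file would need its own classes added, and the approach reduces the attack surface rather than making untrusted pickles safe.

```python
import builtins
import io
import os
import pickle

# Illustrative allow-list; extend it only with names you have reviewed.
ALLOWED_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks to import a global name.
        if module == "builtins" and name in ALLOWED_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Harmless stand-in for a malicious object (hypothetical name)."""

    def __reduce__(self):
        return (os.getcwd, ())


safe = restricted_loads(pickle.dumps({"weights": [1.0, 2.0], "ids": {1, 2}}))

try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(safe)
print(blocked)  # True
```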

If that’s not an option, try loading the files in a sandbox environment first. And when implementing servers, don’t use pickle, because it’s easily attacked. Pickle was never meant to be used this way. Use normal file operations to read and write. It’s your responsibility to use this library carefully; otherwise it can lead to something dangerous and disastrous.
