Sandboxing Python dependencies in your code

Published in

InfoSec Write-ups

8 min read · Jul 10, 2022

Running code from an untrusted source safely is still an unsolved problem,
especially in dynamic languages like Python and JavaScript.
I will begin with two unanswered questions:

If you import requests for HTTP, why should requests be able to open a terminal and switch to sudo?
If you import logging, why should it be able to use the network (or LDAP, as in Log4Shell) if you only need to write files to a specific directory?

This is the story of how I wrote a sandbox for Python imports,
aiming for a production-ready solution and testing it on different use cases.

TL;DR

The solution looks as follows. GitHub link at the bottom.
How can pickle be exploited in your 3rd-party packages?

>>> import pickle
>>> class Demo:
...     def __reduce__(self):
...         return (eval, ("__import__('os').system('echo Exploited!')",))
... 
>>> pickle.dumps(Demo())
b"\x80\x04\x95F\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x04eval\x94\x93\x94\x8c*__import__('os').system('echo Exploited!')\x94\x85\x94R\x94."
>>> pickle.loads(b"\x80\x04\x95F\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x04eval\x94\x93\x94\x8c*__import__('os').system('echo Exploited!')\x94\x85\x94R\x94.")
Exploited!
0
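
To see why the call above is dangerous without actually running it, the payload can be disassembled instead of loaded. The sketch below is only an illustration (the helper name `describe_pickle` is made up here and is not part of pickle or secimport). It uses the standard-library `pickletools.genops` to list the opcodes and string arguments, which shows that the byte stream names `builtins` and `eval` and then applies them through a `REDUCE` opcode.

```python
import pickle
import pickletools


class Demo:
    # Same payload as above; pickling only records the callable, it does not run it.
    def __reduce__(self):
        return (eval, ("__import__('os').system('echo Exploited!')",))


def describe_pickle(data):
    """Return (opcode names, string arguments) without unpickling anything."""
    opcodes, strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        opcodes.append(opcode.name)
        if isinstance(arg, str):
            strings.append(arg)
    return opcodes, strings


payload = pickle.dumps(Demo(), protocol=4)
opcodes, strings = describe_pickle(payload)
print(opcodes)
print(strings)
```

Inspecting opcodes like this is a way to understand a payload, not a defense: anything that reaches `pickle.loads` still executes.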

With secimport, you control how such actions are handled. Here, the process is killed when the payload runs:

In [1]: import secimport
In [2]: pickle = secimport.secure_import("pickle")
In [3]: pickle.loads(b"\x80\x04\x95F\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x04eval\x94\x93\x94\x8c*__import__('os').system('echo Exploited!')\x94\x85\x94R\x94.")
[1]    28027 killed     ipython

Some AI art by min-DALLE for “secimport”.

Adding a new library to our code might be challenging for several reasons:

Developers cannot know exactly what to expect from a package.
You import it, yet the code inside the package can do whatever it wants with your environment without you knowing.
It is hard to tell the minimal privileges the library needs to run, just for your use case.
What is the set of system calls one should allow so that it functions properly, and nothing more?

We trust the open-source community, yet the maintainers of the packages we use are individuals. There’s a big risk when versions are not locked with ==: packages update silently in your CI, and new code runs inside them without you knowing.

On a daily basis, someone anonymously uploads a malicious wheel to PyPI.
Sometimes it’s a package you already use.

One morning, someone with a pip package used worldwide woke up hating Russia. To protest, he deleted the hard drives of installers in Russia and Belarus upon installation of his Python package, by checking the IP and running os.system upon import (or something like that). History repeats itself, so it is just a matter of time before it happens again… This is too easy.

Needless to say, you cannot review the entire codebase every time you add a new open-source package (but Google and Apple can).

Today’s solutions are mostly out-of-band and per process.

If a given module includes vulnerabilities, malicious logic, or a big codebase for a very small task, we have to confine it somehow.
The host (computer) should stay untouched by your application or its third-party apps.

So what can I do today?

Create a seccomp (Linux secure computing mode) profile for the entire application. If your program behaves in an unexpected way, be it due to a 3rd-party library or your own code, the event is logged or the process is killed immediately.
e.g. Red Hat, SELinux, seccomp, all that great stuff.
Static code analysis/security scanners in your CI/CD or IDE.
You look for outdated packages or you analyze the code.
e.g. Snyk, Checkmarx, Clair.
Not using open source or 3rd party software in that codebase at all.
That’s unfeasible at scale… but I know a few small startups that don’t use open source in their code at all. That’s hardcore: imagine implementing all of your logic from scratch.
WASI sandboxing: WebAssembly is great!
But, unlike Rust/Go/.NET, Python does not compile to WebAssembly, so this solution is irrelevant to us at the moment (RustPython, anyone?).
Running the software in a VM or Hypervisor (different from containers).
Google developed a sandbox for containers, called gVisor.
gVisor is a kind of VM that intercepts each syscall your application makes.
For this project, Google reimplemented much of the Linux kernel interface from scratch, in Go.
The sandbox binary (that runs containers) is around 17MB.
Each syscall is handled by gVisor and only then passed on to the host, while advanced policies are enforced. Pretty impressive, right?

Yet, running applications inside VMs often results in performance degradation. I have personally experienced (great) sandboxes that slowed down the application by up to 50%.

Google uses gVisor to isolate containers in the Google App Engine.

Just like Google, I assume open-source packages will always be an attack surface for your application. But that does not mean you should change the way you code. Each module should have its own restrictions, defined by the developer or chosen from some template.

How to constrain modules in our python process

Instead of having a unified profile for the application, I wanted to enable developers to limit any package in their code with a given scope, at import/compile time.
Just like SELinux grants a whitelisted scope for a process in the Linux kernel at execution time, I want to enable developers to control any package in their code, in production, under specific constraints.
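
To make the per-module idea concrete, here is a minimal in-process sketch. It is not how secimport works (secimport relies on dtrace, described later). It swaps in CPython's audit hooks (`sys.addaudithook`, Python 3.8+) and inspects the call stack to decide whether a sensitive event came from a restricted module. The module name `untrusted_pkg` and the set of blocked events are assumptions made for this illustration. An in-process hook like this only sees the events CPython raises and can be bypassed by native code, which is one reason to enforce restrictions from outside the interpreter.

```python
import sys
import types

# Audit events to deny, and the (hypothetical) modules they are denied for.
BLOCKED_EVENTS = {"os.system", "subprocess.Popen", "socket.connect"}
RESTRICTED_MODULES = {"untrusted_pkg"}


def _called_from_restricted_module():
    """Walk the Python call stack looking for a frame owned by a restricted module."""
    frame = sys._getframe(1)
    while frame is not None:
        module_name = frame.f_globals.get("__name__", "")
        if module_name.split(".")[0] in RESTRICTED_MODULES:
            return True
        frame = frame.f_back
    return False


def audit_hook(event, args):
    # Raising from an audit hook aborts the operation that triggered the event.
    if event in BLOCKED_EVENTS and _called_from_restricted_module():
        raise PermissionError(f"{event} is not allowed from a restricted module")


sys.addaudithook(audit_hook)

# Stand-in for a third-party package that tries to run a shell command.
untrusted_pkg = types.ModuleType("untrusted_pkg")
exec("import os\ndef run(cmd):\n    return os.system(cmd)\n", untrusted_pkg.__dict__)

try:
    untrusted_pkg.run("echo Exploited!")
    blocked = False
except PermissionError:
    blocked = True
print("blocked:", blocked)
```

Note that audit hooks cannot be removed once installed, so a sketch like this affects the whole interpreter for the rest of its lifetime.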

Log4Shell, anyone? Why should Log4J be able to open LDAP connections by default?

Confining Python Modules — MVP

I wanted a tool that can log each python call and each syscall.
I assume nothing malicious between us and the kernel can affect visibility.
That sounds hard, and keeping the performance the same sounds even harder.
Tracing/Sandboxing to this level usually affects the runtime performance.

Implementing such a tool can be done using technologies like:

eBPF
DTrace
any other .*trace tool.

I know eBPF is common these days, but we need something cross-platform, with a gentler learning curve and an easy setup for hands-on evaluation.
We cannot assume every Python dev knows C for eBPF. I assume dtrace can be replaced with eBPF easily after the MVP (bpftrace, which builds on eBPF, is sometimes called “DTrace 2.0”).

After reading enough blogs and trying different tools,
I understood that dtrace was the right thing to start with for this use case. The way I see it, eBPF requires a kernel compiled in a certain way (it is not built into every Linux out of the box), while dtrace does not.
dtrace also works on Mac and Windows, which makes any dtrace-based solution available to more users.
dtrace also supports destructive actions, meaning a dscript probe that monitors the python process can kill it. That’s exactly what I want.

Look at the following image: instead of containers we have Python modules, and instead of SELinux we have dtrace, probing the kernel.

1. Running a python process:

2. Run a dtrace process in the background:

3. Run whatever code you would like to trace:

4. dtrace output:

Amazing! We can see that the posix_spawn syscall was called (4th row).

In this example, I’ve used “dtrace -n” to pass a probe to dtrace.
I then expanded this dtrace command into a dscript (a script file in dtrace’s D language), which is a way to store these hooks and program the probes to do what we want.

Building on that example, I wrote a dscript program (script file) that kills a process when a specific python module calls a `spawn` syscall.

A dscript program that kills a python process when `spawn` syscall is called by a specific python module.

I implemented it efficiently using something called Associative Arrays in the dscript language.
I implemented a python wrapper for the variables I wanted in the script, and I created a template for the dtrace file content.
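
The templating idea can be sketched roughly as follows. This is not the template secimport ships: the D program, the function name `render_sandbox`, and its parameters are illustrative assumptions. For brevity it tracks state with a thread-local counter rather than associative arrays. It assumes a CPython build with dtrace support (which exposes the `python` provider's `function-entry` and `function-return` probes, with the source filename in `arg0`), and it has not been validated against a live dtrace here.

```python
from string import Template

# Illustrative D program, NOT the template secimport ships.
# "$$target" renders as dtrace's own $target macro (the traced PID);
# "$path_fragment" and "$syscall" are filled in by Python.
SANDBOX_TEMPLATE = Template("""\
#pragma D option destructive
#pragma D option quiet

/* Track, per thread, how deep we are inside the restricted module. */
python$$target:::function-entry
/strstr(copyinstr(arg0), "$path_fragment") != NULL/
{
    self->depth++;
}

python$$target:::function-return
/self->depth > 0 && strstr(copyinstr(arg0), "$path_fragment") != NULL/
{
    self->depth--;
}

/* Kill the process if the syscall happens while inside that module. */
syscall::$syscall:entry
/pid == $$target && self->depth > 0/
{
    printf("%s called from $path_fragment, killing pid %d\\n", probefunc, pid);
    raise(9);
}
""")


def render_sandbox(path_fragment, syscall="posix_spawn"):
    """Fill the template; reject characters that would break out of the D string literals."""
    for value in (path_fragment, syscall):
        if not value or any(ch in value for ch in '"\\\n'):
            raise ValueError(f"unsafe template value: {value!r}")
    return SANDBOX_TEMPLATE.substitute(path_fragment=path_fragment, syscall=syscall)


script = render_sandbox("subprocess.py")
print(script)
```

A rendered file like this would typically be handed to dtrace with `-s <file>` and `-p <pid>`, run with root privileges.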

Then, I wrote “secimport”!

Example for the MVP version of “secure import”, or “secimport”.

secimport is a python package that can be used to:

Confine/Restrict specific python modules inside your production environment.
Open Source, 3rd party from untrusted sources.
Audit the flow of your python application at the user-space/os/kernel level.
Run an entire python application under a unified configuration
Kind of a seccomp for python modules. Cross-platform.

Networking Example

>>> import requests
>>> requests.get('https://google.com')
<Response [200]>
  

>>> from secimport import secure_import
>>> requests = secure_import('requests', allow_networking=False)

# The next call should kill the process,
# because we disallowed networking for the requests module.
>>> requests.get('https://google.com')
[1]    86664 killed

Shell Example

Python 3.10.0 (default, May  2 2022, 21:43:20) [Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

# Let's import subprocess module.
>>> import secimport
>>> subprocess = secimport.secure_import("subprocess", allow_shells=False)

# Let's import os 
>>> import os
>>> os.system("ps")
  PID TTY           TIME CMD
 2022 ttys000    0:00.61 /bin/zsh -l
50092 ttys001    0:04.66 /bin/zsh -l
75860 ttys001    0:00.13 python
0
# It worked as expected, returning exit code 0.


# Now, let's try to invoke the same logic using a different module, "subprocess", that was imported using "secure_import":
>>> subprocess.check_call('ps')
[1]    75860 killed     python

# Damn! That's cool.

The dtrace profile for the module was saved under: /tmp/.secimport/sandbox_subprocess.d
The log file:
/tmp/.secimport/sandbox_subprocess.log
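
Because a violation kills the whole interpreter, one way to keep a service alive is to run the confined work in a child process and check how it exited. The sketch below does not use secimport at all. It assumes the sandbox terminates the process with SIGKILL (the shell output above reports “killed”) and simulates that with a child that kills itself. The helper name `run_supervised` is hypothetical, and the return-code convention shown is POSIX-only.

```python
import signal
import subprocess
import sys


def run_supervised(code):
    """Run `code` in a child interpreter and report whether it was killed by SIGKILL."""
    result = subprocess.run([sys.executable, "-c", code])
    # On POSIX, a negative return code -N means the child died from signal N.
    was_killed = result.returncode == -signal.SIGKILL
    return result.returncode, was_killed


# Stand-in for a sandbox violation: the child kills itself the way the sandbox would.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
returncode, was_killed = run_supervised(child_code)
print(returncode, was_killed)
```

A supervisor could then consult the log file mentioned above to find out which call triggered the kill.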

Conclusion

It seems like the security community needs a sandbox that’s capable of confining specific modules in your code while keeping it in the same process.
I presented a way to handle 3rd party code inside our codebase.

Source Code:
https://github.com/avilum/secimport

Examples: https://github.com/avilum/secimport/blob/master/docs/EXAMPLES.md

If this is possible for a dynamic language like Python, I’m sure the community will be able to implement similar instrumentation for other languages in a few lines of code.

Part 2: Securing PyTorch Models with eBPF

Thank you for reading this far.

If you liked this article, I welcome you to check some of my previous releases:

How I Discovered Thousands of Open Databases on AWS

My journey on finding and reporting databases with sensitive data about Fortune-500 companies, Hospitals, Crypto…

infosecwriteups.com

POC For Google Phishing In 10 Minutes: ɢoogletranslate.com

Back in 2016, I ran into a post about someone buying ɢoogle.com. It was used for phishing purposes (notice the first…

infosecwriteups.com

Identify Website Users By Client Port Scanning — Using WebAssembly And Go

Websites tend to scan the open ports of their users, from the browser, to identify new/returning users better. Can…

infosecwriteups.com

Facebook Knows What You Eat: Discover The Entire Data Facebook Collects About You, Step By Step.

A story of how I explored https://facebook.com/dyi programmatically.

medium.com

Written by Avi Lumelsky