A Twisted World: 2008

Extending Python with (open source) C++ - Win32 Process Handling

Motivation

My purpose this evening was to write a Python module (in C++) to deal with process handling. This is related to the API spying project I mentioned earlier, where I need to be able to (at least) list and terminate processes. The solution I currently employ relies on WMI (details), and it's painfully slow. Other options are using PDH queries (also slow) or PSAPI + ctypes from Python (faster, but butt-ugly).

Since I've wanted for quite some time now to get my hands dirty and write a Python extension in C++, I figured this would be an excellent opportunity. I also wanted to try out MinGW/Dev C++, so let's start from there.

MinGW, Dev C++ and DLLs

First, I downloaded MinGW and Dev C++. Dev C++ comes with MinGW bundled, but it hasn't been updated since 2005, so it's a rather old version of the compiler. I installed Dev C++, and then (after I ran into my first problems with Dev C++), I updated the compiler using the separately downloaded version.

Step 1. An empty DLL

In Dev C++: File -> New -> Project, check "C project", select "DLL" from the list and give it a name (in my case, vprocess); save it to an (empty) folder.

Remove the dll.h file from the project; it's not necessary (right-click it in the list on the left and click "Remove file").

Replace the contents of dllmain.c with the following (replacing vprocess with your module name):

#include <windows.h>
#include <Python.h>

__declspec(dllexport) void initvprocess();

PyMethodDef methods[] = {
 {NULL, NULL},
};

__declspec(dllexport) void initvprocess(void) {
    (void)Py_InitModule("vprocess", methods);
}

BOOL APIENTRY DllMain (HINSTANCE hInst, DWORD reason, LPVOID reserved) {
    return TRUE;
}

This is the template of a basic Python module; we'll study it in a moment. For now, notice that DllMain always returns TRUE (regardless of the "reason" of the call): we have nothing to initialize in this particular module. If initialization operations are required, they can be performed in the init[modulename]() function.

Step 2. A Python module is born

At this point we should save the source and set up project options. In Project -> Project Options:

"Compiler" tab: turn on optimizations (not necessary, but useful);
"Build Options" tab: click "Override output filename" and enter the module name; the extension must be .pyd so that Python can find it and recognise it as a (binary) module;
"Parameters" tab: click "Add Library or Object", browse to the Python installation folder, "libs" subfolder, select libpython25.a.

Note: yes, it's libpython25.a, NOT python25.lib, as one might assume. At first I linked against python25.lib, and everything went fine up until I used the Py_None object, at which point I got this lovely linker error:

[Linker error] undefined reference to `_imp___Py_NoneStruct'

After a number of failed attempts to fix it (Google didn't come up with a solution/explanation, although the problem appears to be pretty common), I finally noticed something about libpython25.a in one of the results and figured out I might as well try that too. Jackpot!

The final configuration step is adding the Python "libs" and "include" paths to the compiler directory list. Tools -> Compiler Options -> Directories; in the "Libraries" tab add the "libs" subfolder of your Python installation (e.g. C:\Python25\libs), and in the "C Includes" tab add the "include" subfolder.

At this point you should have a compilable, yet useless Python module. Time to put some meat on it.

Step 3. Python objects & Windows Processes

We'll implement two methods: one that kills a process given its PID, and one that lists all running processes and returns a dictionary whose keys are PIDs and whose values are the executable file names of the respective processes.

The module methods must be declared in the call to Py_InitModule() inside the (exported) init[modulename] function, as a table passed as the second argument. In the above code, the methods variable is that table: each entry holds the name of a module method, the function which implements it and the argument passing method (METH_VARARGS will do just fine in our case); a NULL entry marks the end. All methods (must) return a PyObject*. In this code they are also decorated with __declspec(dllexport), although strictly speaking only the init function needs to be exported, since Python reaches the methods through the table.

Step 3.1. killPID

First, let's look at the C code which, given a PID, obtains a handle of the corresponding process and calls TerminateProcess on it:

    HANDLE process;
    DWORD result;

    result = 0;
    process = OpenProcess(PROCESS_TERMINATE, 0, pid);
    if (process != NULL) {   // OpenProcess returns NULL on failure
       if (TerminateProcess(process, 0))
          result = 1;
       CloseHandle(process);
    }

The code is pretty straightforward: it tries to get a handle (with the PROCESS_TERMINATE access right) to the process with the given PID. If successful, it tries calling TerminateProcess on the handle and closes it. The pid variable will be obtained from the argument(s) passed to the function. If anything fails, the result variable is left at 0; if everything works properly, it is set to 1. Also note that for the process-handling code to work, the tlhelp32.h header file must be included.

All that's left to do is to get the pid from the arguments using PyArg_ParseTuple and return the result variable as a Python integer (using Py_BuildValue to convert it):

__declspec(dllexport) PyObject* vprocess_killPID(PyObject *self, PyObject *args) {
    // kills a process given by PID
    // returns 1 if successful, 0 otherwise
    HANDLE process;
    DWORD pid, result;

    result = 0;
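    if (!PyArg_ParseTuple(args, "k", &pid))
       return NULL;   // wrong argument type: let Python raise the exception
    process = OpenProcess(PROCESS_TERMINATE, 0, pid);
    if (process != NULL) {   // OpenProcess returns NULL on failure, not INVALID_HANDLE_VALUE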
       if (TerminateProcess(process, 0))
          result = 1;
       CloseHandle(process);
    }

    return Py_BuildValue("k", result);
}

As is usually the case, error checking has been forsaken in the name of simplicity (laziness also helped).

Step 3.2. snapshot

This function will take a snapshot of the running processes and generate a Python dictionary as described above.

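__declspec(dllexport) PyObject* vprocess_snapshot(PyObject *self, PyObject *args) {
    // generates a dictionary whose keys are process IDs and values are process names;
    // returns None if the snapshot fails
    PROCESSENTRY32 pe;
    HANDLE snapshot;
    PyObject *result, *key, *value;

    snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snapshot == INVALID_HANDLE_VALUE) {
       Py_INCREF(Py_None);
       return Py_None;
    }

    pe.dwSize = sizeof(PROCESSENTRY32);
    if (!Process32First(snapshot, &pe)) {
       CloseHandle(snapshot);   // do not leak the snapshot handle on failure
       Py_INCREF(Py_None);
       return Py_None;
    }

    result = PyDict_New();
    if (result == NULL) {
       CloseHandle(snapshot);
       return NULL;
    }

    do {
       // PyDict_SetItem does not steal references, so we release ours afterwards
       key = Py_BuildValue("k", pe.th32ProcessID);
       value = Py_BuildValue("s", pe.szExeFile);
       if (key != NULL && value != NULL)
          PyDict_SetItem(result, key, value);
       else
          PyErr_Clear();   // skip entries that could not be converted
       Py_XDECREF(key);
       Py_XDECREF(value);
    } while (Process32Next(snapshot, &pe));

    CloseHandle(snapshot);
    return result;
}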

Interesting here is the method of building a dictionary, item by item.

Step 4. Putting it all together

The last step is to inform Python about the exported functions:

PyMethodDef methods[] = {
 {"snapshot", vprocess_snapshot, METH_VARARGS},
 {"killPID", vprocess_killPID, METH_VARARGS},
 {NULL, NULL},
};
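Once the module is compiled, using it from Python might look like the sketch below. The helper pids_by_name is hypothetical (it is not part of the module); it works on any dictionary shaped like the one snapshot() returns, so it can be tried with hand-written data when vprocess.pyd is not at hand. Keep in mind that snapshot() returns None on failure, which a real caller would have to check for.

```python
def pids_by_name(processes, exe_name):
    """Return the sorted PIDs whose executable name matches exe_name.

    `processes` is a dict shaped like the one vprocess.snapshot() returns:
    {pid: executable file name}. Windows file names are case-insensitive,
    so the comparison ignores case.
    """
    wanted = exe_name.lower()
    return sorted(pid for pid, name in processes.items() if name.lower() == wanted)


# Hand-written stand-in for vprocess.snapshot(), for illustration only.
sample = {4: "System", 1320: "notepad.exe", 2044: "NOTEPAD.EXE", 3100: "opera.exe"}
notepad_pids = pids_by_name(sample, "notepad.exe")
print(notepad_pids)

# With the real module, something along these lines should work:
#   import vprocess
#   for pid in pids_by_name(vprocess.snapshot(), "notepad.exe"):
#       vprocess.killPID(pid)
```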

Source code, compiled module and example .py source available here.

CreateProcessInternal function prototype

Hooking process creation APIs

Today I had to deal with the problem of logging the creation of new processes in an API spying project. The options were to:

hook each function that deals with process creation separately (WinExec, ShellExecute*, ShellExecuteEx* and CreateProcess*), or
find a function called by all of the above (and if possible, the highest one in the call tree)

The latter is the obvious choice, since it requires much less code and is easier to parse in the generated logs. At this point, the choices were:

NtCreateSection (or better yet, NtCreateProcessEx) from ntdll.dll (the process of obtaining the process name from the handle passed to the function is described here), and
CreateProcessInternal from kernel32.dll, which unfortunately is an internal function that Google knows nothing about.

Again I chose the latter option, and did some digging to find out what gets passed to CreateProcessInternal(W).

CreateProcessInternalW

On Windows XP SP2 (possibly other versions too), it looks something like this:

DWORD WINAPI CreateProcessInternal(
  __in         DWORD unknown1,                              // always (?) NULL
  __in_opt     LPCTSTR lpApplicationName,
  __inout_opt  LPTSTR lpCommandLine,
  __in_opt     LPSECURITY_ATTRIBUTES lpProcessAttributes,
  __in_opt     LPSECURITY_ATTRIBUTES lpThreadAttributes,
  __in         BOOL bInheritHandles,
  __in         DWORD dwCreationFlags,
  __in_opt     LPVOID lpEnvironment,
  __in_opt     LPCTSTR lpCurrentDirectory,
  __in         LPSTARTUPINFO lpStartupInfo,
  __out        LPPROCESS_INFORMATION lpProcessInformation,
  __in         DWORD unknown2                               // always (?) NULL
);

If you're accustomed to the Win32 API, you've probably noticed that the arguments are the same as the ones passed to CreateProcess, except for the first and the last ones, which always appear to be NULL.

CreateProcessInternalA calls CreateProcessInternalW internally (no pun intended), and (at some point) so do all the other process-creation APIs mentioned above, so CreateProcessInternalW is an excellent API to hook in order to catch process creations. The lpCommandLine argument contains both the executable image path and the arguments.
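Since lpCommandLine carries both the image path and the arguments, a log post-processor may want to pull them apart. The function below is only a rough sketch in Python under a simplified rule (a leading quoted token is the image, otherwise everything up to the first whitespace); split_command_line is a made-up name, and Windows itself resolves unquoted paths containing spaces in a more involved way.

```python
def split_command_line(command_line):
    """Split a logged command line into (image, arguments).

    Simplified rule: a leading double-quoted token is the image path;
    otherwise the image is everything up to the first whitespace.
    """
    text = command_line.strip()
    if text.startswith('"'):
        end = text.find('"', 1)
        if end == -1:
            # Unbalanced quote: treat the rest of the string as the image.
            return text[1:], ""
        return text[1:end], text[end + 1:].strip()
    parts = text.split(None, 1)
    if not parts:
        return "", ""
    image = parts[0]
    arguments = parts[1] if len(parts) > 1 else ""
    return image, arguments


quoted = split_command_line('"C:\\Program Files\\Opera\\opera.exe" -newwindow')
plain = split_command_line("notepad.exe C:\\boot.ini")
print(quoted)
print(plain)
```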

A better alternative

A better solution would be hooking NtCreateProcessEx in ntdll.dll (prototype "documented" here), which is at the lowest possible level in user mode. The process name is in the ObjectName:PUNICODE_STRING field of the ObjectAttributes:OBJECT_ATTRIBUTES argument ("documented" here). For my purposes, however, CreateProcessInternalW was more than enough.

A small downside is losing the program arguments, but a solution could probably be found.

In the WTF?! department

As a piece of fun trivia, here is the implementation of the (internal) CreateProcessInternalWSecure API on the same Windows XP SP2:

; Exported entry 102. CreateProcessInternalWSecure
_CreateProcessInternalWSecure@0 proc near
C3                          retn
_CreateProcessInternalWSecure@0 endp

Yes, it's actually just a RET. Why in the world Microsoft would need a dummy **internal** API called CreateProcessInternalWSecure is far beyond my understanding. If it were published, backwards-compatibility would be a possible explanation, but since it's hidden, it makes very little sense.

How to: download videos from YouTube/Dailymotion from Python

Recently, thanks to a good friend, I stumbled upon a very expressive singer called Jacques Brel, or rather, upon some of his music videos on YouTube. Normally, when I find a video I want to keep, I use the (Java)script from 1024k.de. This time, however, I decided I wanted a script to which I could feed a set of links and which would download the .flv files all by itself. Here goes the process of building it:

1. Figure out how to get the link to the .flv from the video page source

This is easy to accomplish using URL Snooper.

First, (install and) start URL Snooper and press "Sniff Network"; the "Protocol Filter" should be set to "Show All".
Open the video page in your favorite browser (mine is Opera, by the way). Let's use this video as an example: Jacques Brel - Amsterdam.
In the keyword filter, copy & paste this: get_video?video_id. There should be two links showing up in the list at the bottom, a longer one and a shorter one, starting with "http://youtube.com/get_video?video_id=". What follows after that is what we need from the page source. We already know the video_id; it's pk7YxDzjTxA (the original URL being http://youtube.com/watch?v=pk7YxDzjTxA). From the page source we must somehow get the "t" parameter; this is the key to the whole thing.
A unique one is generated every time the page is requested, and a cookie linked to it is also stored. It usually starts with "OE" followed by some "random" letters and numbers (e.g. t=OEgsXoDSdfK8pTloMKr2p6gfC7hfAOsf).
Now that we know what we need, we must find the "t" parameter in the page source code. Searching for its OE[...] value, we find a line that starts with
```
var fullscreenUrl = '/watch_fullscreen?[...]
```
and at some point contains the "&t=OE[...]" parameter. Jackpot!

2. Download the video's page from Python & parse it

Downloading a page from Python is pretty straightforward; the twist in our case comes from the requirement to use cookies.

A (very) brief introduction to cookies in Python can be found here. Basically, you use a "cookie jar" which stores the cookies sent by the remote server; the jar is passed to the urllib2 opener. To keep things simple, we'll use the following download function:

def download(url, userAgent = '', cookieJar = None):
    args = []
    # an empty cookie jar is falsy (it defines __len__), so compare against None
    if cookieJar is not None: args.append(urllib2.HTTPCookieProcessor(cookieJar))
    uo = urllib2.build_opener(*args)
    uo.addheaders = [('User-agent', userAgent or 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')]
    lnk = uo.open(url)
    data = lnk.read()
    return data

Its arguments are the URL to be downloaded, the user agent (we spoof IE 6.0 if no user agent string is given; some sites weed out automatic requests based on the user agent, and we don't want that...) and the cookie jar, which we create like this:

cj = cookielib.LWPCookieJar('myCookieFile.cookie')

Obviously, the previous lines of code require urllib2 and cookielib to be imported.
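For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, a rough equivalent might look like the sketch below. The make_opener helper and the DEFAULT_USER_AGENT constant are names introduced here for illustration; nothing is fetched until download() is actually called.

```python
import http.cookiejar
import urllib.request

DEFAULT_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def make_opener(user_agent="", cookie_jar=None):
    """Build an opener with a custom User-agent and an optional cookie jar."""
    handlers = []
    # An empty jar is falsy (it defines __len__), so compare against None.
    if cookie_jar is not None:
        handlers.append(urllib.request.HTTPCookieProcessor(cookie_jar))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-agent", user_agent or DEFAULT_USER_AGENT)]
    return opener


def download(url, user_agent="", cookie_jar=None):
    """Return the body of `url` as bytes."""
    with make_opener(user_agent, cookie_jar).open(url) as response:
        return response.read()


# Creating the jar does not touch the disk; the file is only written on save().
cj = http.cookiejar.LWPCookieJar("myCookieFile.cookie")
opener = make_opener(cookie_jar=cj)
```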

The next step is to actually download the video's page and parse it using regular expressions to get the "t" parameter. Assuming the URL of the video gets passed as an argument to our script, this is how it would be done:

data = download(sys.argv[1], cookieJar = cj)
m = re.search('video_id=(.+?)&.+&t=(.+?)&hl=', data)
if not m:
   print 'Video ID/t not found!'
   sys.exit()
id,t = m.groups()

The above lines of code download the video page (actually, the first parameter passed to the script, which *should* be the video URL) and use a regular expression to find the video_id & t parameters. Of course we could get a list of URLs from a text file and pass the name of that text file to the script, but that's left as an exercise to the reader (don't you just love it when that happens?...).
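To see what the regular expression captures without touching the network, it can be run against a single line. The sample line below is made up, modelled on the fullscreenUrl line described earlier (the "l=212" parameter in particular is an assumption), and parse_video_params is a hypothetical helper name.

```python
import re

# Same pattern as in the script: video_id first, then t, then hl on one line.
VIDEO_PARAMS = re.compile(r"video_id=(.+?)&.+&t=(.+?)&hl=")


def parse_video_params(page_source):
    """Return (video_id, t) found in the page source, or None if absent."""
    match = VIDEO_PARAMS.search(page_source)
    return match.groups() if match else None


# Made-up line; a real page would carry a freshly generated "t" value.
sample_line = (
    "var fullscreenUrl = '/watch_fullscreen?video_id=pk7YxDzjTxA"
    "&l=212&t=OEgsXoDSdfK8pTloMKr2p6gfC7hfAOsf&hl=en';"
)
params = parse_video_params(sample_line)
print(params)
```

Because "." does not match newlines by default, the pattern only succeeds when all three parameters sit on the same line, which is the case for the fullscreenUrl line.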

3. Actually downloading the .flv video

Next step: getting the video file and saving it. Not much left to do; simply use the parameters we've got and the link format we know from URL Snooper:

video = download('http://www.youtube.com/get_video?video_id=%s&t=%s' % (id, t), cookieJar = cj)
open('%s.flv' % id, 'wb').write(video)

Et voila!

Summary

For the lazier people, here's a Python script that takes the URL of a video as an argument and downloads it:

import urllib2, cookielib, re, os, sys

def download(url, userAgent = '', cookieJar = None):
    args = []
    # an empty cookie jar is falsy (it defines __len__), so compare against None
    if cookieJar is not None: args.append(urllib2.HTTPCookieProcessor(cookieJar))
    uo = urllib2.build_opener(*args)
    uo.addheaders = [('User-agent', userAgent or 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')]
    lnk = uo.open(url)
    data = lnk.read()
    return data

cj = cookielib.LWPCookieJar('my.cookie')

data = download(sys.argv[1], cookieJar = cj)
m = re.search('video_id=(.+?)&.+&t=(.+?)&hl=', data)
if not m:
   print 'Video ID/t not found!'
   sys.exit()
id,t = m.groups()

video = download('http://www.youtube.com/get_video?video_id=%s&t=%s' % (id, t), cookieJar = cj)
open('%s.flv' % id, 'wb').write(video)

if os.path.isfile('my.cookie'): os.remove('my.cookie')

Usage example (assuming you saved the script as getvid.py):

getvid.py http://youtube.com/watch?v=pk7YxDzjTxA
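As for the exercise of feeding the script a text file full of links, reading the list could be sketched as follows (written for Python 3). The read_urls helper and the convention of skipping blank lines and lines starting with "#" are assumptions made for this example; each returned URL would then go through the same download steps as above.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty, non-comment lines of a text file, stripped."""
    urls = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


# Demonstration with a throwaway file holding one comment, one URL, one blank line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# Jacques Brel\nhttp://youtube.com/watch?v=pk7YxDzjTxA\n\n")
    list_path = tmp.name

urls = read_urls(list_path)
os.remove(list_path)
print(urls)
```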

Please note that this code is not meant to help people mirror the contents of YouTube.com... It's an attempt (feeble, perhaps) to present a hands-on approach to solving everyday tasks in Python, which hopefully some will find enlightening or at least a tiny bit helpful.

P.S.

And the reader thinks to himself in disappointment: "But you promised Dailymotion.com too in the title!". It's less challenging than YouTube, so I won't go through the process in detailed steps. We use the same algorithm as above (usable on almost any video site):

Using URL Snooper, you can find the link format in the same way as described above; it turns out to have the following pattern:
```
http://www.dailymotion.com/get/[some-number]/320x240/flv/[some-alphanums].flv?key=[hex-digits]
```
Looking through the page source, we find the pattern (url-encoded) in a line looking something like this:
```
[random-alphanums].addVariable("video", "%2Fget%2F16%2F320x240%2Fflv%2F[random-alphanums].flv
```
We extract the interesting portion with the following regular expression:
```
r'(%2Fget%2F[^"]+?\.flv%3Fkey%3D[^"%]+)["%]'
```
apply urllib.unquote (note that here it's urllib, not urllib2!) to the result and append it to the http://www.dailymotion.com host to get the full URL.
Having gotten the URL to the .flv file, we download it. In dailymotion.com's case, cookies aren't required.
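The extraction step can be sketched as below, written for Python 3, where unquote lives in urllib.parse. The sample line is invented to fit the pattern shown above (the "so" variable name, the file name and the key are all made up), and extract_flv_url is a hypothetical helper name.

```python
import re
from urllib.parse import unquote

# The pattern from the steps above: the url-encoded /get/...flv?key=... portion.
FLV_PATH = re.compile(r'(%2Fget%2F[^"]+?\.flv%3Fkey%3D[^"%]+)["%]')


def extract_flv_url(page_source, host="http://www.dailymotion.com"):
    """Return the full .flv URL found in the page source, or None."""
    match = FLV_PATH.search(page_source)
    if match is None:
        return None
    # Decode %2F, %3F and %3D back into "/", "?" and "=".
    return host + unquote(match.group(1))


# Invented line shaped like the addVariable call described above.
sample_line = (
    'so.addVariable("video", '
    '"%2Fget%2F16%2F320x240%2Fflv%2Fabc123.flv%3Fkey%3D0f1e2d3c");'
)
flv_url = extract_flv_url(sample_line)
print(flv_url)
```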
