Python: list vs. set

In Python we have several types of objects for storing values in a
array-like way: lists, tuples, dictionaries (they are really more like
hash tables) and sets. Lists and sets might look alike but they are
different. Lets do a quick test:

saghul@hal:~$ ipython
Python 2.6.6rc1+ (r266rc1:83691, Aug 5 2010, 17:07:04)
Type "copyright", "credits" or "license" for more information.
IPython 0.10 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.
In [1]: import random
In [2]: l = [random.randint(0, 100) for n in xrange(0,100)]
In [3]: s = set(l)
In [4]: def sum_items(iter):
...: num = 0
...: for i in iter:
...: num += i
In [5]: %timeit sum_items(l)
100000 loops, best of 3: 8.99 us per loop
In [6]: %timeit sum_items(s)
100000 loops, best of 3: 6.87 us per loop
In [7]: %timeit 69 in l
1000000 loops, best of 3: 1.99 us per loop
In [8]: %timeit 69 in s
10000000 loops, best of 3: 140 ns per loop
view raw gistfile1.txt hosted with ❤ by GitHub

In the above snippet, we can see that iterating over all items in a

list takes just a little longer tan a a set: 8.99us vs 6.87us we are
talking micro-seconds here!

Now, if we look at how long it takes to verify is the list or set
contains an object, we can see the difference: 1.99us vs 140ns. Sets
are an order of magnitude faster!

Why is that? We can think that a python list is like a C linked list.
It’s time complexity when checking for a value is O(n). Sets, on the
other hand, are implemented as hash tables, so the key is hashed and
instantaneously found (or not) with a time complexity of O(1).

For the curious, this is the list object declaration in Include/listobject.h

and this is the set object declaration, in Include/setobject.h

typedef struct {
long hash; /* cached hash code for the entry key */
PyObject *key;
} setentry;
This data structure is shared by set and frozenset objects.
typedef struct _setobject PySetObject;
struct _setobject {
Py_ssize_t fill; /* # Active + # Dummy */
Py_ssize_t used; /* # Active */
/* The table contains mask + 1 slots, and that's a power of 2.
* We store the mask instead of the size because the mask is more
* frequently needed.
Py_ssize_t mask;
/* table points to smalltable for small tables, else to
* additional malloc'ed memory. table is never NULL! This rule
* saves repeated runtime null-tests.
setentry *table;
setentry *(*lookup)(PySetObject *so, PyObject *key, long hash);
setentry smalltable[PySet_MINSIZE];
long hash; /* only used by frozenset objects */
PyObject *weakreflist; /* List of weak references */
PyAPI_DATA(PyTypeObject) PySet_Type;
PyAPI_DATA(PyTypeObject) PyFrozenSet_Type;
view raw gistfile1.c hosted with ❤ by GitHub

More on python and hash tables here:


Choosing a JSON library

When I download a library for using whatever API I usually see that people tend to use different JSON libraries, and sometimes I just don’t have it installed, but I know I got another one which could do the job just fine.

A while ago I ran across this while I was having a look at the Tropo WebAPI library ( and ended up adding the following code:

    import cjson as jsonlib
    jsonlib.dumps = jsonlib.encode
    jsonlib.loads = jsonlib.decode
except ImportError:
        from django.utils import simplejson as jsonlib
    except ImportError:
            import simplejson as jsonlib
        except ImportError:
            import json as jsonlib

Most used JSON libraries are ordered regarding speed. You might want to read a nice comparison between libraries here:


Creating a “smart” Twitter bot

A while ago I was bored at home and decided to play a bit with the Twitter API. As a result I created a simple Twitter Bot which retewitted everything with the *#asterisk* hashtag (>

It’s functionality was very simple andI didn’t even touch the (crappy) code for months, but Twitter decided to drop the basic authentication support in favor of OAuth so it was the right time for updating the bot.

Since the bot was created lots of other bots have, and it feels annoying to get the same tweet over and over again, so I though I’d try to make the bot a bit *smarter*. The main goal was to detect if a tweet was already tweeted (or RTd) not to repeat things. AFAIK there are 2 types of retweets:

  • “some text RT @user1 RT @user2 some super cool tweet here!”
  • “some super cool tweet! (via @user)”

The current bot detects both formats by using regular expressions and extracts the tweet itself. However, I felt this could not be enough, because we could get the retweet *before* the real tweet so it’s be great to save the entire tweet and then look for it.

As the bot was using a SQLite database I decided to use its *Full Text Search* (fts3) capability, inspired by a colleague. With FTS searching for an entire string is amazingly faster than doing a regular SELECT query. Here is an example taken from the SQLite website

CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT); /* Ordinary table */

SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */

The (silly) code for this bot is available on GitHub:
Happy tweeting!

pydmesg: dmesg with human readable timestamps

Since I used dmesg for the first time I felt there was something wrong about it. Having very accurate timestamps might of course be helpful for many people, but sometimes you just want to know when something happened.

dmesg prints timestamps in the form of seconds.nanoseconds since the system booted. And no, there seems to be no -h option to make it human readable.

Today I felt like writing some Python for that, and pydmesg is the result. Is a simple script that fetches the uptime from /proc/uptime and uses it to print nice dmesg timestamps (timestamp format can be changed by editing the file).


[499902.343696] uvcvideo: Failed to query (1) UVC ...
[499902.354633] uvcvideo: Failed to query (1) UVC ...
[530442.358520] npviewer.bin[8818]: segfault at ...


[2010-08-21 13:12:37] uvcvideo: Failed to query (1) UVC ...
[2010-08-21 13:12:37] uvcvideo: Failed to query (1) UVC ...
[2010-08-21 21:41:37] npviewer.bin [8818]: segfault at ...

By default precision is set to the second, which I guess is ok for
human beings ;–)

#!/usr/bin/env python
# coding=utf8
# Copyright (C) 2010 Saúl ibarra Corretgé <>
pydmesg: dmesg with human-readable timestamps
from __future__ import with_statement
import re
import subprocess
import sys
from datetime import datetime, timedelta
_datetime_format = "%Y-%m-%d %H:%M:%S"
_dmesg_line_regex = re.compile("^\[\s*(?P<time>\d+\.\d+)\](?P<line>.*)$")
def exec_process(cmdline, silent, input=None, **kwargs):
"""Execute a subprocess and returns the returncode, stdout buffer and stderr buffer.
Optionally prints stdout and stderr while running."""
sub = subprocess.Popen(cmdline, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **kwargs)
stdout, stderr = sub.communicate(input=input)
returncode = sub.returncode
if not silent:
except OSError,e:
if e.errno == 2:
raise RuntimeError('"%s" is not present on this system' % cmdline[0])
if returncode != 0:
raise RuntimeError('Got return value %d while executing "%s", stderr output was:\n%s' % (returncode, " ".join(cmdline), stderr.rstrip("\n")))
return stdout
def human_dmesg():
now =
uptime_diff = None
with open('/proc/uptime') as f:
uptime_diff =[0]
except IndexError:
uptime = now - timedelta(seconds=int(uptime_diff.split('.')[0]), microseconds=int(uptime_diff.split('.')[1]))
except IndexError:
dmesg_data = exec_process(['dmesg'], True)
for line in dmesg_data.split('\n'):
if not line:
match = _dmesg_line_regex.match(line)
if match:
seconds = int(match.groupdict().get('time', '').split('.')[0])
nanoseconds = int(match.groupdict().get('time', '').split('.')[1])
microseconds = int(round(nanoseconds * 0.001))
line = match.groupdict().get('line', '')
t = uptime + timedelta(seconds=seconds, microseconds=microseconds)
except IndexError:
print "[%s]%s" % (t.strftime(_datetime_format), line)
if __name__ == '__main__':
view raw pydmesg hosted with ❤ by GitHub

print “hello world!”


I used (and still do!) blog about VoIP in Spanish at, but at some point I wanted to blog a bit on coding stuff, specially Python, and I’m starting today.

Frankly, I’m tired of maintaining a complete blog/CMS like WordPress, I wanted something simple with which I could share some code snippets. Being a blog about code, syntax highlighting was a must. Posterous does a nice job on this by supporting Markdown and algo integration with Gist, a great service GitHub provides. Moreover, Posterous allows me to post by sending an email, I’ve never seen something like this!

So, consider this the hello world :–)

print "hello world!"