Stavros' Stuff

On programming and other things.

Writing a full-text search engine using Bloom filters

Search engines and Bloom filters: A match made in heaven?

A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.

How an inverted index works

An inverted index is a data structure that basically maps every word in every document to the ID of the document it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, that is only document with ID 3).

However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there. What static site generators do is generate every required file when generating your site, then making those available for the client to download. Usually, search-engine plugins limit themselves to tags and titles, to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy, use a

Continue reading…

Reject unPEP8 pushes with git hooks

PEP8 or GTFO

Like many people in our profession, I am a bit of a stickler when it comes to style and consistency. I like my code (and my team’s code) to have a specific style, and to stick to it, which is why I love tools like go fmt.

The Python equivalent is, of course, PEP 8. Although I don’t agree with everything (I like spaces but think tabs make more sense in theory, because they separate presentation and formatting), I observe PEP 8 because I think it is important to have one consistent style, no matter what it is.

Sometimes, however, code which doesn’t follow that style slips past my editor. In those cases, I would like my git server to reject my commits and tell me what’s wrong so I can fix it.

Continue reading…

No-hassle emails from your home server

Emails: Why is sending them so hard?

Since I like to think I’m pretty handy with computers, I like to have one of them running all the time. To that end, I bought an HP ProLiant Microserver some time ago, and always keep it on. It’s a great little low-power server, and I use it for things like storing all my files so I can access it from computers around the house, automatically making on- and off-site backups every day, allowing access to my home networks, etc.

This setup is fairly straightforward, but there is one considerable pain point: The server needs to send email. It’s currently using ZFS with RAID-Z to store my files, and I would like to be notified if there are some disk errors so I can replace the disks. The system can easily send emails, but having those emails actually get delivered and not go into spam is a very different matter.

Mail nowadays: Not so easy.

If you’ve ever tried to run your own mailserver, you know that ensuring deliverability is hard. Mail servers on the internet today are very suspicious of new servers, due to the spam problem, so if you just start sending email from your new server and your home connection, one of two things will happen: Best case, it will end up in your spam inbox, and, worst case, it will be blocked and you’ll never hear about it.

The consequences of this can range from the inconvenient to the disastrous, depending on how important the email that was being sent is, so it’s easy to see that our server’s sending capabilities have to be as robust as possible. However, this isn’t a production mail system in a large application, so how can we solve the problem without spending a lot of time on it? Well, there’s a very easy way.

Continue reading…

Winning at Candy Crush

The Candy Crush Saga saga

I find Flash games on Facebook great fun. Not playing them, of course, that’s boring. As you may remember from my previous post, “winning at Puzzle Adventures“, I like to take a look into their guts and figure out how they work, and whether or not I can get insane scores with no effort.

When I discovered Candy Crush Saga, I was intrigued. All my friends appeared mad about this game, sending me so many requests for candy that their dentist would surely commit harakiri. I started playing a bit, and it wasn’t long until I had to stop playing, since the game only allows you a set number of lives per hour in an attempt to either extract money from you or coax you into spamming your friends with requests for the game, to increase its popularity.

Cheating at online games

This, however, wouldn’t do, so I fired up the Swiss army knife of web debugging, Charles Proxy (it’s a fantastic tool for this job). I started looking at the requests the game was making to the server, and saw one that looked promising:

Continue reading…

Writing a FUSE filesystem in Python

Turns out FUSE filesystems are ridiculously easy!

If you’re a regular reader, you might have noticed that I’ve been on a quest for the perfect backup program, and ended up writing my own encryption layer over bup.

While writing encbup, I wasn’t very satisfied with having to download the entire huge archive just to restore a file, and still wished that I could use EncFS together with rdiff-backup to have true remote-mountable, encrypted, deduplicated, versioned backups.

Trying obnam again (spoiler: it’s still pretty slow), I noticed that it included a mount command. Looking at it, I discovered fuse-python and fusepy, and realized that writing a FUSE filesystem in Python is pretty much trivial.

The astute observer will have already realized where I’m going with this: I decided to write an encrypted filesystem layer in Python! This layer would be very similar to EncFS, with a few crucial differences:

Continue reading…

Encrypted, deduplicated remote backups

Why are secure backups so hard?

Note: Be sure to check the sequel to this post, about the program that will supersede this one and be compatible with all backup utilities.

Backing things up is important, and, luckily, there are many high-quality services geared to everyday people that are very easy to use and cheap. Unfortunately, I am not everyday people, as I am very paranoid and insist that absolutely nobody be able to see my photos of my dog and lawn. It’s a matter of privacy.

To that end, I’ve long been looking for a secure/encrypted backups service, but I haven’t managed to find a single service or tool that fulfils my requirements:

  • Cheap to store data on (~$30 per year for

Continue reading…

How to pronounce "gyros" (the greek food)

Pronounce "gyros" correctly, impress your friends!

In a recent discussion on reddit, it seems that a lot people are wondering how the word “gyros” is actually pronounced. As a Greek, I feel it is my duty and responsibility to clear this right up, authoritatively demonstrating exactly how it’s done.

Here’s how a Greek (me) pronounces the word “gyros” (you might have to enable plugins or something to listen to this, it’s a SoundCloud recording):

Continue reading…

DIY internet-enabled bathroom scale

Wherein my weight is broadcast live to the good people of the internet.

A few days ago, I looked under the couch and found my dusty, disused Wii Balance Board. I bought it years ago, when I was a bit chubbier and thought Wii Fit might help me lose some weight and become fitter. It worked very very well, although I think it was mostly because I didn’t want to eat junk any more, as that would mean that the mind-numbingly boring hour of exercise I just did would be for naught.

For those of you who don’t know what a Wii Balance Board is, it’s the bastard offspring of a bathroom scale and a step pad. It connects to the Wii via Bluetooth, and it can weigh you and also tell which way you are leaning.

Seeing the board, I thought it would be fun to connect it to the computer and try to read the weight values from a script. I started by trying to pair it with the computer, and

Continue reading…

Authentication and rate limiting

Bank websites: Intelligently designed, or randomly evolved?

Yesterday, I tried to log in to my bank’s website for the first time in a few months. I couldn’t remember my password, because I change them frequently, so I tried a password, then another, and then another, which is, I hope, what most reasonable people do when they forget their password.

To my great dismay, after the third attempt, I got a message saying “Your account has been locked. Please call the bank to unlock it”. Given that this is my company bank, which is in the UK, and I am in Greece, this is extremely inconvenient. I now hate my bank (more than before).

Here are a few tips, if you are developing any sort of application that has authentication/logins, although I feel I will be preaching to the choir:

Continue reading…