concept: employer company filter

Last week I was presented with an interesting challenge. I’ve been looking around a bit for companies I would like to work for. I talked to a recruiter who had no idea how to slim down his list of clients, so he just passed me the whole list. It contained a little over 250 company names. A lot for me to read through!

As a developer I tend to be picky about the companies I would be interested in, and even pickier about the ones I will talk to. So I knew that out of those 250 company names at least 50% were useless to me, and probably even 75%. I’m not the type of person who will only say yes to A, B and C. I’m more of a person who will say I’m not going to go for D, E and F. For me personally it comes down to this:

  1. I’m not doing agencies, I already have my own.
  2. I will not work for corporations working on a product that has been out for decades.
  3. I prefer start-ups but will also consider any other start-up type of construction. (Note my definition of start-up is very narrow compared to the sense that it’s being used today!)
  4. I look for projects that try and make a dent in the world. Things that try to disrupt or try and add a new dimension to society.

So because I’m a developer I made some code for that!


It’s complicated! I’m not publishing the code as it’s ripe for the scrapyard, but the basic concept should be enough to get you going if you would like to reproduce it!


Getting data

Looking at the preferences, 1 and 2 are probably the easiest to filter on and would reduce my list of 250 fairly quickly. So I decided to run JSON searches against Google to find the correct website for each company, and then scrape each website for the home page text and the about page text if that existed. It didn’t work straight away, but with some tweaking I got it done in Ruby within 40 minutes. I stored all the website text in MongoDB, and then the hard part started!
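To give an idea of the scraping step, here is a minimal sketch of how fetched HTML could be turned into plain text and shaped into one record per company. This is not my original code. The helper names `page_text` and `company_record` and the record layout are made up for illustration, and stripping tags with regular expressions is a shortcut; a real HTML parser would be more robust. Fetching the pages and writing to MongoDB are left out on purpose.

```ruby
# Hypothetical helper: reduce raw HTML to plain, searchable text.
# Regex-based tag stripping is an assumption made to keep the sketch stdlib-only.
def page_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop inline scripts and CSS
  text = text.gsub(/<[^>]+>/, " ")                        # drop all remaining tags
  text.gsub(/\s+/, " ").strip                             # collapse whitespace
end

# Hypothetical record shape: one document per company, ready to be stored.
# The about page is optional because not every site has one.
def company_record(name, url, home_html, about_html = nil)
  {
    name: name,
    url: url,
    home_text: page_text(home_html),
    about_text: about_html ? page_text(about_html) : nil
  }
end

sample_html = "<html><head><style>p { color: red }</style></head>" \
              "<body><h1>Example BV</h1><p>We build and ship.</p></body></html>"
record = company_record("Example BV", "https://example.com", sample_html)
puts record[:home_text]
```

Keeping the home and about text in separate fields makes it easy to search them separately later on.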

Filtering agencies

So I had all the data, but now I needed to filter out the companies that would not be interesting to me. I started with agencies, as in my mind those were the easiest to find. I was working with a mixed Dutch/English set of websites, so I loaded the data into an Elasticsearch index and started doing some searches by hand. This allowed me to experiment a bit and figure out what would and wouldn’t work. For agencies I looked in the text for some of the obvious words like “agency” and “bureau”. I also searched for “full-service”, “clients”, “portfolio” and “klanten” (Dutch for “clients”) with a frequency factor, so that a company would also count as a “hit” if these words were found in the text several times. This slimmed down my list of 250+ to about 110 company names. More than 50% gone on the basis that they presented themselves as an agency on their website.
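The idea behind the agency filter can be sketched in plain Ruby without Elasticsearch. The word lists come from the paragraph above; the threshold of 3 is an assumption (I never wrote down the exact value), and the method names are made up for this example. It matches whole words only, which roughly mirrors a tokenized search but will miss Dutch compound words.

```ruby
# Words that mark a company as an agency on a single occurrence.
STRONG_TERMS = %w[agency bureau].freeze
# Words that only count when they show up repeatedly.
WEAK_TERMS = %w[full-service clients portfolio klanten].freeze
# Assumed threshold: the value used in the original scripts is not known.
WEAK_HIT_THRESHOLD = 3

# Count whole-word, case-insensitive occurrences of a term in the text.
def count_term(text, term)
  pattern = /(?<![[:alnum:]])#{Regexp.escape(term)}(?![[:alnum:]])/
  text.downcase.scan(pattern).size
end

# A text is a "hit" if it contains any strong term, or if the weak terms
# together appear at least `threshold` times.
def agency?(text, threshold: WEAK_HIT_THRESHOLD)
  return true if STRONG_TERMS.any? { |term| count_term(text, term) > 0 }

  WEAK_TERMS.sum { |term| count_term(text, term) } >= threshold
end

puts agency?("We are a full-service digital agency.")
puts agency?("We build one product for our clients.")
```

Tuning the threshold by hand against a few known agencies is probably the quickest way to find a value that works for your own list.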

Filtering corporations

This was a tough one! I tried running a list of well-known company names through the database, but it only yielded a couple of results. So I used LinkedIn to get a list of companies in the Netherlands with 500 or more employees. With some fiddling with a private API key I was able to get it done. I cross-referenced that list with mine and was able to remove another 30 companies, leaving me with about 80 companies.
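Cross-referencing two lists of company names only works if the names are comparable, so a sketch of that step could look like this. The normalization rules (lowercasing, dropping punctuation and trailing legal forms such as B.V. or N.V.) are my assumption of what is needed for a Dutch list, and the method names are invented for the example. The LinkedIn lookup itself is not shown.

```ruby
require "set"

# Assumed list of legal forms to ignore at the end of a company name.
LEGAL_SUFFIXES = %w[bv nv inc ltd llc].freeze

# Make names comparable: lowercase, strip punctuation, drop trailing legal forms.
def normalize_name(name)
  tokens = name.downcase.gsub(/[^[:alnum:]\s]/, "").split
  tokens.pop while tokens.size > 1 && LEGAL_SUFFIXES.include?(tokens.last)
  tokens.join(" ")
end

# Keep only the candidates that do not appear in the list of large companies.
def remove_large_companies(candidates, large_companies)
  large = large_companies.map { |name| normalize_name(name) }.to_set
  candidates.reject { |name| large.include?(normalize_name(name)) }
end

candidates = ["Philips", "Tiny Startup B.V.", "ING Bank N.V."]
large      = ["Philips N.V.", "ING Bank"]
p remove_large_companies(candidates, large)
```

Exact matching after normalization will still miss names that are written very differently in the two lists, so a quick look by hand at the leftovers remains useful.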

Filtering by hand, first try

I felt secure enough to have a go at looking through them by hand and judging for myself. Assessing a company by hand took me at least 2 minutes, and after 8 companies I had already stumbled upon 3 advertising companies. I have nothing against advertisers, and I’m aware the web is largely free because of them, but I’m not that interested in building marketing and advertising platforms if they’re not connected to an actual product that users get to use. So I quickly did another search, like the one I used for agencies, to filter these companies out as well, leaving me with about 40 companies to look through.

Filtering by hand, second try

So, 2.5 hours later, plus one hour of searching by hand, I had a couple of company names I could give back to the recruiter. That is 3.5 hours spent filtering 250 companies. A quick calculation tells me that, since I did the last 40 companies in about 60 minutes, going through all 250 by hand would have taken me about 6 hours and 15 minutes. Code wins again!
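For those who want to check the quick calculation, here it is spelled out. The numbers are the ones from the paragraph above; the method name is just something I picked for this example.

```ruby
# Extrapolate the time needed to check every company by hand from a timed sample.
# Returns [hours, minutes].
def by_hand_estimate(total, sample_size, sample_minutes)
  minutes_per_company = sample_minutes.to_f / sample_size # 60 / 40 = 1.5 minutes
  (minutes_per_company * total).round.divmod(60)          # 375 minutes in total
end

hours, minutes = by_hand_estimate(250, 40, 60)
spent_hours = 2.5 + 60 / 60.0 # time before the manual pass plus the manual hour

puts "All by hand: about #{hours} hours and #{minutes} minutes"
puts "Actually spent: #{spent_hours} hours"
```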

Problems & pitfalls

So, as always, it didn’t go completely without problems. The last search by hand also turned up a company without a website and a company that had closed about a year ago while its website was still up. Although I could blame the recruiter for that, I do have to keep in mind that all input is EVIL! So running into these types of problems should be considered normal.

I did have to use some private API keys to get to the data from LinkedIn, and I really had to pace my requests to the Google search API, as I was kicked out a couple of times. Although it cost me nothing, the scripts would have been much more efficient if these types of APIs allowed bulk searches instead of requiring a separate search for every name. But then again, that type of search is an edge case not many would need. The business search was very useful in some cases, but I found irregularities in its results as well.
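Pacing one search per company name could be sketched like this: pause after every request and back off for longer when the API kicks you out. The `RateLimited` error, the `each_paced` method and the pause and retry values are all hypothetical; a real client would raise its own error type and publish its own limits. The actual search call is passed in as a block, so nothing here depends on a specific API.

```ruby
# Hypothetical error a search client could raise when it is being throttled.
class RateLimited < StandardError; end

# Run the block once per name, pausing between calls and backing off
# exponentially when the block raises RateLimited.
# `sleeper` is injectable so the pacing can be tested without real waiting.
def each_paced(names, pause: 1.0, max_retries: 3, sleeper: ->(seconds) { sleep(seconds) })
  names.map do |name|
    attempts = 0
    begin
      result = yield(name)
      sleeper.call(pause) # be polite between successful requests
      result
    rescue RateLimited
      attempts += 1
      raise if attempts > max_retries
      sleeper.call(pause * (2**attempts)) # wait 2x, 4x, 8x the pause
      retry
    end
  end
end

# Usage with a stand-in for the real search call:
websites = each_paced(["Example BV", "Another BV"], pause: 0.01) do |name|
  "https://#{name.downcase.delete(' ')}.example"
end
p websites
```

The doubling delay is a common choice, not something the APIs I used prescribed.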

So… WHY!

Okay… it seems kind of overkill that I actually went through the list at all. 250 companies, a yield of only 16% viable companies, and about 1% after doing the manual search!? Well, when I received the list that was my first response as well: “Why the hell should I have a look at this?” But then I was like: “Why not? I haven’t done a lot of coding this week and I have something to do while waiting for responses from other companies!” So I spent some time on it, learned some new things, did what I like doing and got a couple of new companies out of the exercise as a bonus!

Where is the code?

Basically the code was part actual tasks, some command-line hacking and lots of experiments without any tests. Should I release it? Probably not! Right now the code I have lying around is ready for the scrapyard: poorly written, undocumented, and it would probably be best if it was rewritten. I’ve told you how you could do it, so if you want to spend some time doing it properly, do so! If you ask nicely I might even give you a helping hand.
