1 – Deep Web: many reviewers… but how many users?
Many internet posts leave no room for doubt: the so-called Deep Web (DW) is definitely a bad place. Would you like your checking account emptied? Aren't you afraid someone could spy on you? Or, worse still, that the police might burst into your flat and arrest you as in an American crime movie? If you aren't chasing things like that, you'd better stay in your daily sandbox made of Google, Facebook and so on. A lot of reviewers are quite sure of it: among them, names such as Wired, Le Iene, Repubblica.
Still, I have never fully shared this view. Probably it's because I never liked any 'for dummies' guide. Or it could be that several arguments in many of the cited posts contradict one another. It could also be, frankly speaking, that the current Internet we are all used to doesn't sound to me any better than its Nineties version. Even the impression that those who condemned the whole DW without hesitation based their judgement only on a quick tour and a few quotations, sometimes only partially understood, played a role.
Anyway, in this article I decided to attempt a different reflection on the DW, with the main aim of understanding how big it really is and what we could lose by continuing to ignore it.
2 – Deep Web and Surface Web in some alternative interpretations
First of all, what is the DW? It's easy, say a lot of reviewers (almost always relying on a BrightPlanet report for NDS, dated 2001): it's anything that cannot be indexed by the usual search engines such as Google, Bing, Yahoo and so on. Obviously, this also defines, by opposition, the "surface" or "ordinary" web.
And, second, what is the argument that in most cases follows? Simply that the DW is huge, obscure and dangerous: riding the metaphor of the deep ocean, in full coherence with a great deal of web-business rhetoric. A rhetoric more and more committed to putting across a well-known message: you can totally trust the safety of what you're doing, as long as you surf the internet looking only for "normal" websites and you have a set of classic software and technologies such as antivirus, anti-malware, firewall and so on.
Many elements, however, make this sound at least questionable. Not only because some authors (mainly antagonist voices, but also mainstream ones such as Federico Rampini) have pointed out that the ordinary web is not so different from the mighty Big Brother, due to its systematic anti-privacy data-collection strategies. There is also a significant number of cases in which several IT majors have been found guilty of remarkable episodes of tax evasion and mistreatment of workers. There are, in addition, the so-called shock sites (rotten.com and the like), proving that even the supposed absence of macabre and disturbing content from the indexed web is not such a solid argument. Keywords like ransomware, Snowden, Stuxnet, Hacking Team and, recently, Vault 7 finally point to another key concept: villains are not only in the DW, and probably cannot be found only among activists and little-known groups and institutions.
Many theses, based on the idea that the ordinary web and its top players (Jobs, Zuckerberg, Page & Brin etc.) are not as good as they look, reconsider the DW, often stressing the authenticity perspective. The DW would be, in other words, the only web where the spirit that marked the early internet still survives today: fundamentally anarchist, strongly committed to personal freedom, and probably embedded in the philosophy of the underlying technical infrastructure (TCP/IP) as well. A scenery where almost everyone was a hacker, that is, someone who uses something in a creative manner, often in a way not fully intended or documented.
From this point of view, surfing the DW without prejudice gives us plenty of confirmation. As we'll see better in the next parts of the article, topics like hacking, IT security and freedom of information are very well developed in "deep" pages, whereas they are significantly shrinking in "surface" ones. There, legal regulation (especially on copyright), together with marketing and public-security needs and, in some countries, censorship, has made posting borderline content harmful for those who write it. So these subjects moved away from the surface towards the deep environment. "Borderline" may mean, in some countries and at some moments, contrary to the law, but it does not necessarily mean opposite to ethics or lacking any moral justification, as happens with some content related to ICT full disclosure, extreme environmentalism, no-global or anti-copyright movements and so on.
3 – What does it contain and… how large is it?
We have probably figured out that the DW is not evil as a whole, or at least that someone could look at some of its parts this way. Moreover, how many more or less "acceptable" areas are there, relative to the total (from a perspective more ethical than legal, for the reasons explained above)? And, finally, how vast is this terrifying DW, particularly compared with the ordinary web where this article is posted?
Here we touch another key point of the reasoning, one that could help partially dismantle the media narrative of a DW "huge and unfathomable". The size ratio between deep and ordinary web reported almost everywhere is 400:1, that is 8,000:20 TB, as these are the figures given in the previously cited BrightPlanet white paper. Almost nobody has written anything about the consequences of this fact. Almost everyone, indeed, has accepted it without hesitation, simply pairing those numbers with a pretty photo or infographic displaying an iceberg.
These figures are, however, quite problematic, especially if we try to reconcile the 20 TB estimate with some well-known and widely accepted evaluations of today's internet. These assessments state about 1 billion hosts, 2 MB as the weight of a standard web page, and an average of 5 pages per host (or, in other words, per website). The doubt is, basically, the following: can we accept that the whole internet of 2001 fitted into 2,000 ordinary PCs or laptops of 16 years ago (assuming 10 GB as the standard space provided by an average hard disk in 2001)? Is it reasonable that, supposing the 400:1 ratio has remained constant over these years, today's DW contains 4.4 EB (1 EB = 1,000,000 TB), about 1/3 of the whole storage provided by Google's data centres (15 EB)? How many activists (or terrorists, arms dealers, paedophiles and so on) would be needed to write and maintain all these pages?
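To make the arithmetic above easy to check, here is a minimal back-of-envelope script; every input in it (20 TB, the 400:1 ratio, 10 GB disks, 1 billion hosts, 5 pages per host, 2 MB per page) is one of the rough estimates discussed in the text, not a measurement:

```python
# Back-of-envelope check of the figures discussed above.
TB = 10**12                       # bytes in a terabyte
EB = 10**18                       # bytes in an exabyte

surface_2001 = 20 * TB            # BrightPlanet surface-web estimate (2001)
deep_2001 = 8000 * TB             # BrightPlanet deep-web estimate (2001)
ratio = deep_2001 / surface_2001  # the famous 400:1

disk_2001 = 10 * 10**9            # ~10 GB average hard disk in 2001
pcs_needed = surface_2001 / disk_2001  # PCs needed to hold the 2001 surface web

# 2017 extrapolation: ~1e9 hosts x ~5 pages/host x ~2 MB/page
surface_2017 = 10**9 * 5 * 2 * 10**6
deep_2017 = surface_2017 * ratio  # supposing the 400:1 ratio still holds

print(ratio)            # 400.0
print(pcs_needed)       # 2000.0
print(deep_2017 / EB)   # 4.0 -- same order of magnitude as the 4.4 EB above
```

The small gap between 4.0 and 4.4 EB only reflects rounding in the per-page weight; the order of magnitude, which is what the argument needs, does not change.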
This is obviously not the right place for too much maths (which should take into account many other elements, first of all traffic figures), but it may already be clear that a different explanation of the 4.4 EB is possible, and probably needed. The simplest and, in many ways, most solid hypothesis is the following: the DW is enormous only as a consequence of an incorrect definition, which lumps into it almost the whole "behind the scenes" of the surface web.
There is nothing shady, difficult or, in general, particularly interesting in the vast majority of the "deep" exabytes. These are merely server-side pages, built dynamically by retrieving information from often very large internal databases: many online marketplaces are typical examples. Often those pages sit behind a search form that stops most bots; they are therefore invisible to spiders and not indexed, nothing more, nothing less. Many other non-indexed bytes can then be associated with streaming content, as happens with YouTube. In this case, the HTML page containing the static content (text, images, styles) generally amounts to no more than the classic 2 MB. The music video itself is only represented by a URL, whose bytes are sent directly to the client device, which manages buffering and visualisation. In a few words: 2 MB are the surface (indexed) content, 200 MB (a standard 10-20 minute MP4) are the deep content: a 1:100 ratio, similar in magnitude to the mysterious 1:400 reported previously.
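The per-page deep/surface ratio just mentioned takes one line to reproduce; both figures are, of course, the rough assumptions from the text rather than measured values:

```python
# Deep vs surface bytes for a typical streaming page (rough assumptions).
indexed = 2 * 10**6       # ~2 MB of static, indexed HTML/CSS/images
streamed = 200 * 10**6    # ~200 MB MP4 reachable only through that page

print(streamed // indexed)   # 100 -> a 1:100 deep-to-surface ratio
```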
Well then, adopting a different definition of the DW, more coherent with today's internet, it's quite obvious that a significant part of the media narrative simply starts to vanish. What finally remains, therefore?
There are, first, P2P contents, whose behaviour is partially similar to the video and dynamic contents discussed above, as they start existing on the web only after a peer-to-peer connection has been established. It's likely that their contribution, in terms of average online content (depending on the number of seeders, their online time and the number of downloaders), is far less than the corresponding 10% share of total traffic currently attributed to them (Cisco). One could guess, then, an occupation of a few percentage points of the total DW size. The real danger and immorality implied by these contents are, certainly, very variable: a torrent file can link, for example, to a perfectly ethical and worldwide law-compliant ISO image of a Linux distribution, to some ripped audio content, to paedo-pornographic material and so on.
There are, second, many normal web pages (often personal or amateur) not indexed by the main search engines. You can surf them with any web browser and, obviously, they can't host anything too strange, as anyone (police and governments included) can visit them without any particular trick.
There are, finally, all the contents that are inaccessible using ordinary tools, such as those in networks like TOR, I2P, Freenet, IRC, XMPP, OnionCat and so on. How much static space (according to the original "8,000 TB" definition) would they require? The hosts of the TOR environment were assessed at 100,000 in 2001 and 300,000 in 2007, so a current figure of about 500,000 hosts is quite reasonable, corresponding to 5-10 TB of web space (compared, once again, with some EB of total web size). These pages are said to stay online, on average, less than three days (video).
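The 5-10 TB figure follows directly from the same per-host assumptions used earlier; note that the host count itself is only the extrapolation proposed in the text, not a measured value:

```python
# Static space implied by the extrapolated host count (article's guesses).
hosts = 500_000            # extrapolated TOR host count for today
pages_per_host = 5         # average pages per host, as assumed earlier
page_bytes = 2 * 10**6     # ~2 MB per page, as assumed earlier

total = hosts * pages_per_host * page_bytes
print(total / 10**12)      # 5.0 TB -- the low end of the 5-10 TB range
```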
A full analysis of this so-called darknet is very difficult. First, you cannot simply surf its pages using a traditional browser: you have to use a specific one, based on TOR (The Onion Router), which, using a multi-tier technology that in some ways recalls the layers of an onion, hides your IP address and thus a significant part of your personal data. Although installing TOR (or using a corresponding online service like TOR2WEB) is not a complex operation, it still cuts off a lot of internet users with low technical skills. And anyway, merely using TOR or TOR2WEB (simpler but highly insecure), you can reach only some plain or non-restricted .onion websites, such as The Pirate Bay. A lot of other domains remain out of your reach, because they use more powerful technologies to allow exploration only by specific and really interested users; sometimes, or maybe often, interested in something immoral or illegal.
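The multi-tier, onion-like idea can be illustrated with a toy sketch: the client wraps the payload once per relay, and each relay peels exactly one layer, learning only the next hop and never the whole route. This is a purely didactic model (base64 layers instead of real encryption, invented relay names), not how TOR is actually implemented:

```python
import base64

def wrap(payload: bytes, relays):
    """Client side: add one 'layer' per relay, innermost layer first."""
    for relay in reversed(relays):
        payload = base64.b64encode(relay.encode() + b"|" + payload)
    return payload

def peel(payload: bytes):
    """Relay side: remove exactly one layer, learning only the next hop."""
    relay, _, inner = base64.b64decode(payload).partition(b"|")
    return relay.decode(), inner

route = ["guard", "middle", "exit"]          # hypothetical relay names
msg = wrap(b"GET /index.html", route)
for expected in route:
    hop, msg = peel(msg)
    assert hop == expected    # each relay sees its own name, nothing more
print(msg)                    # b'GET /index.html'
```

No single relay in the sketch ever holds both the sender's identity and the final payload, which is precisely the property that hides the IP address in the real network.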
For all these reasons, we can consider that analysis beyond the purpose of this post. You can easily find, anyway, academic papers, research projects and PhD theses dedicated to this topic. To be fully honest, many of these (CRYPTO) highlighted that hidden services, although designed for praiseworthy aims such as the protection of freedom of thought and information, especially under dictatorships and tyrannies, are at present quite often (50-60% of the total) used for illegal activities, with a clear prevalence of the totally unacceptable ones (even in liberal regimes) over the ethically acceptable or neutral ones. This, however, does not compromise our main argument, namely that staying away from the DW tout court might not be the best choice. Something that, hopefully, will become fully understandable by reading the next paragraph.
4 – Walking through a dangerous neighbourhood
So, we found a part of the web that, to some extent, can be considered a danger zone and, to some other extent, can be regarded as very interesting, mainly because it is free from many of the filters (the politically correct one first of all, but not only that) which are transforming our surfing experience into a mere trip through our "favourites" folder. This forces us to answer a question like: "What should we do with that micro-world?"
Ignoring it could, obviously, be a solution. But this looks pretty much like an escape, doesn't it? Not simply from a virtual place or from a technology, but from the risks that looking for knowledge has always implied.
Probably the really interesting element of the DW is that it obliges you to choose a personal way of exploring it. You can make a lot of different choices and you are, actually, put face to face with your real nature as a student, researcher or simple citizen. Are you ready to risk something to know or understand something? Or do you prefer safety above all, real or illusory as it may be?
Personally speaking, I found it very interesting to search the DW for content about hacking and IT security, anonymity/privacy and conspiracy theories. This, by adopting the very small number of precautions explained below, can be done with a very high level of safety.
I always started from the Hidden Wiki (HW), a sort of topic-based directory of the DW. Many consider it the natural gateway to the DW and, given its simplicity and relative richness of information, I would probably agree with them. Anyway, even in these first pages, you realise that the perspective has changed radically from the surface. There, for example, Google is Google; here, as everyone is anonymous, you will never know whether you are looking at the real or the correct Hidden Wiki… simply because it doesn't exist!
So the HW is a representation of the darknet, although fairly limited and likely optimistic… but what does it actually contain? The following table tries to classify the topics of the sites it cites (obtained from their descriptive texts), grouped by category and keywords:
A deliberately sectorial insight followed that first analysis: I chose not to view pages about topics that I found in some way disturbing (drugs, arms, paedophilia) or barely interesting, such as sex and dating.
This produced a list of a few .onion websites (that is, you need TOR or TOR2WEB to see them) that can give an image of the waves I surfed:
- Hidden Wiki – The Matrix (A possible starting point and an interesting manifesto of a significant part of the DW's vision of the world)
- TOR Search Engines: Not Evil (Claims more than 1,000,000 indexed hosts… can it be true?)
- Ground Zero (Hacker academy)
- Facebook as hidden service (Real or fake?)
- Hidden Answers (A sort of DW Yahoo Answers)
- Project Strongbox – New Yorker SecureDrop (Whistleblowers of the whole world… you're welcome!)
- ParaZite (A very rich encyclopaedia of anarchism and extreme activism)
- Imperial Library of Trantor (The Library of Alexandria of no-copyright people)
- TOR Against CP (Hackers mobilised against paedophile websites)
At this point, somebody might ask how many risks an exploration like the one I made may involve. Well, this is not a technical post, but let me make a few basic IT-security considerations which, surprisingly, almost nobody among those who stressed the dangers of DW surfing has made. It sounds simply like this: no one can steal non-existing information!
Therefore, one thing is to face the more or less dangerous waves of the DW (but are the surface ones really so calm?) with your home PC, running Windows XP with a low protection profile and, above all, with a hard disk full of personal data, images, PINs and passwords. Quite another thing, totally different, is doing the same even on the same PC but from a live operating system such as Tails (Linux-based), which in its default configuration cannot access anything other than RAM and which, in addition, has been designed to leave as few traces as possible, both locally and remotely.
To sum up, how large is the real danger linked to DW exploration? Well, it's not an actual quotation, but monsieur Jacques de La Palice would probably have said: it depends on what you do there, and how you do it.
5 – After the walk
After a walk through the DW (I haven't become a real regular, but I sometimes go back to continue my exploration), I usually come back in a good mood.
It could be a result of the simple and rather naive graphics, with few of the banners, videos or intrusive popups that are so common in plenty of today's surface sites. It could also be connected to a sort of nostalgia for the atmosphere of movies like WarGames or Sneakers, which I grew up with and which those pages somehow bring back to my memory. Or it could be, simply, a consequence of my choices, by which I limited my experience to a proper subset of a container that, like its surface counterpart, includes everything and its contrary.
So far no one has arrested me, my bank account holds the same little money it held before, and no special SWAT task force has burst into my flat. But here, perhaps, everything comes down, more than to web or crypto politics, to a basic rule I read somewhere, which sounded like this: always keep your brain switched on, between your screen and your keyboard.
Knock, knock… sorry, someone is at my door… I have to go…
The transposition to the current date (2017) of the size reported in the 2001 BrightPlanet paper is approximate and comes from this simple assessment: 4.4 EB ≈ 2 MB/page × 5×10⁹ pages × 400 (DW/SW ratio). I would very much appreciate receiving comments, remarks or better estimates. Note that BrightPlanet itself later stated: "These questions are impossible to answer now as the Internet has turned into a rapidly expanding database that can only be classified as infinite. So how big is the biggest database in existence? The answer, 'nobody knows'".