Scribd introduces copyright filter
I think that this approach is reasonable enough, but I'm skeptical that it will actually prevent the unauthorized posting of material to Scribd. However "fuzzy" the Scribd text-matching is, it's likely that determined pirates will figure out how to exceed its threshold and get around it. And it's also unlikely that Scribd's database will ever comprise a significant fraction of all copyrighted works. Finally, it's easy to imagine that pirates could have the best of both worlds by posting material to other web-hosts that don't have the text-matching in place (that is, every web-host except Scribd, from your local ISP to LiveJournal, Blogger and Wordpress) and then posting files that link to those hosts on Scribd.
The good that this will do largely revolves around people who aren't sure if the material they're posting is or isn't in copyright -- these folks will be notified that the works they're posting aren't kosher. So there's some good that comes of it.
Scribd has been unfairly targeted as a haven for pirates, a company that relies on infringement to line its pockets. The reality is that Scribd is not anywhere near the top of the list of sites that end up hosting infringing material. Any site that offers free hosting to the public will have infringing stuff on it -- for example, LiveJournal, where the Science Fiction and Fantasy Writers of America host their public conversations, has far more infringing works (photos, texts, etc) than Scribd does -- naturally, as it is much larger than Scribd. Flickr probably runs neck-and-neck with them. YouTube dwarfs all of them.
Scribd has also been accused of being obstreperous in its removals process -- again, without any basis in fact. Every hosting site has a near-identical procedure for removal of material: you fill in a DMCA "takedown notice" in which you swear that you're the rightsholder or an authorized representative, and they take the material down. This process is unfairly characterized as burdensome, even though this process allows rightsholders the power to have material removed from the Web without showing any evidence that there's any infringement going on.
Compare this with the offline world: if you believed that a bookstore was carrying an infringing edition of your book, you couldn't just walk off the street, sign an affidavit swearing that the book was really your work, and expect the bookstore owner to take it off the shelf (for starters, this would be a disaster for free speech, as every axe-grinding yahoo would be able to get books censored just by filling in false affidavits). No, you'd have to hire a lawyer, go to court, prove a case, and then the work would come down. When you unpack the claims of SFWA members who had material removed from Scribd, the complaint amounts to, "They made me swear that this was my book before they removed it."
The process for removing things from the Web isn't burdensome. It is so easy that it is frequently abused -- everyone from the Church of Scientology to Diebold have used takedown notices to silence their ideological opponents by falsely claiming to have been infringed upon.
Scribd is trying to find something that will make SFWA happy, but the members who bruit about the shibboleth that Scribd is an extraordinary bad actor are not basing their claims on the reality of the situation. Scribd is no different from any of the services that SFWA members use every day, from Google Mail to Blogger, from LiveJournal to Flickr: a commercial entity that provides a low-cost place for the public to express itself. Every one of these services is abused by infringers, and every one of them has exactly the same procedure for addressing infringement.
Indeed, Scribd has already shown its willingness to set aside the law and take extraordinary measures to make SFWA happy, as when it honored a malformed and sloppy "takedown notice" sent by SFWA Vice President Andrew Burt, one that listed dozens of works that Burt did not have the authority to represent, including several that were under Creative Commons licenses (including my own first novel, Down and Out in the Magic Kingdom).
Apologists for this claim that no real harm was done, that the ill that arose from it was that my book "was unavailable from one source for a few days." This is far from the truth: when SFWA had my work taken down, it caused many of my readers to believe that I had abandoned my commitment to free sharing of my works and to write to me accusing me of being a hypocrite, swearing never to buy my books again, and so on. Only by publishing the facts of the matter -- that I had not caused the book to be removed, that SFWA had acted against my explicit prohibition on their acting as my representative for copyright claims -- was I able to communicate to all the people who'd seen the page saying my book was offline because copying it was prohibited that I was not behind this.
The good news is that SFWA's Copyright Committee has a new chairman, Russell Davis, whose public notice on assuming the chair is very heartening and promising indeed.
See also:
Science Fiction Writers of America abuses the DMCA
Science Fiction Writers of America reinstates E-Piracy Committee -- new name, same chairman


See Scalzi's site. Burt is out as chairman of the copyrights committee.
Of course you could employ more subtle tricks, such as replacing letters with different Unicode symbols that look practically identical on screen. However, it isn't prohibitively difficult to write software that can detect that kind of manipulation. (I'm very certain here, I'm a software engineer.) Plagiarism-detecting software is becoming pretty sophisticated. It can even detect instances of plagiarism in contexts where it's possible to completely rephrase and restructure a work, such as computer source code (all the variable names can be changed arbitrarily!) and technical papers (where the exact qualities of the prose are irrelevant, the essence of the work is on a level no software can possibly understand). A story or novel, on the other hand, still has to look very much like the original; I can't think of a way of introducing changes that could throw the text-matcher off and would still be tolerable to readers.
Of course that doesn't mean it can't be done; as I said, I'm just not sure.
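To make the lookalike-character point concrete, here is a minimal sketch in Python of how a matcher might fold disguised text back to a canonical form before comparing. Nothing here is Scribd's actual method; the `LOOKALIKES` table and the `canonicalize` function are hypothetical names invented for illustration, and the table is deliberately tiny.

```python
import unicodedata

# Hypothetical, deliberately tiny table of Cyrillic letters that render
# like Latin ones. A real filter would need a far larger mapping.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
}

def canonicalize(text: str) -> str:
    """Fold compatibility forms, known lookalikes, case and spacing."""
    # NFKC collapses compatibility characters such as full-width letters.
    folded = unicodedata.normalize("NFKC", text)
    # Swap known lookalikes for their plain Latin counterparts.
    folded = "".join(LOOKALIKES.get(ch, ch) for ch in folded)
    # Ignore case and runs of whitespace.
    return " ".join(folded.casefold().split())

original = "He rang the doorbell and waited."
# Same sentence with several Latin letters swapped for Cyrillic lookalikes.
disguised = "H\u0435 r\u0430ng the d\u043e\u043erbell and waited."

print(disguised == original)                              # raw strings differ
print(canonicalize(disguised) == canonicalize(original))  # canonical forms match
```

The weakness is visible in the sketch itself: the table only covers substitutions someone has already thought of.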
The more words you add to a sentence, the more unique the construction is.
Example: Google "I like coffee" - 400k hits; "I like coffee and tea" - 6k hits.
You don't need any sort of fuzzy-matching. If two texts have more than a few sentences overlap, you can immediately flag it as a probable clone.
In terms of character sets, it's really simple to translate everything to Unicode or something similar for comparison.
For people to bypass filters, they'd have to restructure text (or predict where sampling would happen) - and that starts going from copyright to plagiarism, where you have to decide what is fair use, what is inspired, and what is copied. You wouldn't need to run plagiarism software though, because the worry isn't about people copying term papers or news articles as their own - the worry is about redistributing copyrighted works.
Audio, Video & Illustrations are hard to detect/compare; text is dead simple. Its
@Jonathan_V: No, a few sentences overlap does not indicate a probable clone. Think about sentences like "He rang the doorbell and waited.". How many unique stories can you think of off-hand that could include exactly that same sentence while not being anything remotely alike? I can easily think of two completely different novels having, somewhere in them, perhaps two dozen such absolutely identical sentences. And let's not get into the question of a review of a book where a couple of paragraphs of a 300-page novel are quoted and analyzed. That's almost certainly fair use and not a copyright violation, yet it'd trigger your proposed filter. And if that review is put up first, it can prevent the copyright holder of the reviewed work from posting his own copy since his copy would trigger a match with another copyrighted work. Scribd can't even resolve that by removing the review, since the person who wrote the review can legitimately claim copyright ownership on it and now Scribd is deliberately refusing to protect his copyrighted work.
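The disagreement here can be made concrete with a rough sketch. It assumes a filter that scores an upload by the fraction of its distinct sentences that also appear in a registered work; the function names and the sample texts are hypothetical, and the sentence splitter is intentionally crude.

```python
import re

def sentences(text: str) -> set:
    """Split on sentence-ending punctuation; fold case and whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return {" ".join(p.casefold().split()) for p in parts if p.strip()}

def containment(upload: str, registered: str) -> float:
    """Fraction of the upload's distinct sentences found in the registered work."""
    up = sentences(upload)
    if not up:
        return 0.0
    return len(up & sentences(registered)) / len(up)

registered = "He rang the doorbell and waited. Nobody came. The house stayed dark."
unrelated = ("She parked the car. He rang the doorbell and waited. "
             "A dog barked twice. Then the door opened.")
clone = "He rang the doorbell and waited. Nobody came. The house stayed dark."

print(containment(unrelated, registered))  # 0.25
print(containment(clone, registered))      # 1.0
```

A single shared stock sentence gives a low score and a verbatim copy gives 1.0, but a review that quotes a couple of paragraphs lands somewhere in between. Where to draw the line is a policy decision the arithmetic cannot make.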
Todd at #4: The filter works by building a semantic map of a copyrighted work in toto, and does not reject uploads unless a significant match is found. Reasonable fair use and quotations are allowed for in the system.
Can't the pirates just ROT13 the text, and the potential "customers" of the pirates just ROT13 their search criteria? Any such replacement scheme would work, and if you want to read the work you just download the text and reverse-ROT13 it (or whichever replacement scheme it uses).
That's fine (for the pirate) until the site owner updates the filter to apply ROT13 to files before checking for matches again. Then the pirates apply another coding scheme, which works fine until the site gets wise, and so on. Classic germ-immunity arms race, with the burden seemingly on the immunizer/site to filter for all possible schemes. OTOH, the pirate has to keep coming up with new encoding schemes, and propagate information about them to the illicit downloaders without the site getting wise to them for as long as possible.
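A small sketch of that arms race from the site's side, assuming a hypothetical filter that keeps a table of known disguises and tries each one before matching. `REGISTERED`, `DECODERS` and `matches_registered` are invented names for illustration only.

```python
import codecs

# Hypothetical registry holding one protected sentence.
REGISTERED = {"he rang the doorbell and waited."}

# One decoder per disguise the site has learned about so far.
# Every new scheme the pirates invent means another entry here.
DECODERS = {
    "plain": lambda s: s,
    "rot13": lambda s: codecs.decode(s, "rot13"),
    "reversed": lambda s: s[::-1],
}

def matches_registered(upload: str):
    """Return the name of the scheme that exposed a match, or None."""
    for name, decode in DECODERS.items():
        if decode(upload).casefold().strip() in REGISTERED:
            return name
    return None

sentence = "He rang the doorbell and waited."
print(matches_registered(codecs.encode(sentence, "rot13")))  # rot13
# A shift-by-one cipher the site has not seen yet slips through.
print(matches_registered("".join(chr(ord(c) + 1) for c in sentence)))  # None
```

The filter only ever catches schemes that are already in the table, which is the asymmetry the comment describes.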
Yes, except the pirate can put the instructions for decoding his text, in cleartext, at the top of the file.
What's more, blocking files because they contain n identical sentences gets into deep and thorny questions of fair use. Let's leave aside the classic case of a quotation that's fair use (for parody, scholarship, criticism, etc).
No, let's consider the hard case: I write a piece that (lawfully) quotes five lines from a story by Ted Chiang. I register that piece as "mine."
Ted comes along and uploads the story I quoted. Now the system says, "Sorry, that story contains two lines from a piece that Cory Doctorow controls. You may not post it."
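Here is a toy sketch of how that failure arises, assuming a naive first-come registry that records whoever registers a line first as its owner. The `NaiveRegistry` class and the sample texts are hypothetical; no real service is known to work exactly this way.

```python
class NaiveRegistry:
    """First-come registry: whoever registers a line first 'controls' it."""

    def __init__(self):
        self.owner_of_line = {}

    def register(self, owner, text):
        for line in text.splitlines():
            line = line.strip()
            if line:
                # setdefault keeps the earliest claimant for each line.
                self.owner_of_line.setdefault(line, owner)

    def blocking_owners(self, uploader, text):
        """Owners (other than the uploader) whose registered lines appear in text."""
        lines = (line.strip() for line in text.splitlines())
        return {
            self.owner_of_line[line]
            for line in lines
            if line in self.owner_of_line and self.owner_of_line[line] != uploader
        }

story = "Line one of the story.\nLine two of the story.\nLine three of the story."
essay = ("My commentary begins here.\nLine one of the story.\n"
         "Line two of the story.\nMy commentary ends here.")

registry = NaiveRegistry()
registry.register("quoter", essay)  # the quoting piece is registered first
print(registry.blocking_owners("original_author", story))  # {'quoter'}
```

The registry has no way to know that the quoted lines originated with the later uploader; it only knows who filed first.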
One semi-joking theory has postulated that AI is most likely to emerge evolutionarily from the arms race between spamware and spam filtering software. Perhaps instead it will emerge from the arms race between text and music pirates and antipirating software.
At least in this case, the burgeoning AIs might be more literate and attentive to nuances of the human experience.
I think this is a bit overblown.
Unless the folks that programmed this system are complete idiots, there will be safeguards. Back when I was doing research in latent semantic analysis and other types of essay scoring, we were asked to do copyright searches as well.
Lots of companies, like Turn It In and others that exist just for plagiarism -- not just copyright infringement -- do it. It is *MUCH* harder to look for plagiarism, because it isn't just looking for HUGE chunks of data, but for tiny fragments and seeing whether they are properly quoted.
I never dealt with the plagiarism aspect myself because of the complexities (I was focused on a number of other complexities in other areas), but simple copyright checks -- that was *VERY* simple. A competent programmer could set the bar pretty damn low and it would still block out 99.9% of the false positives. Someone that actually cared to do it right would make pretty damn sure that even this was too much (which was why we flagged these for human raters to check personally instead of dealing with anything else).
All in all, I don't think any system, human, computer, automated, verified and validated would be good enough for you, Cory. Personally, I like the CC and GPL and love the idea, but damn...some people just don't want THEIR works placed into the public domain. This should be respected at all costs. You and the EFF want nothing more.
If an author is so much of a jackass they want to sue people who love their works, so be it...for those that care, the respect level of the author will be in the community and people will deal with them accordingly. I know I have several authors that are damn fine writers that I will never buy any of their works because of their personality and morality (heh! I know folks in my industries both academic and creative that feel the same about me...I'm just happy that folks know what is expected if they deal with me one way or the other).
So why is it so hard to respect creators' rights? I understand someone not respecting your CC licensing is wrong -- but what about all the others that want their works protected?
#10 - "... at all costs."
Are you sure?
I mean, if this were a work of fiction, your next several months would be spent finding out what "at all costs" means, the hard way...
Clif@10: Are you really, truly sure about "at all costs", and that it's that easy? Let me pose you a non-hypothetical question. We have two works, A and B. A was provably written before B. B is absolutely, 100%, byte-for-byte identical to A. Is B a copyright infringement?
#12 -- that doesn't mean a damn thing. Not sure what you are trying to imply.
I'd assume there would be some meta-data along with the item denoted as the copyrighted master. As I said before, any COMPETENT programmer would expect this.
Lots of public domain works are out there. Lots of CC'd works. Hell, I got tripped up in a class that had a prof running our works through an online plagiarism filter (without the prof knowing I was on a committee that was studying these and it was clearly stated that professors / schools were not to use these until we had a better idea). Not only was my work entered into this system without my permission, it also tripped me up because it claimed that I had plagiarized from an 'online resource' -- the funny thing was, it was an article I had written, was not supposed to be online (I only assigned reprint / non-electronic rights because I had intended to publish this myself on the web with a bit more interactivity and wanted to be able to deal with my own versioning control...not allowing someone to keep an outdated version available to others).
Anyhoo...it was a bit embarrassing until I got more information and several accusations. My work was near 100% identical and it wasn't copyright infringement. I had to call the company throwing the accusations (I had briefly met the president because of my committee work) and let him know I needed ALL of my works removed from the system and when he said "FAIR USE" I asked him if he wanted to test this in court...took my lawyer sending over a notice, but it got the message across.
So, no, to answer your question that I have a feeling was supposed to be some lame attempt at a gotcha, no a 100%, byte-for-byte copy is not always copyright infringement.
And again, this is why only an idiot would automate the system without any human intervention.
Clif, I hear a lot of people theorizing that a copyright filter "with safeguards" is possible, but precious little that stands up to scrutiny.
I'd say that in order for this statement to be true, the filter would have to:
1. Not require a priori knowledge of all copyrighted works (otherwise it provides no assurance that it will stop copyright infringement)
2. Permit works that are lawful (works that quote, for example)
3. Work even when the copyrighted works are changed (re-encoded, for example, using equivalent characters in UTF-8)
4. Not simply move infringement around (Scribd is only one of millions of places where you can find the same pirated ebooks; adding the filter doesn't add it to those places; adding it to those places is intractable)
It's fine to say, "Authors should be able to choose whether or not their works are posted," but that's a little like saying, "People should be able to choose whether they get old or not, or whether gravity makes their feet hurt." It's a nice sentiment, but what, specifically, do you think should be done to make it a reality?
Remember, we're not talking about whether authors should have the right to sue (they should) or whether authors should have the right to demand removal of their works (they should).
We're talking about whether the most powerful communications tool in history should have a series of filters on it that block whole swaths of text from being posted to it -- not copyrighted texts, but works that have suspicious algorithmic similarities to copyrighted works.
Every time we add a filter to the net, we make speech a little harder, just by raising the cost of setting up a service like Scribd by establishing the norm that such a service should have filters in place.
If we're going to raise the cost of doing good, then we should have *some* appreciable effect on the practice of doing evil. Scribd's filter isn't going to prevent committed pirates from doing their thing (not least because Scribd isn't the major place for posting copies, and because the advantage will go to the attacker in a system like this).
Here's my analogy for what's gone on with SFWA and Scribd. Some SFWAns went on a tear, declaring Scribd to be a major bad actor because they didn't have imaginary filters in place and because they wouldn't remove works until you filled in the standard "Please remove this" DMCA form. These SFWAns declared Scribd to have weapons of mass destruction and demanded that they turn them over in the form of some notional and impossible system that would stop copyrighted works from showing up in the first place. Remember, this started with SFWA demanding that Scribd end piracy by adding a checkbox to its upload screen that said, "I am not a pirate."
*Arrr, mateys, we'd best be turning back, there be checkboxes ahead.*
Now that we're fighting the "War on Piracy," we're finding ourselves mired in the same kinds of feel-good security theater that the War on Terror gave us: checkboxes, ineffectual filters, etc. This is the copyright policing equivalent of taking off your shoes and surrendering your liquids (and we hear the same silly reasoning for it: "Well, it'll stop *some* pirates," and "At least they're doing *something*").
Security theater is never productive. It just validates impossible ideas like "we can prevent copying on the Internet," and delays the day that the people who espouse these ideas have to confront reality.
"We're talking about whether the most powerful communications tool in history should have a series of filters on it that block whole swaths of text from being posted to it."
Uh oh, I sense that someone out there is about to post the free market objection: "If you don't like it, go somewhere else!" Watch out, Cory! :)
15: Elsewhere than the Internet? ;)
#13: Exactly. But no, there won't be any copyright-master meta-tag. Think about it. If there were, and the filter passed anything with that tag, then wouldn't the pirates simply begin adding that tag to their pirated copies? Yes it'd be against the rules, but then if the pirates were interested in following the rules they wouldn't be pirates, now would they? And what else would a filter have to look at but the two copies of the work? It has the copyright-holder-filed work A in its database, and the newly-submitted work B it's looking at. If it can't assume that B being absolutely 100% identical to A means a copyright infringement, when could it ever flag an infringement?
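The limit being argued over can be sketched in a few lines, assuming a hypothetical filter that sees only the bytes of the registered work and of the upload. The function names and sample text are invented for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()

def filter_verdict(upload: bytes, registered: bytes) -> str:
    """All a byte-level filter can report: identical or different."""
    if fingerprint(upload) == fingerprint(registered):
        return "identical"
    return "different"

registered = b"It was a dark and stormy night."
author_copy = bytes(registered)    # the rightsholder re-posting their own work
stranger_copy = bytes(registered)  # someone else posting the same bytes

print(filter_verdict(author_copy, registered))    # identical
print(filter_verdict(stranger_copy, registered))  # identical
```

Both uploads get the same verdict, because who is posting and with what permission is not encoded in the bytes. Any answer to that question has to come from outside the comparison.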
Cory, just tossing out an idea here. Possibly a bad one, but ...
How about a requirement from places that accept uploads for massive public download while at the same time not taking any responsibility for possible (c) infringement, like Scribd and YouTube:
* They should require something off the uploader to identify him or her. IP address is of course the most immediate possibility. Credit Card number is another.
* With no such requirement, the site itself must take full responsibility for (c) infringing material.
* In return, the process of taking a work off the site for (c) infringement should be made a lot harder, along the legal lines you suggest.
The cost would be that of perfect anonymity no longer being possible. How high is that cost?
With regard to that copyright infringement checkbox notice, I feel I must post a relevant link to the second geekiest* webcomic I read:
The technology really is designed this way
Matt Skala has also written an excellent essay on the fundamental folly of DRM, entitled What Colour are your bits?
Svein: Splendid idea! I suggest you post your credit card number here, as evidence that your idea is your own and not somebody else's copyrighted material! In other words, no thanks.
* next to XKCD.
Svein@18: How high is the cost? Ask Publius, the pseudonym the Federalist Papers were published under.
But everybody knows, and knew then, that the Federalist Papers were written by Alexander Hamilton, James Madison, and John Jay. It was an open secret to start with, and the 1792 edition had their names on the front page. They weren't subversive pieces of sedition being written against a standing government, they were a work of advocacy in an open political arena.
Publius was just a pen-name in the spirit of "Nicolas Bourbaki", to convey an impression of unity to the papers, and to diffuse criticism of the individual authors that would have distracted from debate of the subject matter.