/pr/ - Programming

Using text recognition software on website archives? Neckbearded Basement Dweller 20/12/05(Sat)17:31 No. 5435
File 160718589761.jpg - (41.84KB, 640x480, 1607051047913.jpg)

Null of Kiwi Farms recently stated that he'll release an archive of the site, minus user data, as a torrent if section 230 is repealed. Does anyone have advice on automating text recognition software over that archive so I can build a database of hashes for comparison against other sites?

I'd love nothing more than for every Kiwi Farms user to be doxed and sued into oblivion.


Neckbearded Basement Dweller 20/12/06(Sun)05:30 No. 5436

What are you going to make hashes of? Writing patterns? You're going to have to develop an algorithm for that yourself, because nothing like that exists.


Neckbearded Basement Dweller 20/12/10(Thu)07:09 No. 5437

>>5436
https://en.m.wikipedia.org/wiki/Stylometry

https://www.schneier.com/blog/archives/2013/01/identifying_peo_3.html

I keep forgetting I need to spoon-feed you pureed takes in baby proportions and at room temperature to prevent tantrums.


Neckbearded Basement Dweller 20/12/12(Sat)16:38 No. 5440

>>5437
Well. Fuck you, buddy.


Neckbearded Basement Dweller 20/12/14(Mon)10:57 No. 5441

>>5440
I'm not your buddy, pal.


Neckbearded Basement Dweller 20/12/24(Thu)23:45 No. 5443

>>5435
>Null of Kiwi Farms recently stated that he'll release an archive of the site, minus user data, as a torrent if section 230 is repealed. Does anyone have advice on automating text recognition software over that archive so I can build a database of hashes for comparison against other sites?
Why would he do that? :-/

>I'd love nothing more than for every Kiwi Farms user to be doxed and sued into oblivion.
Make 7chan Great Again!


Neckbearded Basement Dweller 21/01/01(Fri)15:37 No. 5444

>>5443
>Why would he do that? :-/
He said he'd do it so there'd still be archives of their work after he shuts the site down when/if section 230 is repealed.


Neckbearded Basement Dweller 21/01/03(Sun)05:47 No. 5445

Tony Robbins also said, 'You shouldn't trade time for money, you should trade money for money', which in our economy makes him a liar and a capitalist.


Neckbearded Basement Dweller 21/01/17(Sun)11:57 No. 5447

>>5445
>Tony Robbins also said, 'You shouldn't trade time for money, you should trade money for money', which in our economy makes him a liar and a capitalist.
Typical dumb shit from someone who already has money and whose time is valuable. Most of us have no money and our time is worthless, so we trade time for money.


Neckbearded Basement Dweller 21/10/10(Sun)14:31 No. 5487

this would be very comfy


Jones 21/11/11(Thu)09:16 No. 5502

When I need any software, I ask specialists for help, since I don't have time to understand all this. If you're interested, you can read more detailed information at https://8allocate.com/dedicated-teams/. Maybe you can find help there to build the software you need. Have a nice day.


Neckbearded Basement Dweller 21/11/17(Wed)11:45 No. 5505

I like cold coffee any time of the year


Neckbearded Basement Dweller 21/12/10(Fri)10:19 No. 5523

>>5435
>Can't tell which end of the bell curve you're on, but it's not the middle
Typically, hashes are unique for any input; cryptographic hashes are designed so that it's statistically impossible to find two inputs with the same hash. That's hardly useful for identifying the same author across different texts. However, there is a concept called locality-sensitive hashing, which might or might not be useful to you.
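
To see why ordinary hashes won't group texts, hash two nearly identical strings (minimal Python sketch, the strings are just placeholders):

    import hashlib

    # One changed character yields a completely different digest, so
    # cryptographic hashes can't cluster similar texts together.
    print(hashlib.sha256(b"the quick brown fox").hexdigest())
    print(hashlib.sha256(b"the quick brown fix").hexdigest())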

What you want is a machine learning model that takes in text and spits out a lower-dimensional vector describing it (analogous to a convolutional net for images). Then you can apply locality-sensitive hashing to collapse similar vectors into the vectors you use as identities. For outliers, you could even give a % chance for each identity by calculating a simple projection onto each.
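
Here's a minimal sketch of the locality-sensitive hashing step using random hyperplanes; the dimensions are arbitrary placeholders, not recommendations:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, BITS = 128, 16                        # embedding size / hash length: made up
    planes = rng.standard_normal((BITS, DIM))  # one random hyperplane per output bit

    def lsh_bucket(vec):
        # Sign of the projection onto each hyperplane gives one bit;
        # vectors separated by a small angle tend to share a bucket.
        return tuple((planes @ vec > 0).astype(int))

    a = rng.standard_normal(DIM)
    b = a + 0.05 * rng.standard_normal(DIM)    # near-duplicate "author vector"
    print(lsh_bucket(a) == lsh_bucket(b))      # usually True for nearby vectors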

However, if you do the machine learning properly, you won't need hashing; it's just a way to calibrate the model after the fact if you fuck up.

Now, for how to create and train the model, the first step is data. Ideally, you would have a large, diverse corpus of text labelled by author. If not, this is easy enough to create by scraping the web.
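
A labelled corpus can just be (author, text) pairs; the endpoint and field names below are invented for illustration, not a real API:

    import json
    import urllib.request

    # Hypothetical forum JSON endpoint; swap in whatever site you scrape.
    url = "https://example.com/board/res/123.json"
    with urllib.request.urlopen(url) as resp:
        thread = json.load(resp)

    # Label each post body with its author to build the training set.
    corpus = [(post["author"], post["body"]) for post in thread["posts"]]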

This problem is very similar to facial identification (different from facial recognition, which would be like a program that decides whether text was produced randomly or intelligently, not who produced it) in that both are solved by transforming the data into a vector of features which can then be compared in the resulting metric space, and in that both fundamentally deal with identity.

E.g., the principal component of the facial metric space probably corresponds to gender. By comparing how white, how black, how asian, how fat, how thin, how masculine, how feminine, etc., the faces in multiple pictures are, you can tell which ones are probably of the same person. It is the same process for identifying the author of a text: the text has analogous features like syntax, verbosity, vocabulary, and tone which, taken together, can identify an author.
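
For instance, comparing toy style vectors in that metric space (the features and numbers are invented for illustration):

    import numpy as np

    def cosine(u, v):
        # Cosine similarity: 1.0 means identical direction in feature space.
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Invented features: [avg sentence length, type-token ratio, comma rate]
    text_a = np.array([18.2, 0.61, 0.09])
    text_b = np.array([17.8, 0.63, 0.08])  # stylistically close to text_a
    text_c = np.array([7.1, 0.35, 0.22])   # a very different style
    print(cosine(text_a, text_b))          # high: plausibly the same author
    print(cosine(text_a, text_c))          # lower: probably someone else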

Anyway, set up a model that has an appropriate input size for your data and your best guess at how many features you need for output (you will tweak this until you stop seeing improvement). The cost function should be the squared Euclidean distance (proportional to Euclidean, but more computationally efficient) of the output vector from the average of outputs for the same author, minus the sum of its distances from each of the other authors' averages, normalized. If it doesn't work after training completes, mess with the output size and try again. Once it's as good as it'll get, optionally apply locality-sensitive hashing for maximum effort.
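
Rough PyTorch sketch of that cost, with my own guesses for the batching and normalization (don't take the names as gospel):

    import torch

    def stylometry_loss(outputs, labels):
        # outputs: (N, D) model outputs; labels: (N,) author ids.
        authors = labels.unique()
        centroids = torch.stack([outputs[labels == a].mean(dim=0) for a in authors])
        loss = outputs.new_zeros(())
        for i, a in enumerate(authors):
            mine = outputs[labels == a]
            pull = (mine - centroids[i]).pow(2).sum(dim=1)      # toward own average
            others = torch.cat([centroids[:i], centroids[i + 1:]])
            push = torch.cdist(mine, others).pow(2).sum(dim=1)  # away from the rest
            loss = loss + (pull - push).mean()
        return loss / len(authors)                              # crude normalization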


