Brain Cancer in NeuralCrawler
Jan. 11th, 2017 06:23 am
Today Andrey and I discovered that the NeuralCrawler we created has brain cancer: out of 844,467 pages, 99.5% are useless junk from 2 sub-domains: "boystown.giftlegacy.com" and "boystowngift.org".
So far we attribute the spread of that cancer to a couple of bugs:
1) Creating extra links with every redirect (unfortunately, the problematic domains generate links with a random sessionId and then redirect from one to another, so each hop looks like a new page).
2) Not deleting a page's old links after reparsing its content.
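
To make the two bugs concrete, here is a rough sketch of how they could be addressed. It is not NeuralCrawler's actual code: `SESSION_PARAMS`, `canonical_url`, and `LinkStore` are hypothetical names invented for illustration, and the list of session parameter names is a guess. The idea is to strip session ids before storing a URL, so redirects between session-tagged links collapse into one entry, and to overwrite a page's outgoing links on every reparse instead of appending to them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these (lowercased) names carry session ids.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def canonical_url(url: str) -> str:
    """Return the URL without session-id parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


class LinkStore:
    """Hypothetical in-memory link table: page URL -> set of outgoing URLs."""

    def __init__(self):
        self.links = {}

    def replace_links(self, page_url, found_urls):
        # Overwrite rather than extend, so links from an earlier parse
        # of the same page do not linger after a reparse.
        page = canonical_url(page_url)
        self.links[page] = {canonical_url(u) for u in found_urls}
```

With this in place, two URLs that differ only in their sessionId map to the same key, and reparsing a page leaves only the links found in the latest parse. A real crawler would keep this table in a database and would probably also need to handle session ids embedded in the path, which this sketch ignores.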