depth-crawler is a three-level page-directed crawling tool.

Features

1. Use the Scrapy framework in Python to crawl down to the third-level pages, and save the content and links of the HTML pages as an xlsx table (see the spider sketch after this list)

2. Save the contents of the xlsx table to Elasticsearch

3. Use the IK analyzer for word segmentation when querying Elasticsearch

4. Use the Flask framework in Python to write the front-end pages that present the search page and the search results

5. Highlight the query results
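As a rough illustration of feature 1, the sketch below shows how a Scrapy spider can follow links across three levels and yield one row per third-level page. The start URL, the CSS selectors, and the spider name are placeholders rather than this repository's actual code; the real selectors depend on the target site.

```python
# A minimal sketch of the three-level crawl; all names and selectors are
# hypothetical and the real spider's selectors depend on the target site.
import scrapy


class DepthSpider(scrapy.Spider):
    name = "depth"
    start_urls = ["https://example.com/"]  # hypothetical entry page

    def parse(self, response):
        # Level 1: follow every link on the entry page to a second-level page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_level2)

    def parse_level2(self, response):
        # Level 2: follow every link again to reach the third-level pages.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_level3)

    def parse_level3(self, response):
        # Level 3: yield the page content and link as one row of the table.
        yield {
            "title": response.css("title::text").get(),
            "content": " ".join(response.css("body ::text").getall()),
            "link": response.url,
        }
```

Scrapy has no built-in xlsx exporter, so an item pipeline (or pandas/XlsxWriter run afterwards) would write the yielded rows into the xlsx table.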

Installation Notes

First install Python for editing the code, then install the JDK, which Elasticsearch requires; the Elasticsearch database stores and processes the crawled data. npm is the package management tool that ships with Node.js and manages Node.js plug-ins (installation, uninstallation, dependencies, etc.). elasticsearch-head is a client-side plug-in for monitoring the status of Elasticsearch, including data visualization and create, read, update, and delete operations; it is written in JavaScript and depends on Node.js.

1. Python (3.8.10) – open cmd and enter python; a version number indicates that the installation succeeded

2. JDK (1.8.0_241) – pay attention to the configuration of the environment variables; open cmd and enter java -version; a version number indicates that the installation succeeded

3. Elasticsearch (6.8.21) – find and run elasticsearch.bat; once it starts successfully, enter localhost:9200 in the browser; a page of cluster information indicates that the installation succeeded

3.1. Node.js (v16.17.0) – open cmd and enter node -v; a version number indicates that the installation succeeded

3.2. elasticsearch-head (6.8.21) – must match the Elasticsearch version (installation and basic use of the head plug-in); open a command line in the elasticsearch-head-master directory and enter grunt server

4. Extension libraries – pip install <library name>==<version number> (enter pip list in cmd to view all installed versions)

4.1. flask (2.1.2) – a framework for writing web applications using Python

Enter (pip install flask) in cmd; to pin a specific version, enter (pip install flask==2.1.2)

4.2. scrapy (2.6.1) – used to crawl website data and extract structured data

4.3. elasticsearch (7.15.2) – the Python client used to search the indexed information (see the indexing sketch after this list)

4.4. pandas (1.4.1) – for processing tabular data

4.5. openpyxl (3.0.9) – used to read and write Excel tables

4.6. XlsxWriter (3.0.3) – used to create Excel XLSX files
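As a rough sketch of how the libraries above fit together, the code below reads a crawled xlsx file with pandas/openpyxl and bulk-indexes the rows into Elasticsearch with an IK-analyzed mapping. The file name pages.xlsx, the index name pages, and the column names are assumptions rather than names taken from this repository, and the IK analyzer plug-in must be installed in Elasticsearch separately.

```python
# A minimal sketch, assuming a crawled file "pages.xlsx" with
# title/content/link columns and an index named "pages"; both names are
# hypothetical. The typed "_doc" mapping matches the 6.8 server listed above.
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

# Analyze title and content with IK so queries are segmented by the IK plug-in.
es.indices.create(
    index="pages",
    body={
        "mappings": {
            "_doc": {
                "properties": {
                    "title": {"type": "text", "analyzer": "ik_max_word"},
                    "content": {"type": "text", "analyzer": "ik_max_word"},
                    "link": {"type": "keyword"},
                }
            }
        }
    },
    ignore=400,  # ignore "index already exists" errors on re-runs
)

# Read the crawled table and bulk-index one document per row.
df = pd.read_excel("pages.xlsx", engine="openpyxl")
actions = [
    {"_index": "pages", "_type": "_doc", "_source": doc}
    for doc in df.to_dict(orient="records")
]
bulk(es, actions)
```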

Startup Process

elasticsearch:

1. Open the “bin” folder under the “elasticsearch” folder and double-click “elasticsearch.bat” to start it

2. Copy the path of the “elasticsearch-head-master” folder under the “head” folder in the “elasticsearch” folder (D:\ES\elasticsearch-6.8.21\head\elasticsearch-head-master), open cmd at that path, and enter the command (grunt server)

Viewing the Results

elasticsearch:

1. Open the browser and enter (http://localhost:9200/) to access the Elasticsearch port

2. Open the browser and enter (http://localhost:9100/) to see the information in the Elasticsearch database
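For reference, a query like the one behind the search page might look as follows; the index and field names are the hypothetical ones from the indexing sketch above, not names taken from this repository.

```python
# A minimal sketch of an IK-segmented, highlighted, paged query; "pages" and
# the field names are the hypothetical ones from the indexing sketch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="pages",
    body={
        "query": {
            "multi_match": {
                "query": "textile",              # the search-box input
                "fields": ["title", "content"],  # IK-analyzed fields
            }
        },
        "highlight": {  # wraps matched terms for the highlighting feature
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            "fields": {"title": {}, "content": {}},
        },
        "from": 0,   # paging offset
        "size": 10,  # results per page
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"]["link"], hit.get("highlight", {}))
```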

front end:

1. Run (route.py) and open the first route in the browser (http://127.0.0.1:5000/search) to see the search page

2. Enter a search term (such as: textile) in the search box; relevant results appear, with pagination at the bottom

3. Click the title or content of a result to go to the original URL

4. Click the snapshot to open the saved HTML page

5. Each result is ranked higher according to its number of clicks; open (http://127.0.0.1:5000/restore) to restore the original rankings (a sketch of these routes follows)
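A minimal sketch of how such a front end can be wired together in Flask is shown below. It is not the code of route.py: the pages index, the search.html template, and the in-memory click store are assumptions used only for illustration.

```python
# A minimal sketch, not the repository's route.py: the "pages" index, the
# "search.html" template, and the in-memory click store are all assumptions.
from elasticsearch import Elasticsearch
from flask import Flask, render_template, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")
clicks = {}  # click counts per document id, used to boost the ranking


@app.route("/search")
def search():
    query = request.args.get("q", "")
    page = int(request.args.get("page", 1))
    resp = es.search(index="pages", body={
        "query": {"multi_match": {"query": query,
                                  "fields": ["title", "content"]}},
        "highlight": {"fields": {"title": {}, "content": {}}},
        "from": (page - 1) * 10,  # pagination
        "size": 10,
    })
    # Re-rank the current page of hits by click count, most-clicked first.
    hits = sorted(resp["hits"]["hits"],
                  key=lambda hit: clicks.get(hit["_id"], 0), reverse=True)
    return render_template("search.html", hits=hits, query=query, page=page)


@app.route("/restore")
def restore():
    clicks.clear()  # drop all click-based boosts, restoring the rankings
    return "rankings restored"


if __name__ == "__main__":
    app.run()
```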
