# RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models

This repository contains the code for the RedPajama-V2 dataset. For more information on the dataset, check out our blog post. For the RedPajama-1T dataset, please refer to the rp_v1 branch in this repo.

## Dataset

RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots, processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.

## Document and Token Counts

The number of documents and tokens for the annotated and deduplicated head_middle part of the dataset is broken down over the five languages covered by the dataset: English, German, French, Italian, and Spanish.

## Setup

Make sure you have a docker and apptainer installation. To run with docker, build the docker image using the Dockerfile in this repo. Also, make sure you have s5cmd installed and your S3 profile configured so that you can pull data from an S3 bucket. Alternatively, you can run the steps of the pipeline without any containerized environment.

## Configuration

Copy the file configs/rp_v2.0.conf to e.g. configs/<your-config>.conf and configure the environment variables. These will be used throughout the pipeline.

## Running the Pipeline

The pipeline is composed of three steps, namely 1) preparing artifacts, 2) computing quality signals, and 3) deduplication.

### 1. Preparing Artifacts

This part of the pipeline creates the artifacts that are used in the subsequent steps. This includes building quality classifiers, training bag-of-ngram generative models for importance weight computation, fetching the list of bad words, and fetching the most recent list of blacklisted urls from the UT1 blacklist. As a first step, download the english wikipedia reference classifier.
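The wikipedia reference classifier is a fastText model; the sketch below shows roughly how such a model can be loaded and used to score a document. The model path `artifacts/wikiref.bin` and the label `__label__wiki` are placeholders for illustration, not the repo's actual artifact names.

```python
# Hypothetical example: scoring a document with a fastText quality
# classifier. The model path and label name below are placeholders,
# not the actual artifacts produced by this pipeline.
import fasttext

model = fasttext.load_model("artifacts/wikiref.bin")  # assumed path

def wikiref_score(text: str) -> float:
    # fastText predicts on a single line of text, so collapse newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    # Probability that the document resembles a Wikipedia reference.
    return float(probs[0]) if labels[0] == "__label__wiki" else 1.0 - float(probs[0])

print(wikiref_score("The quick brown fox jumps over the lazy dog."))
```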
## Quality Signals

Each document in the annotated part of the corpus comes with a set of quality signals, summarised in the table below. Minimal sketches showing how several of the signals can be computed follow after the table.

| Annotation Tag | Description |
|----------------|-------------|
| `rps_doc_curly_bracket` | The ratio between the number of occurrences of '{' or '}' and the number of characters in the raw text. |
| `rps_doc_frac_all_caps_words` | The fraction of words in the content that only consist of uppercase letters. |
| `rps_doc_frac_lines_end_with_ellipsis` | The fraction of lines that end with an ellipsis, where an ellipsis is defined as either "..." or "…". |
| `rps_doc_frac_no_alph_words` | The fraction of words that contain no alphabetical character. |
| `rps_doc_lorem_ipsum` | The ratio between the number of occurrences of 'lorem ipsum' and the number of characters in the content after normalisation. |
| `rps_doc_mean_word_length` | The mean length of words in the content after normalisation. |
| `rps_doc_stop_word_fraction` | The ratio between the number of stop words and the number of words in the document. Stop words are obtained from the stopwords-json repo. |
| `rps_doc_symbol_to_word_ratio` | The ratio of symbols to words in the content. |
| `rps_doc_frac_unique_words` | The fraction of unique words in the content. This is also known as the degeneracy of a text sample. Calculated based on the normalised content. |
| `rps_doc_unigram_entropy` | The entropy of the unigram distribution of the content. This measures the diversity of the content and is computed using sum(-x / total * log(x / total)), where the sum is taken over counts of unique words in the normalised content. |
| `rps_doc_word_count` | The number of words in the content after normalisation. |
| `rps_lines_ending_with_terminal_punctution_mark` | Indicates whether a line ends with a terminal punctuation mark. A terminal punctuation mark is defined as one of: ".", "!", "?", "”". |
| `rps_lines_javascript_counts` | The number of occurrences of the word "javascript" in each line. |
| `rps_lines_numerical_chars_fraction` | The ratio between the number of numerical characters and the total number of characters in each line. This is computed based on the normalised text. |
| `rps_lines_start_with_bulletpoint` | Whether the lines start with a bullet point symbol. The following set of unicode code points are considered a bullet point: \u2022 (bullet point), \u2023 (triangular bullet point), \u25B6 (black right-pointing triangle), \u25C0 (black left-pointing triangle), \u25E6 (white bullet point), \u25A0 (black square), \u25A1 (white square), \u25AA (black small square), \u25AB (white small square), \u2013 (en dash). |
| `rps_lines_uppercase_letter_fraction` | The ratio between the number of uppercase letters and the total number of characters in each line. |
| `rps_doc_num_sentences` | The number of sentences in the content. This is calculated using the regular expression `r'\b[^.!?]+[.!?]*'`. |
| `rps_doc_frac_chars_dupe_10grams` | The fraction of characters in duplicate word 10grams. |
| `rps_doc_frac_chars_dupe_5grams` | The fraction of characters in duplicate word 5grams. This operates on the lower-cased, punctuation-removed content, and it is ensured that characters in overlapping ngrams are only counted once. |
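The entropy formula for `rps_doc_unigram_entropy` can be computed directly from word counts. Below is a minimal sketch; the normalisation step (lower-casing and keeping only word characters) is an assumption about what "normalised content" means, not the pipeline's exact preprocessing.

```python
import math
import re
from collections import Counter

def unigram_entropy(content: str) -> float:
    # Assumed normalisation: lower-case and keep word characters only.
    words = re.findall(r"\w+", content.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # sum(-x / total * log(x / total)) over the counts of unique words.
    return sum(-x / total * math.log(x / total) for x in counts.values())

print(unigram_entropy("the cat sat on the mat"))  # higher for more diverse text
```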
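For `rps_doc_frac_chars_dupe_5grams` and `rps_doc_frac_chars_dupe_10grams`, the table notes that characters in overlapping ngrams must only be counted once. One way to satisfy this is to flag the word positions covered by any duplicated ngram before counting characters, as in this sketch; the normalisation and the choice to count only word characters (ignoring spaces) are assumptions.

```python
from collections import Counter

def frac_chars_dupe_ngrams(content: str, n: int = 5) -> float:
    # Assumed normalisation: lower-cased, punctuation replaced by spaces.
    words = "".join(
        c if c.isalnum() or c.isspace() else " " for c in content.lower()
    ).split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Flag every word position covered by at least one duplicated ngram,
    # so characters in overlapping ngrams are only counted once.
    duplicated = [False] * len(words)
    for i, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            for j in range(i, i + n):
                duplicated[j] = True
    dupe_chars = sum(len(w) for w, d in zip(words, duplicated) if d)
    total_chars = sum(len(w) for w in words)
    return dupe_chars / total_chars

print(frac_chars_dupe_ngrams("a b c d e a b c d e x y z"))  # ~0.77
```

Flagging positions first and summing characters afterwards is what keeps a character from being double-counted when it falls inside several duplicated ngrams.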
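The regular expression given for `rps_doc_num_sentences` can be applied directly with Python's `re` module:

```python
import re

# Count sentences as runs of non-terminal characters followed by
# optional terminal punctuation, per the regex given in the table.
SENTENCE_RE = re.compile(r"\b[^.!?]+[.!?]*")

def num_sentences(content: str) -> int:
    return len(SENTENCE_RE.findall(content))

print(num_sentences("First sentence. Second one! Is this the third?"))  # 3
```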
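Finally, the check behind `rps_lines_start_with_bulletpoint` only needs the set of code points listed in the table. A minimal sketch, assuming leading whitespace is ignored:

```python
# The bullet-point code points listed in the table above.
BULLETS = {
    "\u2022", "\u2023", "\u25B6", "\u25C0", "\u25E6",
    "\u25A0", "\u25A1", "\u25AA", "\u25AB", "\u2013",
}

def lines_start_with_bulletpoint(content: str) -> list[bool]:
    # One flag per line: True if the first non-whitespace character
    # is one of the bullet-point symbols. Stripping leading whitespace
    # first is an assumption, not necessarily the pipeline's behaviour.
    return [line.lstrip()[:1] in BULLETS for line in content.splitlines()]

print(lines_start_with_bulletpoint("\u2022 first item\nplain line"))  # [True, False]
```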